A recent survey of data scientists by CrowdFlower found that, when it comes to challenges, dirty data is the #1 hurdle…
My own experience leads me to agree. Access to good quality data remains a huge problem. Organizations would do well to invest in improving the quality of their data before boosting their analytics capabilities. Garbage in, garbage out.
One error I see made regularly is failing to appreciate that it's cheaper to fix problems upstream, at the point where the data is collected. Better tools, UX, and training can significantly improve the quality of the data entering your analysis ecosystem.
Too many organizations see cleaning as a process that occurs centrally, late in the collection pipeline. By then it’s often too late to fix the problem, and all you can do is discard the data. Even worse, you may not notice the problem and use the inaccurate data in your modeling.
Many errors can only be identified in context, and as data moves further from its origin, that context disappears. For instance, if a skate park employee enters the ages of a group of customers as 80, a simple iPad app could ask for confirmation based on statistical profiling.
However, if the same data were instead cleaned by an analyst at HQ a week later, it would be impossible to tell whether that was an error or a reunion of the 1955 Olympians. And discarding unusual but accurate data will reduce your ability to spot emerging trends or niche markets.
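To make the idea concrete, here is a minimal sketch of the kind of point-of-entry check such an app might run. It assumes Python and a hypothetical list of historical customer ages for that location; the z-score approach and the threshold value are illustrative, not a prescription for any particular product.

```python
from statistics import mean, stdev

def needs_confirmation(entered_age, historical_ages, z_threshold=3.0):
    """Flag an entered age for confirmation if it is a statistical
    outlier relative to ages previously recorded at this location.

    `historical_ages` and `z_threshold` are illustrative assumptions.
    """
    if len(historical_ages) < 30:
        # Too little history to profile reliably; accept the entry.
        return False
    mu = mean(historical_ages)
    sigma = stdev(historical_ages)
    if sigma == 0:
        return entered_age != mu
    return abs(entered_age - mu) / sigma > z_threshold

# Example: a skate park whose customers are mostly teenagers.
historical = [14, 15, 13, 16, 17, 12, 15, 14, 16, 18] * 4  # 40 past entries
print(needs_confirmation(80, historical))  # True -> app asks "Are you sure?"
print(needs_confirmation(16, historical))  # False -> entry accepted silently
```

The profile could come from your own historical records or a simple per-location baseline; the point is that the check runs while the customer is still standing at the counter, when the error is still cheap to correct.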