It’s data science, folks. It lives and dies by the quality of the data.
Google Research recently published a paper arguing that machine learning solutions are being undermined by a lack of focus on data quality. The authors note that
[…] data is the most under-valued and de-glamorised aspect of AI
and that data is
[…] viewed as ‘operational’ relative to the lionized work of building novel models and algorithms.
Ironically, data science arose from statisticians’ lack of interest in the collection and wrangling of data. Revisiting the sins of the father, I guess.
The Google researchers point to the prevalence of data cascades—upstream events that have compounding negative effects on project outcomes.
92% of the AI practitioners interviewed for the study had experienced at least one data cascade.
Four categories of data cascade were identified.
- Interacting with physical world brittleness
- Inadequate application-domain expertise
- Conflicting reward systems
- Poor cross-organisational documentation
All of these issues conspire to rock the very foundations of the models we increasingly rely on.
Data quality is hard to get right. It’s a much harder problem than model development. And, while the specific choice of model is often unimportant, the same is never true for the data that is fed into it.
One reason data quality is so hard to achieve and maintain is that it’s a process problem, often involving multiple organisations and stakeholders.
As the authors of the study lament,
Data quality carries an elevated significance in high-stakes AI due to its heightened downstream impact, impacting predictions like cancer detection, wildlife poaching, and loan allocations.
We need to stop fetishising algorithms at the expense of data. Tutorials on machine learning libraries and Python are smeared across the Internet. We need to promote and reward good data hygiene.
The consequences of continuing to undervalue data work are stark.
Garbage in, garbage out.
People find it difficult to intuitively gauge the level of correlation between variables.
Guess the Correlation is an 80s-style video game that lets you flex your estimation muscles.
Just be aware that it doesn’t seem to present negative correlations, so you’ll have to intuit those elsewhere.
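If you want to practise on negative correlations too, one option is to generate your own samples. Here is a minimal sketch, assuming standard-normal variables and using only the Python standard library. The names `correlated_sample` and `pearson_r` are my own for illustration and have nothing to do with the game itself.

```python
import math
import random


def correlated_sample(rho, n, seed=0):
    """Draw n (x, y) pairs whose population correlation is rho."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # Mixing x with independent unit-variance noise in these
        # proportions gives corr(x, y) = rho and var(y) = 1.
        y = rho * x + math.sqrt(1.0 - rho * rho) * noise
        pairs.append((x, y))
    return pairs


def pearson_r(pairs):
    """Sample Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Compare the requested correlation with what the sample delivers.
    for target in (-0.8, -0.3, 0.5):
        sample = correlated_sample(target, n=2000, seed=42)
        print(f"target {target:+.2f}  sample r {pearson_r(sample):+.3f}")
```

Feed the pairs to whatever plotting tool you like, guess the correlation from the scatter, and then check your guess against `pearson_r`. The sample value will wobble around the target, more so for small `n`.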
Why is it so hard to find effective data science courses? For instance, courses that cover the practical work involved in going from problem to solution?
This is a question I was asked last week. To answer it, we can employ basic mathematics.
Draw a Venn diagram. Include the following sets.
- people who are good at statistics
- people who are good at coding
- people who have good social skills
- people who have good teaching skills
- people who have good technical writing skills
- people who couldn’t be making a shitload more money doing something else
Count the number of people in the intersection of those sets. That’s why.
April is Mathematics & Statistics Awareness Month.
Let’s celebrate it by making sure we embrace statistics in our data science projects. It’s not just about Python, folks!