It’s data science, folks. It lives and dies by the quality of the data.
Google Research recently published a paper where they argue that machine learning solutions are being undermined by a lack of focus on data quality issues. They note that
[…] data is the most under-valued and de-glamorised aspect of AI
and that data is
[…] viewed as ‘operational’ relative to the lionized work of building novel models and algorithms.
Ironically, data science arose from statisticians’ lack of interest in the collection and wrangling of data. Revisiting the sins of the father, I guess.
The Google researchers point to the prevalence of data cascades—upstream events that have compounding negative effects on project outcomes.
92% of AI researchers interviewed for the study had suffered from a data cascade.
Four categories of data cascade were identified.
- Interacting with physical world brittleness
- Inadequate application-domain expertise
- Conflicting reward systems
- Poor cross-organisational documentation
All of these issues conspire to rock the very foundations of the models we increasingly rely on.
Data quality is hard to get right. It’s a much harder problem than model development. And, while the specific choice of model is often unimportant, the same is never true for the data that is fed into it.
One reason data quality is so hard to achieve and maintain is that it’s a process problem—often involving multiple organisations and stakeholders.
As the authors of the study lament,
Data quality carries an elevated significance in high-stakes AI due to its heightened downstream impact, impacting predictions like cancer detection, wildlife poaching, and loan allocations.
We need to stop fetishising algorithms at the expense of data. Tutorials on machine learning libraries and Python are smeared across the Internet. We need to promote and reward good data hygiene.
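What does basic data hygiene look like in practice? As a minimal, illustrative sketch (the function and field names are hypothetical, not from the paper), even something as simple as profiling missing values before any modelling catches problems that would otherwise cascade downstream:

```python
# A minimal data-hygiene check: profile missing values per column
# before any model ever sees the data. Names here are illustrative.

def missing_rates(rows):
    """Return the fraction of missing (None) values for each column."""
    columns = {key for row in rows for key in row}
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            if row.get(col) is None:
                counts[col] += 1
    return {col: counts[col] / len(rows) for col in columns}

# Toy records standing in for a real dataset.
records = [
    {"age": 34, "income": 52_000},
    {"age": None, "income": 48_000},
    {"age": 29, "income": None},
    {"age": 41, "income": 61_000},
]

print(missing_rates(records))  # {'age': 0.25, 'income': 0.25}
```

Checks like this are trivial to write, yet they are exactly the unglamorous, “operational” work the study argues goes unrewarded.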
The consequences of continuing to undervalue data work are stark.
Garbage in, garbage out.