Data science and statistics

Prolific R developer Hadley Wickham provided an interesting perspective on data science and statistics in a recent Priceonomics article.

“There are definitely some academic statisticians who just don’t understand why what I do is statistics, but basically I think they are all wrong. What I do is fundamentally statistics. The fact that data science exists as a field is a colossal failure of statistics. To me, that is what statistics is all about. It is gaining insight from data using modelling and visualization. Data munging and manipulation is hard and statistics has just said that’s not our domain.”

This insight is at the heart of why the only way to get good at data science is to do it. Obtaining and preparing data prior to analysis is the bulk of a data scientist’s work. But it’s not a simple concept that you can tie up with a nice neat bow. It’s a messy, convoluted process involving:

trial and error
multiple, incompatible tools
missing information
organizational silos
quality issues
etc

It’s very difficult to cover this kind of work in a book chapter or a traditional lecture. It’s like trying to teach automotive maintenance without putting on overalls: it all makes perfect sense until you attempt to change the pistons.
