Prolific R developer Hadley Wickham provided an interesting perspective on data science and statistics in a recent Priceonomics article.
“There are definitely some academic statisticians who just don’t understand why what I do is statistics, but basically I think they are all wrong. What I do is fundamentally statistics. The fact that data science exists as a field is a colossal failure of statistics. To me, that is what statistics is all about. It is gaining insight from data using modelling and visualization. Data munging and manipulation is hard and statistics has just said that’s not our domain.”
This insight is at the heart of why the only way to get good at data science is to do it. Obtaining and preparing data before analysis is the bulk of a data scientist’s work. But it’s not a simple concept that you can tie up with a nice neat bow. It’s a messy, convoluted process involving:
- trial and error
- multiple, incompatible tools
- missing information
- organizational silos
- quality issues
- etc.
It’s very difficult to cover this kind of stuff in a book chapter or a traditional lecture. It’s like trying to teach automotive maintenance without putting on overalls: it all makes perfect sense until you attempt to change the pistons.