A recent article in Nature reinforces the point that the real challenge in data science is not mastery of the technical tools, but the ability to understand and define the problem.
Researchers posed the question of whether the color of a soccer player’s skin is a factor in how many red cards (the most serious sanction a referee can give) he receives. It seems like a pretty straightforward analysis.
The authors sent the same data set to 29 research teams and asked them to answer the question. Twenty teams found there was a difference; nine said there was no relationship between skin color and the number of red cards received. Of the twenty that found a difference, two reported that dark-skinned players were less likely to receive red cards.
So what explains this wide variation in results? Bias? Incompetence? No. It comes down to legitimate analytical choices: which variables matter, how the data is prepared, which statistical methods are applied, and so on.
If you have the resources, the best approach seems to be to crowd-source your answers: have multiple researchers do the analysis, compare and contrast the results, share the insights across the teams, redo the analyses, and then accept the majority answer. Of course, if you can only afford to employ one team, you need to be aware that data science isn’t an exact science…
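To make that concrete, here is a minimal sketch using entirely synthetic data and made-up model specifications (not the study’s actual data set or any of the 29 teams’ real analyses): three hypothetical “teams” fit different but defensible models to the same data, they may come to different verdicts about whether an effect exists, and a simple majority vote aggregates their answers.

```python
# Toy illustration only: synthetic data and hypothetical model specifications,
# not the Nature study's data set or the real teams' analyses.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 2000

# Synthetic player-level data: a skin-tone rating plus two plausible confounders.
df = pd.DataFrame({
    "skin_tone": rng.uniform(0, 1, n),           # 0 = light, 1 = dark (made-up scale)
    "position_defender": rng.integers(0, 2, n),  # defenders tend to pick up more cards
    "games_played": rng.integers(20, 300, n),    # exposure
})

# Red-card counts driven mostly by exposure and position, with a small skin-tone effect.
rate = np.exp(-4.0 + 0.004 * df["games_played"]
              + 0.5 * df["position_defender"] + 0.2 * df["skin_tone"])
df["red_cards"] = rng.poisson(rate)

# Each "team" chooses a different, defensible specification of the same question.
specs = {
    "team_A": "red_cards ~ skin_tone",
    "team_B": "red_cards ~ skin_tone + position_defender",
    "team_C": "red_cards ~ skin_tone + position_defender + games_played",
}

verdicts = {}
for team, formula in specs.items():
    fit = smf.glm(formula, data=df, family=sm.families.Poisson()).fit()
    pval = fit.pvalues["skin_tone"]
    verdicts[team] = "effect found" if pval < 0.05 else "no effect"
    print(f"{team}: coef={fit.params['skin_tone']:+.3f}, p={pval:.3f} -> {verdicts[team]}")

# Crowd-sourced aggregation: accept the majority verdict across teams.
majority = max(set(verdicts.values()), key=list(verdicts.values()).count)
print("majority verdict:", majority)
```

The point is not these particular models; it is that each specification is a reasonable answer to the same question, which is exactly why equally competent teams can diverge and why comparing several independent analyses is worth the cost when you can afford it.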