A 2012 study of National Health Service data in the UK found that there were:
- over 17,000 male inpatient admissions to obstetric services
- over 8,000 male inpatient admissions to gynecology
- nearly 20,000 male inpatient admissions to midwifery
Before jumping to the conclusion that the UK is at the forefront of an exciting/disturbing evolutionary trend, we should probably look for a simpler explanation: data coding errors.
Each procedure has an associated code, and data entry errors resulted in men being assigned to female-only procedures. Obviously there are going to be all sorts of other errors, but it’s more difficult (and considerably less hilarious) to determine whether a patient actually had an ear infection when recorded as having a back problem.
This illustrates one of the biggest challenges for data science: garbage in, garbage out. You can have the most sophisticated analysis algorithms available, but if you are analyzing the wrong thing, you’ll draw the wrong conclusions.
Clearly it’s possible to perform statistical checks on the data after the fact, as the 2012 study illustrates. However, knowing that the data is wrong does little more than render it worthless. The data could have, and should have, been checked at the point of entry. Simple logic in the data entry software could have checked the gender of the patient against a list of gender-specific codes and prevented the incorrect data from entering the system in the first place.
It’s always more efficient to fix data upstream.
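To make the point-of-entry check concrete, here is a minimal sketch in Python. The codes (`OBS`, `GYN`, `MID`), the `M`/`F` values and the function name are all made up for illustration; a real system would load its gender-specific codes from whatever coding standard it actually uses.

```python
from typing import Optional

# Hypothetical lookup: code -> the only recorded sex the code is valid for.
# These codes are invented for this sketch, not taken from any real standard.
SEX_SPECIFIC_CODES = {
    "OBS": "F",  # obstetrics
    "GYN": "F",  # gynecology
    "MID": "F",  # midwifery
}


def check_sex_against_code(patient_sex: str, code: str) -> Optional[str]:
    """Return an error message if the code conflicts with the recorded sex.

    Returns None when the code is not sex-specific or when it matches.
    """
    required = SEX_SPECIFIC_CODES.get(code)
    if required is not None and patient_sex != required:
        return (
            f"Code {code} is only valid for patients recorded as {required}, "
            f"but this patient is recorded as {patient_sex}."
        )
    return None


if __name__ == "__main__":
    # The entry screen would refuse to save until the conflict is resolved.
    print(check_sex_against_code("M", "OBS"))
```

The check does not know which field is wrong, the code or the recorded sex. It only forces the person at the keyboard to look again, which is the cheapest moment to fix it.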
Of course, gender checks would catch only some of the errors. Other techniques would be required for gender-neutral procedures. As I don’t know the data well enough, it’s difficult to come up with specific recommendations. However, some ideas might include the following.
- Display the description of the code when it’s entered. If you are in a general practitioner’s clinic and “neurosurgery” pops up on the screen, you might catch that.
- Allow a range of valid codes to be configured on a per terminal basis. If I’m in a gynecology department, assigning a flu code might be suspicious.
- Learn what codes are entered at a given location/terminal. If I’ve never entered a dialysis procedure before, ask me to confirm it.
- Alert me to very rare codes. What are the chances that I really have a patient with rabies in the UK?
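The ideas in the list above could be combined into a single entry-time checker. What follows is a rough sketch under stated assumptions, not a description of any real NHS system: the class name, the terminal names and the codes are hypothetical, and the "learning" is nothing more than a per-terminal count of codes entered so far.

```python
from collections import Counter, defaultdict


class CodeEntryChecker:
    """Collects soft warnings for a code entered at a given terminal."""

    def __init__(self, descriptions, allowed_by_terminal, rare_codes):
        self.descriptions = descriptions                # code -> description
        self.allowed_by_terminal = allowed_by_terminal  # terminal -> set of codes
        self.rare_codes = rare_codes                    # set of very rare codes
        self.history = defaultdict(Counter)             # terminal -> code counts

    def describe(self, code):
        # Shown on screen as soon as the code is typed.
        return self.descriptions.get(code, "unknown code")

    def warnings(self, terminal, code):
        found = []
        allowed = self.allowed_by_terminal.get(terminal)
        # Terminals without a configured range skip this check.
        if allowed is not None and code not in allowed:
            found.append("not in the range configured for this terminal")
        if self.history[terminal][code] == 0:
            found.append("never entered at this terminal before")
        if code in self.rare_codes:
            found.append("very rare code")
        return found

    def record(self, terminal, code):
        # Call once the user has confirmed the entry.
        self.history[terminal][code] += 1


if __name__ == "__main__":
    checker = CodeEntryChecker(
        descriptions={"FLU": "influenza", "RAB": "rabies", "DIA": "dialysis"},
        allowed_by_terminal={"gyn-desk-1": {"GYN", "OBS"}},
        rare_codes={"RAB"},
    )
    print(checker.describe("RAB"))
    print(checker.warnings("gyn-desk-1", "FLU"))
```

None of these warnings blocks the entry outright. They ask for confirmation, since a gynecology patient can still have the flu and somebody, somewhere, will eventually have rabies.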
One of the cheapest things we can do to improve the quality of data analysis is to improve the quality of data entry. Making basic checks in data entry systems is very, very far from rocket science.