This post is an expansion of something I wrote on Twitter:
Data scientists often complain that the bulk of their work is data cleaning.
But if you see data cleaning as the work, not just an obstacle to the work, it can be interesting.
You could think of it as data pathology, a kind of analysis before the intended analysis.
Anything you have to do before you can get down to what you thought you would be doing can be irritating. And if you bid a project not anticipating preliminary difficulties, it can be costly as well.
But if you anticipate problems with dirty data, these problems can be fun to solve. They may offer more variety than the “real work.” These problems bring up some interesting questions.
- How are the data actually represented and why?
- Can you automate the cleaning process, or at least automate some of it?
- Is the corruption in the data informative, i.e. does it reveal something in addition to the data itself?
Courses in mathematical statistics will not prepare you to answer any of these questions. Beginning data scientists may find this sort of forensic work irritating because they aren’t prepared for it. Maybe it’s a job for regular expressions rather than regression.
Or maybe they can do it well but would rather get on to something else.
Or maybe this sort of data pathology work would be more interesting with a change of attitude, seeing it as important work that requires experience and provides opportunities for creativity.
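To make the "regular expressions rather than regression" remark concrete, here is a minimal sketch in Python. The raw values, the `-999` sentinel, and the names `clean`, `TEMP`, and `SENTINEL` are all invented for illustration and are not taken from any real data set. The sketch automates part of the cleaning and also keeps a tally of why values were rejected, since the pattern of corruption may itself be informative.

```python
import re

# Hypothetical raw temperature readings, invented for illustration.
RAW = ["42.5 C", " 41,9C", "n/a", "43.1 c", "ERR:timeout", "-999"]

# A number with "." or "," as the decimal mark, optionally followed by C or c.
TEMP = re.compile(r"^\s*(-?\d+(?:[.,]\d+)?)\s*[cC]?\s*$")

# Assumed placeholder that some upstream system used to mean "missing".
SENTINEL = -999.0


def clean(raw):
    """Return (value or None, reason). The reason says why a value was rejected."""
    m = TEMP.match(raw)
    if m is None:
        return None, "unparseable"
    value = float(m.group(1).replace(",", "."))
    if value == SENTINEL:
        return None, "sentinel"
    return value, "ok"


results = [clean(r) for r in RAW]

# The cleaned values, ready for the intended analysis.
values = [v for v, _ in results if v is not None]

# A count of each outcome: the "pathology report" on the dirty data.
reasons = {}
for _, reason in results:
    reasons[reason] = reasons.get(reason, 0) + 1
```

Keeping the rejection reasons rather than silently dropping bad rows is a design choice: a sudden rise in one category may point to a problem at the source rather than in the analysis.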
5 thoughts on “Data pathology”
i have long been skeptical of those who claim to love data science and hate data cleaning. i do not consider the time spent with the data at the lowest level and finest granularity wasted. the effort is typically rewarded with much better understanding by the time it comes to aggregating, analysing, and modelling.
I’m in the middle of a project to thermally characterize a system containing 28 SoCs (about half are IMX6), 27 running Yocto Linux and one eCos, each of which has a die temperature sensor and additional sensors (mainly temperature, but also some voltages and currents) connected via I2C, for a grand total over 200 individual sensors within a microwave-sized chassis.
All that hardware is just the thinnest possible skin to support an enormous software base which just shifted from Awful Alpha to Bleeding Beta. Which I am not allowed to touch, as I need to measure what a customer would experience, meaning I must do my data acquisition by SSH/telnet to the chassis and then to each of the embedded processors.
All but the eCos processor run BusyBox, so I have lots of common GNU-like utilities available. But the highest-level language available is the BusyBox Bash-like shell. So I wrote a giant script that copies itself to all SoCs, then runs SoC-specific suites to harvest and report the local sensors.
The resulting text output looks like a story written by a first grader. It has gross overall structure, but the output is whatever the local utilities provided, with minimal processing to ensure fast acquisition. Which isn’t all that fast: It takes about 50 seconds. Per iteration.
I used OpenRefine to massage the messy text to a CSV file. Gotta love OpenRefine: Process the first iteration, and the other thousand are also processed. I’m using JupyterLab and Pandas to crunch the data, but the learning curve has been steeper than expected, and progress has been steady but slow.
Still, each phase, from figuring out how to get the data, to writing the script, to doing environmental experiments in a thermal chamber to generate the data, then on to massaging it with OpenRefine, then to Jupyter and Pandas, has been a blast.
What I like most about Jupyter + Pandas is how easy it is to get into “flow”, effortlessly shifting between code, interactive plots, and documentation. I’ve even had fun documenting my dead-ends, mainly as Warnings To Others (seemingly my purpose in life these days). The biggest Warning is to not have the script emit plain text: 20-20 hindsight says some simple JSON would be vastly better.
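As a rough illustration of the "simple JSON" idea, the sketch below writes one JSON object per sensor reading, one per line, and reads the lines back. It is in Python only to show the format and the reading side; the script described above runs in a BusyBox shell, which would have to print equivalent lines itself. The field names (`host`, `sensor`, `value`, `unit`) and the sample readings are hypothetical.

```python
import json


def emit(host, sensor, value, unit):
    """Format one reading as a single line of JSON (field names are assumed)."""
    return json.dumps({"host": host, "sensor": sensor, "value": value, "unit": unit})


def parse(lines):
    """Turn lines of JSON back into a list of dicts, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# Hypothetical output from one acquisition pass.
lines = [
    emit("soc07", "die_temp", 61.5, "C"),
    "",  # a stray blank line, as often happens in captured output
    emit("soc07", "vdd_core", 1.1, "V"),
]

records = parse(lines)
```

A list of dicts like `records` can be passed directly to `pandas.DataFrame`, so with this format the text-massaging step largely disappears.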
Dirty data can only be prevented.
If you have to clean data after the fact, you will never get the right answers, because your notion of what is “dirty” is conditioned by the outcome that you want to get.
Information systems (unless they are simply broken) prevent the creation of syntactically dirty data — the more disruptively, the better. The general approach to preventing the creation of semantically dirty data is type theory. It is not mature enough to be automated and to give complete coverage, but it can support the activities of the humans upon whom the responsibility falls.
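One small-scale reading of this argument, sketched in Python: give each kind of measurement its own type, and refuse to construct a value that is not well formed. The classes `Celsius` and `Volts` and the function `average_temperature` are hypothetical names made up for this example, and the checks run at runtime, which is far weaker than what type theory aims for.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Celsius:
    value: float

    def __post_init__(self):
        # Reject syntactically dirty input at creation time, loudly.
        if isinstance(self.value, bool) or not isinstance(self.value, (int, float)):
            raise TypeError(f"temperature must be a number, got {self.value!r}")
        if math.isnan(self.value) or self.value < -273.15:
            raise ValueError(f"impossible temperature: {self.value!r}")


@dataclass(frozen=True)
class Volts:
    value: float


def average_temperature(readings):
    """Average a non-empty list of Celsius values; any other type is an error."""
    if not readings:
        raise ValueError("no readings")
    for r in readings:
        if not isinstance(r, Celsius):
            raise TypeError(f"expected Celsius, got {type(r).__name__}")
    return sum(r.value for r in readings) / len(readings)
```

With this in place, `Celsius("61.5")` and `average_temperature([Celsius(40.0), Volts(1.1)])` both raise `TypeError` instead of letting a string or a voltage slip into a temperature average.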
One additional reason data cleaning and preparation take a lot of effort is that the data is prepared to *make the analysis as easy as possible.*
Given the choice between less data prep / hard analysis or more data prep / easier analysis, most choose more data prep!
Data quality is a great, underappreciated field.
You don’t know what you don’t know.