This post is an expansion of something I wrote on Twitter:
Data scientists often complain that the bulk of their work is data cleaning.
But if you see data cleaning as the work, not just an obstacle to the work, it can be interesting.
You could think of it as data pathology, a kind of analysis before the intended analysis.
Anything you have to do before you can get down to what you thought you would be doing can be irritating. And if you bid on a project without anticipating preliminary difficulties, it can be costly as well.
But if you anticipate problems with dirty data, these problems can be fun to solve. They may offer more variety than the “real work.” These problems bring up some interesting questions.
- How are the data actually represented and why?
- Can you automate the cleaning process, or at least automate some of it?
- Is the corruption in the data informative, i.e., does it reveal something in addition to the data itself?
Courses in mathematical statistics will not prepare you to answer any of these questions. Beginning data scientists may find this sort of forensic work irritating because they aren’t prepared for it. Maybe it’s a job for regular expressions rather than regression.
Or maybe they can do it well but would rather get on to something else.
Or maybe this sort of data pathology work would be more interesting with a change of attitude, seeing it as important work that requires experience and provides opportunities for creativity.
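To make the questions about automation and informative corruption a little more concrete, here is a minimal sketch in Python. It assumes a column of numbers that arrived as inconsistently formatted strings. The patterns, labels, and function names are hypothetical, chosen only for illustration; real data would need its own. The point is that the cleaning step can also tally which kind of mess each value had, and that tally is sometimes a clue about where the data came from.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration only; real data will need its own.
_PATTERNS = [
    ("plain", re.compile(r"^-?\d+(\.\d+)?$")),
    ("thousands_commas", re.compile(r"^-?\d{1,3}(,\d{3})+(\.\d+)?$")),
    ("currency", re.compile(r"^\$\s*-?\d+(\.\d+)?$")),
]

def clean_number(raw):
    """Return (value, label): the parsed float or None, plus the pattern that matched."""
    text = raw.strip()
    for label, pattern in _PATTERNS:
        if pattern.match(text):
            # Drop commas, dollar signs, and whitespace before converting.
            digits = re.sub(r"[,$\s]", "", text)
            return float(digits), label
    # Anything else is left for a human to look at.
    return None, "unrecognized"

def clean_column(values):
    """Clean each string and count how often each kind of formatting appeared."""
    cleaned = []
    tally = Counter()
    for raw in values:
        value, label = clean_number(raw)
        cleaned.append(value)
        tally[label] += 1
    return cleaned, tally
```

Calling `clean_column(["42", "1,000", "n/a", "$5"])` gives back the cleaned values along with a count of each formatting variety. If, say, every currency-formatted value came from the same source, that would be worth knowing before the intended analysis begins.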