Looking at Your Data

What to do first after scoping out and starting a data science project?

I’ve started an unsupervised learning project based on textual data. The first thing I like to do is actually look at the data. Is it noisy? What are the features, and is complex feature engineering needed? How heterogeneous is it? What generalization and overfitting challenges might arise?

Analysis can take many forms: actually looking at the numbers, using visualization tools, an Excel spreadsheet, a Jupyter notebook with Matplotlib, or computing various statistics on the whole dataset or portions of it.
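Even a few lines in a notebook go a long way. The sketch below is purely illustrative, assuming the corpus sits one document per line in a plain-text file; `corpus.txt` and the variable names are placeholders, not details of my actual project.

```python
# A minimal first-pass inspection sketch, assuming one document per line
# in a plain-text file; the path and names are illustrative placeholders.
from collections import Counter
import matplotlib.pyplot as plt

with open("corpus.txt", encoding="utf-8") as f:
    docs = [line.strip() for line in f if line.strip()]

# Simple whole-dataset statistics: size, document lengths, rough vocabulary.
lengths = [len(d.split()) for d in docs]
vocab = Counter(tok for d in docs for tok in d.lower().split())
print(f"documents: {len(docs)}")
print(f"tokens per doc: min={min(lengths)}, max={max(lengths)}, "
      f"mean={sum(lengths) / len(lengths):.1f}")
print(f"vocabulary size: {len(vocab)}")
print("most common tokens:", vocab.most_common(10))

# A quick Matplotlib look at how heterogeneous the document lengths are.
plt.hist(lengths, bins=50)
plt.xlabel("tokens per document")
plt.ylabel("count")
plt.title("Document length distribution")
plt.show()
```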

Some may believe this is not important: just throw a barrage of classification or regression methods at the data and treat it as a black box. Of course, testing a suite of ML methods is not a bad thing. But I can’t imagine not using every avenue available, including looking at the data. I’m certainly not alone in this view (see for example here, here, and here).

I spent a few hours developing a simple custom data viewer for my problem that colored different parts of the textual data to give insight into what was going on. I used ChatGPT to develop parts of this tool; some of it was incorrect and needed fixing, but having at least a draft of the code definitely saved time. Seeing the actual data first-hand was insightful and generated ideas for solving the problem.
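My viewer was specific to my data set, so the snippet below is only a hypothetical sketch of the general idea: print each token in an ANSI color chosen by a coarse category (numeric, all-caps, everything else), which is often enough to make structure jump out of otherwise uniform-looking text. The categories and example documents are made up for illustration.

```python
# A hypothetical sketch of a tiny colored-text viewer, not the actual tool
# from the post: each token is printed in an ANSI color chosen by a coarse,
# made-up category (numeric, all-caps, other).
import re

RESET = "\033[0m"
COLORS = {
    "number": "\033[93m",   # yellow
    "allcaps": "\033[96m",  # cyan
    "other": "\033[0m",     # default terminal color
}

def categorize(token: str) -> str:
    """Assign a coarse, illustrative category to a token."""
    if re.fullmatch(r"\d[\d.,]*", token):
        return "number"
    if token.isupper() and len(token) > 1:
        return "allcaps"
    return "other"

def show(doc: str) -> None:
    """Print one document with each token colored by its category."""
    colored = [COLORS[categorize(tok)] + tok + RESET for tok in doc.split()]
    print(" ".join(colored))

if __name__ == "__main__":
    # Illustrative documents; in practice you would loop over your own corpus.
    for doc in ["Invoice 2024 TOTAL 1,284.50 USD", "meeting notes draft v2"]:
        show(doc)
```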

While inspecting the data can help identify issues, it also risks biasing the modeling process by imposing assumptions that a flexible model might otherwise uncover on its own. One must also beware of data leakage. That being said, in general I think understanding as much as you can about the data is not a bad thing.

3 thoughts on “Looking at Your Data”

  1. In the context of the statistical significance testing controversy of the past ~10-15 years (and the downstream “replication crisis”), this suggestion can actually be quite schizoid.

    It ostensibly seems obvious to look at the data, to make sure whatever one will do with it makes sense. But the fact that the work done on the data depends on looking at it is a big part of what dooms a significance test, at least in a strict, rigorous sense.

    In these cases, strongly delineating the scope of possible work before looking at the data goes a long way. And not all statistical contexts have this issue, of course.

    I am not aware of any other scientific field or context where pre-looking at the data is as problematic, though I may have missed some. Your exposure to the quant fields may be broader than mine, so I’d be interested to hear of other cases.

  2. Thanks for your comment. I think the difference is that, for my project, I’m only concerned with the accuracy of approximating the clusters, with no interest in drawing any statistical conclusions. The approximation error of the clusters only influences the hit/miss rate of the method in production, and therefore the failure rate, which dictates how often the more expensive failover method must be used. Agreed that for drawing supportable statistical conclusions, close attention must be paid to the methodology issues, and yes, replicability is certainly in crisis.

  3. Derrick Roberts

    I can’t begin to tell you how refreshing it was to read this post. As a data scientist, I spend considerable time and effort on visualization, as it is my favorite part of the process. I don’t often see examples of others who take the time to tailor the code for a given data set to create a powerful visualization. I’m not sure why – not only is it useful and satisfying, but also beautiful.
