I’ve resisted using the term “data science,” and enjoy poking fun at it now and then, but I’ve decided it’s not such a bad label after all.
Here are some of the pros and cons of the term. (Listing “cons” first seems backward, but I’m currently leaning toward the pro side, so I thought I should conclude with it.)
The term “data scientist” is sometimes used to imply more novelty than is there. There’s not a great deal of difference between data science and statistics, though the new term is more fashionable. (Someone quipped that data science is statistics on a Mac.)
Similarly, the term data scientist is sometimes used as an excuse for ignorance, as in “I don’t understand probability and all that stuff, but I don’t need to because I’m a data scientist, not a statistician.”
The big deal about data science isn’t data but the science of drawing inferences from the data. Inference science would be a better term, in my opinion, but that term hasn’t taken off.
Data science could be a useful umbrella term for statistics, machine learning, decision theory, etc. Also, the title data scientist is rightfully associated with people who have better computational skills than statisticians typically have.
While the term data science isn’t perfect, there’s little to recommend the term statistics other than that it is well established. The root of statistics is state, as in a government. This is because statistics was first applied to the concerns of bureaucracies. The term statistics would be equivalent to governmentistics, a historically accurate but otherwise useless term.
9 thoughts on “Pros and cons of the term “data science””
I agree that “statistics” is not a helpful term. “Inference science” would really be a better name for what’s now called traditional statistics. That could help distinguish it from (and highlight its added value beyond) “estimation science,” and then both could be sub-cases of “data science.”
A lot of applied data science or applied ML (though certainly not all!) seems to omit inference in the statistics sense (summarize the precision/quality/uncertainty of your estimates) and just do estimation alone: “Look, I estimated a complicated model on a large dataset,” but with little inference beyond the occasional cross-validation. That’s fine in many cases, but inference is often useful too—it just seems to be less sexy for some reason.
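To make the estimation/inference distinction concrete, here is a minimal sketch in Python. The sample values and the function name `estimate_with_uncertainty` are made up for illustration, and the interval assumes a normal approximation (z = 1.96), which is crude for a sample this small. The point is only that the first number is the estimate, and the interval around it is the inference.

```python
import math
import statistics


def estimate_with_uncertainty(sample, z=1.96):
    """Return (mean, standard error, (ci_low, ci_high)) for a sample.

    Assumption: a normal-approximation interval; z=1.96 gives roughly 95%.
    """
    n = len(sample)
    mean = statistics.fmean(sample)                # the estimate
    se = statistics.stdev(sample) / math.sqrt(n)   # precision of the estimate
    return mean, se, (mean - z * se, mean + z * se)


# Hypothetical measurements, invented for this example.
sample = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 4.9]
mean, se, (low, high) = estimate_with_uncertainty(sample)

print(f"estimate alone: {mean:.2f}")
print(f"with inference: {mean:.2f}, interval {low:.2f} to {high:.2f} (SE {se:.2f})")
```

Reporting only the first line is "estimation alone" in the sense above; the second line adds the statement about uncertainty.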
Data science is 10% inference science, 10% estimation science, and 80% the science of commas and quotes.
There’s a major flip side to my comment above: When traditional statisticians do publish inferential details alongside the estimates, it’s often either horribly misinterpreted (“The p-value is above 0.05 which proves there is no effect!”) or completely ignored (“Why does the Census Bureau include all these MOE columns in the dataset? I just end up deleting them anyway…”) We have a lot of room for improvement too :)
Data science may also cover data management, i.e. what the database community does, whose boundary with the data mining community has been blurring over the past 10 years.
Interesting. Not the pros and cons I had thought of. To me they are:
Con: The term is misleading. Every scientist is a “data scientist”, in that science itself implies working with data.
Pro: Attaching the term “scientist” to your title encourages scientific rigor in your work, which can’t be a bad thing.
Of course, “statistics” originally referred to the data, not the person, and a “statistician” was someone who compiled data. (It still is, in baseball.) The thing that people now want to call “data science” was “statistical inference” — drawing conclusions from data — which comes full circle. Blame abbreviation here, not misattribution.
I feel the same way about the term “data scientist”, and also about “big data”. In Web development, everyone is now self-labeling as a “full-stack developer”. It makes me want to distance myself from all these terms. Of all of them though, I think “data scientist” is the most useful. The way I look at it, a data scientist worth their salt knows statistics, machine learning *and* parallel and distributed computing.
Inferential science would be one thing, big data another. Big data is a marketing term used to sell you more IT infrastructure. In the IT world, inferential systems would be epistemological systems that turned into mythical beasts when Oracle succeeded in wrecking the business rules technology movement. Still, DARPA is pushing inferential warfare, so those systems will leak out into the civilian world at some point.
Calling someone a data scientist implies they are doing things in a scientific manner. Clearly that is not the case with data science.
How much of doing an analysis is experimental, as in you guess your outcome and then build toward it with your tools? That would be bias.
I really do not like the evolution we’ve seen in the meaning of the term “data.” Apparently, it now means large amounts of quantitative data.
Anything that says that qualitative data or small amounts of quantitative data are not really data is a problem.