Missing data throws a monkey wrench into otherwise elegant plans. Yesterday’s post on genetic sequence data illustrates this point. DNA sequences consist of four bases, but we need to make provision for storing a fifth value for unknowns. If you know there’s a base in a particular position, but you don’t know what its value is, it’s important to record this unknown value to avoid throwing off the alignment of the sequence.
There are endless debates over how to handle missing data because missing data is a dilemma to be managed rather than a problem to be solved. (See Problems vs Dilemmas.)
It’s simply a fact of life that data will be incomplete. The debate is over how to represent and handle missingness. Maybe the lowest level of a software application represents missing data and the highest uses complete data only. The level at which missing values are removed, and how they are removed, depends very much on context.
A naive approach to missing data is to not allow it. We’ve all used software that demands that we enter a value for some field whether a value exists or not. Maybe you have to enter a middle name, even though you don’t have a middle name. Or maybe you have to enter your grandfather’s name even though you don’t know his name.
Note that the two examples above illustrate two kinds of missing data: one kind does not exist, while the other certainly exists but is unknown. In practice there are entire taxonomies of missing data. Is it unknown or non-existent? If it is unknown, why is it unknown? If it does not exist, why doesn’t it?
There can be information in missing information. For example, suppose a clinical trial tracks how long people survive after a given treatment. You won’t have complete data until everyone in the study has died. In the meantime, their date of death is missing. If someone’s date of death is missing because they’re still alive, that’s information: you know they’ve survived at least until the current point in time. If someone’s date of death is missing because they were lost to follow-up, i.e. they dropped out of the study and you lost contact with them, that’s different: you only know they survived until your last contact with them.
The simplest approach to missing data is to throw it away. That can be acceptable in some circumstances, particularly if the amount of missing data is small. But simply discarding incomplete records can be disastrous. In wide data, data with many different fields per subject, maybe none of your data is complete. Maybe there are many columns and every row is missing something in at least one column.
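To get a feel for how quickly complete rows disappear in wide data, here is a back-of-the-envelope sketch. It assumes each field is missing independently with the same probability, which real data rarely satisfies; the function name and the 5% figure are made up for illustration.

```python
def fraction_complete(num_columns: int, p_missing: float) -> float:
    """Expected fraction of rows with no missing fields.

    Assumption (for illustration only): each field is missing
    independently with probability p_missing.
    """
    return (1.0 - p_missing) ** num_columns


# How the expected share of complete rows shrinks as columns are added,
# with a hypothetical 5% chance that any given field is missing.
for k in (5, 20, 50, 100):
    print(k, round(fraction_complete(k, 0.05), 3))
```

Under these assumptions, with 50 columns fewer than 8% of rows would be expected to survive a "complete cases only" filter.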
Throwing away incomplete data can be inefficient or misleading. In the survival study example above, throwing out missing data would give you a very pessimistic assessment of the treatment. The people who lived the longest would be excluded precisely because they’re still living! Your analysis would be based only on those who died shortly after treatment.
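Here is a minimal sketch of that effect. The post does not name a method; the code below uses the standard Kaplan-Meier estimator as one common way to keep still-living subjects in the analysis. The eight subjects and their follow-up times are invented for the example.

```python
from typing import List, Tuple

# Hypothetical data: (months of follow-up, True if death was observed).
# False means the subject was still alive at last contact (censored).
subjects: List[Tuple[float, bool]] = [
    (2, True), (3, True), (5, True), (6, True),
    (24, False), (24, False), (24, False), (24, False),
]


def kaplan_meier(data: List[Tuple[float, bool]], t: float) -> float:
    """Kaplan-Meier estimate of the probability of surviving past time t."""
    survival = 1.0
    # Distinct times, up to t, at which at least one death was observed.
    death_times = sorted({time for time, died in data if died and time <= t})
    for d in death_times:
        # Everyone still under observation just before time d.
        at_risk = sum(1 for time, _ in data if time >= d)
        deaths = sum(1 for time, died in data if died and time == d)
        survival *= 1.0 - deaths / at_risk
    return survival


# Throwing away subjects whose date of death is missing.
deaths_only = [(time, died) for time, died in subjects if died]
print(kaplan_meier(deaths_only, 12))

# Keeping them as censored observations.
print(kaplan_meier(subjects, 12))
```

With the living subjects discarded, the estimated chance of surviving 12 months is zero. With them kept as censored observations, it is about one half.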
Analysis of data with missing values is a world unto itself. It seems paradoxical at first to devise ways to squeeze information out of data that isn’t there. But there are many ways to do just that, each with pros and cons. There are subtle ways to infer the missing values, while also accounting for the fact that these values have been inferred. If done poorly, this can increase bias, but if done well it decreases bias.
Analysis techniques that account for missing data are more complicated than techniques that do not. But they are worth the effort if throwing away missing data would leave you with too little data or give you misleading results. If you’re not concerned about the former, perhaps you should be concerned about the latter. The bias introduced by discarding incomplete data could be hard to foresee until you’ve analyzed the data properly accounting for missing values.
Very important distinctions you’re making here. I’ve found that when working with data that is filled in organically on a rolling basis (such as retroactively populating a repository of knowledge as information is discovered/recovered), it is often useful to make the distinction you’ve touched on – what C.J. Date at one time called “missing but applicable” (we know Bob has a birthday, but we don’t know what it is and may never know), vs “missing but inapplicable” (Bob is known to have no middle name).
From a database perspective, treating both of these as NULLs, but with an additional field to track what “flavor” of NULL they are, allows us to perform NULL-ignoring actions like averaging, etc, while still being able to present only the *actionable* incomplete information to users – as potential contributors.
For example, we might not need an application to ask, “Do you know Bob’s middle name?” – but we *do* want it to ask, “Do you know Bob’s birthday?”
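A rough sketch of the idea, in Python rather than SQL: every name here (`MissingKind`, `Cell`, `questions_to_ask`, the field names) is hypothetical and only meant to illustrate pairing a NULL with a tag for its "flavor".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class MissingKind(Enum):
    UNKNOWN = "missing but applicable"         # exists, but we don't know it
    INAPPLICABLE = "missing but inapplicable"  # known not to exist


@dataclass
class Cell:
    value: Optional[object] = None         # None plays the role of NULL
    missing: Optional[MissingKind] = None  # meaningful only when value is None


def mean_ignoring_missing(cells: List[Cell]) -> Optional[float]:
    """Average the present values, skipping NULLs of either flavor."""
    values = [c.value for c in cells if c.value is not None]
    return sum(values) / len(values) if values else None


def questions_to_ask(name: str, record: Dict[str, Cell]) -> List[str]:
    """Prompt only for fields that exist but are unknown."""
    return [
        f"Do you know {name}'s {field}?"
        for field, cell in record.items()
        if cell.value is None and cell.missing is MissingKind.UNKNOWN
    ]


bob = {
    "middle name": Cell(missing=MissingKind.INAPPLICABLE),
    "birthday": Cell(missing=MissingKind.UNKNOWN),
}
print(questions_to_ask("Bob", bob))

ages = [Cell(40), Cell(missing=MissingKind.UNKNOWN), Cell(30)]
print(mean_ignoring_missing(ages))
```

Aggregates treat both flavors the same way, while the prompt logic uses the tag to ask about Bob's birthday and stay quiet about his middle name.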