Jon Udell’s latest Interviews with Innovators podcast features Randall Julian of Indigo BioSystems. I found this episode particularly interesting because it deals with issues I have some experience with.
The problems in managing biological data begin with how to store the raw experiment data. As Julian says
… without buying into all the hype around semantic web and so on, you would argue that a flexible schema makes more sense in a knowledge gathering or knowledge generation context than a fixed schema does.
So you need something less rigid than a relational database and something with more structure than a set of Excel spreadsheets. That’s not easy, and I don’t know whether anyone has come up with an optimal solution yet. Julian said that he has seen many attempts to put vast amounts of biological data into a rigid relational database schema but hasn’t seen this approach succeed yet. My experience has been similar.
Representing raw experimental data isn’t enough. In fact, that’s the easy part. As Jon Udell comments during the interview
It’s easy to represent data. It’s hard to represent the experiment.
That is, the data must come with ample context to make sense of the data. Julian comments that without this context, the data may as well be a list of zip codes. And not only must you capture experimental context, you must describe the analysis done to the data. (See, for example, this post about researchers making up their own rules of probability.)
Julian comments on how electronic data management is not nearly as common as someone unfamiliar with medical informatics might expect.
So right now maybe 50% of the clinical trials in the world are done using electronic data capture technology. … that’s the thing that maybe people don’t understand about health care and the life sciences in general is that there is still a huge amount of paper out there.
Part of the reason for so much paper goes back to the belief that one must choose between highly normalized relational data stores and unstructured files. Given a choice between inflexible bureaucracy and chaos, many people choose chaos. It may work about as well, and it’s much cheaper to implement. I’ve seen both extremes. I’ve also been part of a project using a flexible but structured approach that worked quite well.