Managing biological data

Jon Udell’s latest Interviews with Innovators podcast features Randall Julian of Indigo BioSystems. I found this episode particularly interesting because it deals with issues I have some experience with.

The problems in managing biological data begin with how to store the raw experimental data. As Julian says

… without buying into all the hype around semantic web and so on, you would argue that a flexible schema makes more sense in a knowledge gathering or knowledge generation context than a fixed schema does.

So you need something less rigid than a relational database and something with more structure than a set of Excel spreadsheets. That’s not easy, and I don’t know whether anyone has come up with an optimal solution yet. Julian said that he has seen many attempts to put vast amounts of biological data into a rigid relational database schema but hasn’t seen this approach succeed yet. My experience has been similar.

Representing raw experimental data isn’t enough. In fact, that’s the easy part. As Jon Udell comments during the interview

It’s easy to represent data. It’s hard to represent the experiment.

That is, the data must come with enough context to make sense of it. Julian comments that without this context, the data may as well be a list of zip codes. And not only must you capture the experimental context, you must also describe the analysis performed on the data. (See, for example, this post about researchers making up their own rules of probability.)

Julian comments on how electronic data management is not nearly as common as someone unfamiliar with medical informatics might expect.

So right now maybe 50% of the clinical trials in the world are done using electronic data capture technology. … that’s the thing that maybe people don’t understand about health care and the life sciences in general is that there is still a huge amount of paper out there.

Part of the reason for so much paper goes back to the belief that one must choose between highly normalized relational data stores and unstructured files. Given a choice between inflexible bureaucracy and chaos, many people choose chaos. It may work about as well, and it’s much cheaper to implement. I’ve seen both extremes. I’ve also been part of a project using a flexible but structured approach that worked quite well.

9 thoughts on “Managing biological data”

  1. That remark about being “hard to represent the experiment” makes me think of a question I’ve had about bio- and medical informatics for a while now: Just how much science does one have to know in order to be a good bio- or medical informaticist?

    I’ve been kicking around possibly someday doing an MS in bioinformatics; I’ve looked into some degree programs in informatics and have been turned back by the sheer number of biology courses in them. I enjoy biology, but to get an MS in bioinformatics I would need to take something like 30 hours in biology and chemistry in some of these programs, and that’s a lot of time and money spent there. But there are other programs that are just informatics with no science requirements. These are appealing for logistical and cost reasons, but I wonder how well they really prepare somebody to work in the discipline.

    Anybody with some expertise here have a comment about that?

  2. You may need a lot of biology to get a degree in bioinformatics, but you don’t need to know a lot of biology to do research in bioinformatics.

    Some bioinformatics researchers have a substantial background in biology, but many do not. It’s often possible to learn what you need to know just-in-time.

  3. I was at the UK e-science meeting last week, and much of the discussion revolved around data. The nice folk at myexperiment.org are starting to promote the idea of a “research object”. This would be a graph of artefacts deposited in a repository that represents the many interrelated bits of an experiment: part data, part notebook, part article. You could even expose it in RDF if you wished. I think the idea has a lot of promise, and could help with the data-experiment issue you mention above.

  4. How about sort of giving up on abstracting the data format and semantics too much, and storing the data together with snippets of code that can read, analyze, and export the relevant information from the data dump? It could be Perl/Python/Ruby, for instance, or R or Matlab modules, but something runnable that verifiably does what it says and runs in a standard environment. The code is both the analysis and the documentation of the process and of the data format itself. Sort of object orientation, but from a data point of view. (See the sketch at the end of the post.)

  5. Yes, what Janne said: the important thing is to keep as many of the supporting objects with the data as possible. It would be good to also be able to maintain the relationships, “script a runs on data set b”, “conclusion z is drawn from step y”, but beyond that the tools used to represent and store these relationships are an open question. myExperiment supports upload of workflows, such as those from Taverna. They also expose the relationships between the parts of a research object with the repository standard OAI-ORE, which lends itself to being represented in RDF, but you could look at extending a federation scheme like SWORD, which is built on top of Atom, and my favourite new tool to try to adapt to this mixing of data and logic is Google Wave.

    All of these considerations sadly remain moot, as the vast majority of scientific data (if counted by experiment and not by data volume) is stored in Excel files. That tends to be because big science projects get funding to take care of data citation and can invest in doing it properly (think of the vast amount of software that supports the LHC). Normal science, by contrast, tends to leave researchers to their own devices. Excel is not as good as data purists would wish, but it is very powerful, does the job for the most part, and in the few cases where a bench scientist might actually want to share their data, it’s very easy to do with Excel.

  6. Can you comment or blog in more detail about the project where you used “a flexible but structured approach that worked quite well”?

  7. David, I managed a project for a clinical trial data management framework that uses an EAV (entity, attribute, value) database approach on the back end. It was optimized for rapid development of data entry applications, and it made some deliberate trade-offs on other criteria. We got up and running quickly. We were collecting data and learning from client feedback while we would still have been drawing database schemas on whiteboards if we’d used a more traditional approach.

    I was also part of an effort to create a relational database for microarray experiment data. Like the projects Randall Julian refers to, this project folded and was replaced by something more like a document management system.
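
For readers unfamiliar with the pattern, here is a minimal sketch of the entity-attribute-value idea mentioned in comment 7 above. It is not the schema from that project; the table, column, and attribute names are hypothetical, and SQLite is used only because it is easy to run.

    import sqlite3

    # Entity-attribute-value store: one row per (entity, attribute) pair,
    # so new attributes can be added without altering the schema.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE observations (
            entity_id TEXT NOT NULL,  -- e.g. a patient or sample identifier
            attribute TEXT NOT NULL,  -- e.g. 'systolic_bp', 'visit_date'
            value     TEXT NOT NULL,  -- stored as text; typing is left to the application
            PRIMARY KEY (entity_id, attribute)
        )
    """)

    # Two records with different attribute sets, no schema change required.
    conn.executemany(
        "INSERT INTO observations VALUES (?, ?, ?)",
        [
            ("patient-001", "systolic_bp", "128"),
            ("patient-001", "visit_date", "2009-07-01"),
            ("patient-002", "tumor_stage", "II"),
        ],
    )

    # Pivot one entity back into a conventional-looking record.
    record = dict(
        conn.execute(
            "SELECT attribute, value FROM observations WHERE entity_id = ?",
            ("patient-001",),
        )
    )
    print(record)  # {'systolic_bp': '128', 'visit_date': '2009-07-01'}

The trade-off the comment alludes to is visible even in this toy example: adding a new attribute requires no schema change, but every query has to pivot rows back into records, and the database itself no longer enforces types or required fields.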

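Along the same lines, here is a rough sketch of the suggestion in comment 4: ship the data dump together with a small runnable script that both reads it and documents its format. The file name, column names, and field meanings below are invented for illustration.

    # read_assay.py -- lives alongside the (hypothetical) dump assay_results.csv.
    # The script is both the reader and the documentation of the format.
    import csv
    from statistics import mean

    def load(path="assay_results.csv"):
        """Each row: sample_id, compound, concentration_nM, response."""
        rows = []
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                row["concentration_nM"] = float(row["concentration_nM"])
                row["response"] = float(row["response"])
                rows.append(row)
        return rows

    def summarize(rows):
        """Mean response per compound -- the analysis travels with the data."""
        by_compound = {}
        for row in rows:
            by_compound.setdefault(row["compound"], []).append(row["response"])
        return {compound: mean(values) for compound, values in by_compound.items()}

    if __name__ == "__main__":
        print(summarize(load()))

Running the script regenerates the summary and, in effect, specifies the file format, which is the “code as documentation” point the comment makes.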