Rise and Fall of the Third Normal Form

The ideas for relational databases were worked out in the 1970s and the first commercial implementations appeared around 1980. By the 1990s relational databases were the dominant way to store data. There were some non-relational databases in use, but these were not popular. Hierarchical databases seemed quaint, clinging to pre-relational approaches that had been deemed inferior by the march of progress. Object databases just seemed weird.

Now the tables have turned. Relational databases are still dominant, but all the cool kids are using NoSQL databases. A few years ago the implicit assumption was that nearly all data lived in a relational database; now you hear statements such as “Relational databases are still the best approach for some kinds of projects,” implying that such projects are a small minority. Some of the praise for NoSQL databases is hype, but some is deserved: there are numerous successful applications using NoSQL databases because they need to, not because they want to be cool.

So why the shift to NoSQL, and why now? I’m not an expert in this area, but I’ll repeat some of the explanations that I’ve heard that sound plausible.

  1. Relational databases were designed in an era of small, expensive storage media. They were designed to conserve a resource that is now abundant. Non-relational databases may be less conservative with storage.
  2. Relational databases were designed for usage scenarios that not all applications follow. In particular, they were designed to make writing easier than reading. But it’s not uncommon for a web application to do 10,000 reads for every write.
  3. The scalability of relational database transactions is fundamentally limited by Brewer’s CAP theorem: you can’t have consistency, availability, and partition tolerance all in one distributed system. You have to pick two out of three, and since a distributed system can’t opt out of network partitions, the practical choice is between consistency and availability.
  4. Part of the justification for relational databases was that multiple applications can share the same data by going directly to the same tables. Now applications share data through APIs rather than through tables. With n-tier architecture, applications don’t even access their own data directly through tables, much less another application’s data.

The object-oriented worldview of most application developers maps more easily to NoSQL databases than to relational databases. But in the past, this objection was brushed off. A manager might say, “I don’t care if it takes a lot of work for you to map your objects to tables. Data must be stored in tables.”

And why must data be stored in tables? One reason would be consistency, but #3 above says you’ll have to relax your ideas of consistency if you want availability and partition tolerance. Another reason would be in order to share data with other applications, but #4 explains why that isn’t necessary. Still another reason would be that the relational model has a theoretical basis. But so do NoSQL databases, or as Erik Meijer calls them, CoSQL databases.

I’m not an advocate of SQL or NoSQL databases. Each has its place. A few years ago developers assumed that nearly all data was best stored in a relational database. That was a mistake, just as it would be a mistake now to assume that all data should move to a non-relational database.

Life lessons from functional programming

Functional programming languages are either strict or lazy. Strict languages evaluate all arguments to a function before calling it. Lazy languages don’t evaluate arguments until necessary, which may be never. Strict languages evaluate arguments just-in-case, lazy languages evaluate them just-in-time.
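
Python is a strict language, but you can emulate laziness by wrapping arguments in thunks (zero-argument functions). Here is a minimal sketch of just-in-case versus just-in-time evaluation; the function names are made up for illustration:

    def expensive(label):
        # Stand-in for a costly computation.
        print("computing", label)
        return label

    # Strict, just-in-case: Python evaluates both arguments
    # before the call, even though only one is used.
    def pick_strict(flag, a, b):
        return a if flag else b

    pick_strict(True, expensive("a"), expensive("b"))  # computes both a and b

    # Lazy, just-in-time: pass thunks and force only the one needed.
    def pick_lazy(flag, a_thunk, b_thunk):
        return a_thunk() if flag else b_thunk()

    pick_lazy(True, lambda: expensive("a"), lambda: expensive("b"))  # computes only a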

Lazy languages are called lazy because they avoid doing work. Or rather they avoid doing unnecessary work. People who avoid doing unnecessary work are sometimes called lazy, too, though unfairly so. In both cases, the term “lazy” can be misleading.

Imagine two computers, L and S. L is running a queue of lazy programs and S a queue of strict programs. L may do less work per program by avoiding pointless calculations, but it may accomplish more productive work than S overall, because the time saved on each individual program lets it start the next program sooner. It seems unfair to call the computer that’s accomplishing more the “lazy” computer. This also could apply to people. Sometimes the people who don’t appear to be working so hard are the ones who accomplish more.

A person who approaches life like a strict programming language is not very smart, and even lazy in the sense of not exercising judgment. Conversely, someone who bypasses an unnecessary task to move on to a necessary one, maybe one more difficult than the one skipped over, should hardly be called lazy.

Remembering COM

In the late ’90s I thought COM (Microsoft’s Component Object Model) was the way of the future. The whole architecture starting with the IUnknown interface was very elegant. And to hear Don Box explain it, COM was almost inevitable.

I was reminded of COM when I saw the slides for Kevlin Henney’s presentation Worse is better, for better or for worse. Henney quotes William Cook’s paper that has this to say about COM:

To me, the prohibition of inspecting the representation of other objects is one of the defining characteristics of object oriented programming.  …

One of the most pure object-oriented programming models yet defined is the Component Object Model (COM). It enforces all of these principles rigorously. Programming in COM is very flexible and powerful as a result.

And yet programming COM was painful. I wondered at the time why something so elegant in theory was so difficult in practice.  I have some ideas on this, but I haven’t thought through them enough to write them down.

Just as I was starting to grok COM, someone from Microsoft told me “COM is dead” and hinted at a whole new approach to software development that would be coming out of Microsoft, what we now call .NET.

COM had some good ideas, and some of these ideas have been reborn in WinRT. I’ve toyed with the idea of a blog post like “Lessons from COM,” but I doubt I’ll ever write that. This post is probably as close as I’ll get.

By the way, this post speaks of COM in the past tense as is conventional: we speak of technologies in the past tense, not when they disappear, but when they become unfashionable. Countless COM objects are running on countless machines, including on Mac and Unix systems, but COM has definitely fallen out of fashion.

Defensible software

It’s not enough for software to be correct. It has to be defensible.

I’m not thinking of defending against malicious hackers. I’m thinking about defending against sincere critics. I can’t count how many times someone was absolutely convinced that software I had a hand in was wrong when in fact it was performing as designed.

In order to defend software, you have to understand what it does. Not just one little piece of it, but the whole system. You need to understand it better than the people who commissioned it: the presumed errors may stem from unforeseen consequences of the specification.

Related post: The buck stops with the programmer

Big data cube

Erik Meijer’s paper Your Mouse is a Database has an interesting illustration of “The Big Data Cube” using three axes to classify databases.

The volume axis is big vs. small, or perhaps better, open vs. closed. Relational databases can be large, and non-relational databases can be small. But the relational database model is closed in the sense that “it assumes a closed world that is under full control by the database.”

The velocity axis is (synchronous) pull vs. (asynchronous) push. The variety axis captures whether data is stored by foreign-key/primary-key relations or key-value pairs.

Here are the corners identified by the paper:

  • Traditional RDBMS (small, pull, fk/pk)
  • Hadoop HBase (big, pull, fk/pk)
  • Object/relational mappers (small, pull, k/v)
  • LINQ to Objects (big, pull, k/v)
  • Reactive Extensions (big, push, k/v)

How would you fill in the three corners not listed above?
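
One way to keep the geometry straight is to enumerate all eight corners. Here’s a small Python sketch that labels the five corners from the paper and leaves the remaining three as question marks:

    from itertools import product

    # The three axes of the cube.
    VOLUME = ("small", "big")
    VELOCITY = ("pull", "push")
    VARIETY = ("fk/pk", "k/v")

    # The five corners identified by the paper.
    labeled = {
        ("small", "pull", "fk/pk"): "Traditional RDBMS",
        ("big", "pull", "fk/pk"): "Hadoop HBase",
        ("small", "pull", "k/v"): "Object/relational mappers",
        ("big", "pull", "k/v"): "LINQ to Objects",
        ("big", "push", "k/v"): "Reactive Extensions",
    }

    # Print all eight corners; three come out as question marks.
    for corner in product(VOLUME, VELOCITY, VARIETY):
        print(corner, "->", labeled.get(corner, "?"))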

Pipelines and whirlpools

Walter Bright made an interesting point in his talk at GOTO Aarhus this afternoon. He says that software developers like to think of their code as pipelines. Data comes in, goes through an assembly-line-like series of processing steps, and comes out the end.

input -> step1 -> step2 -> step3 -> output

And yet code hardly ever looks like that. Most software looks more like a whirlpool than a pipeline. Data swirls around in loops before going down the drain.

[Image: Arthur Rackham’s illustration for Poe’s “A Descent into the Maelström”]

These loops make it hard to identify parts that can be extracted into encapsulated components that can be snapped together or swapped out. You can draw a box around a piece of an assembly line and identify it as a component. But you can’t draw a box around a portion of a whirlpool and identify it as something that can be swapped out like a black box.

It is possible to write pipe-and-filter style apps, but the vast majority of programmers find loops more natural to write. And according to Walter Bright, it takes the combination of a fair number of programming language features to pull it off well, features which his D language has. I recommend watching his talk when the conference videos are posted. (Update: This article contains much of the same material as the conference talk.)

I’ve been thinking about pipeline-style programming lately, and wondering why it’s so much harder than it looks. It’s easy to find examples of Unix shell scripts that solve some problem by snapping a handful of utilities together with pipes. And yet when I try to write my own software like that, especially outside the context of a shell, it’s not as easy as it looks. Brian Marick has an extended example in his book of a problem that appears to require loops and branching logic but can be turned into a pipeline. I haven’t grokked his example yet; I want to go back and look at it in light of today’s talk.
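
One way to get the pipeline feel outside a shell is to make each stage a generator. Here’s a minimal Python sketch, roughly analogous to grep ERROR log.txt | cut -d' ' -f1 | sort -u; the file name and the filter string are hypothetical:

    # Each stage consumes a lazy stream and yields a new one.
    def read_lines(path):
        with open(path) as f:
            for line in f:
                yield line.rstrip("\n")

    def keep_errors(lines):
        for line in lines:
            if "ERROR" in line:  # hypothetical filter condition
                yield line

    def first_field(lines):
        for line in lines:
            yield line.split()[0]

    # input -> step1 -> step2 -> step3 -> output
    pipeline = first_field(keep_errors(read_lines("log.txt")))
    for item in sorted(set(pipeline)):
        print(item)

The loops don’t disappear; they’re confined inside the individual stages. At the top level the data flows straight through, and each stage is a box you can swap out.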

Dialing back the cleverness

Last night I ran into Damian Conway at a speaker’s dinner for this week’s GOTO conference. He’s one of the people I had in mind when I said I enjoy hearing from the Perl community even though I don’t use Perl.

We got to talking about Norris’ number, the amount of code an untrained programmer can write before hitting a brick wall. Damian pointed out that there’s something akin to Norris’ number for skilled programmers. He said that a few years ago he took this quote from Brian Kernighan to heart:

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.

He said he’d realized that he’d written some very complex code that was approaching his own level of cleverness, code that he couldn’t move forward himself, much less get anyone else to take ownership of. “That’s when I decided to dial back the cleverness a bit.”

Related post: Two kinds of software challenges

Russian novel programming

One of the things that makes Russian novels hard to read, at least for Americans, is that characters have multiple names. For example, in The Brothers Karamazov, Alexei Fyodorovich Karamazov is also called Alyosha, Alyoshka, Alyoshenka, Alyoshechka, Alexeichik, Lyosha, and Lyoshenka.

Russian novel programming is the anti-pattern of one thing having many names. For a given program, you may have a location in version control, a location on your hard drive, a project name, a name for the program executable, etc. Each of these may contain slight variations on the same name. Or major differences. For historical reasons, the code for foo.exe is in a project named ‘bar’, under a path named …

I thought about this today when looking into a question about a program. A single number had different names in several different contexts. There’s the text label on the desktop user interface, the name of the C# variable that captures the user input, the name of the corresponding C++ variable when the user input is passed to the back-end numeric code, and the name used in the XML file that serializes the variable when it goes between a database and a web server. Of course these should all be coordinated, but there were understandable historical reasons for how things got into this state.

No technology can ever be too arcane

In this fake interview, Linux creator Linus Torvalds says Linux has gotten too easy to use and that’s why people use Git:

Git has taken over where Linux left off separating the geeks into know-nothings and know-it-alls. I didn’t really expect anyone to use it because it’s so hard to use, but that turns out to be its big appeal. No technology can ever be too arcane or complicated for the black t-shirt crowd.

Emphasis added.

Note: If you want to leave a comment saying Linux or Git really aren’t hard to use, lighten up. This is satire.

Related post: Advanced or just obscure?