Quality over quantity

“Whatever is true, whatever is honorable, whatever is just, whatever is pure, whatever is lovely, whatever is commendable, if there is any excellence, if there is anything worthy of praise, think about these things.” — Philippians 4:8

“Ninety percent of everything is crud.” — Theodore Sturgeon [1]


I often think about quality and quantity. It’s so easy, particularly in America, to get sucked into substituting quantity for quality. For example, it’s how we eat. Striving for quality over quantity sounds good, but it’s not easy. It helps to have periodic reminders to go against the stream and pursue quality. Yesterday I got such a reminder at Edward Tufte’s one-day course in Houston.

The course emphasizes eliminating frills and administrative debris to make room for high quality displays of information. The course teaches and demonstrates a commitment to quality. At one point Tufte spoke more generally and more personally about pursuing quality over quantity.

He said most papers are not worth reading and that he learned early on to concentrate on the great papers, maybe one in 500, that are worth reading and rereading rather than trying to “keep up with the literature.” He also explained how over time he has concentrated more on showcasing excellent work than on criticizing bad work. You can see this in the progression from his first book to his latest. (Criticizing bad work is important too, but you’ll have to read his early books to find more of that. He won’t spend as much time talking about it in his course.) That reminded me of Jesse Robbins’ line: “Don’t fight stupid. You are better than that. Make more awesome.”

* * *

[1] Sturgeon’s law is usually stated as “Ninety percent of everything is crap,” though that’s not what he said. The original quip was “Sure, 90% of science fiction is crud. That’s because 90% of everything is crud.”

Probability is subtle

When I was in college, I overheard two senior faculty arguing over an undergraduate probability homework assignment. This seemed very strange. It occurred to me that I’d never seen faculty argue over something elementary before, and I couldn’t imagine an argument over, say, a calculus homework problem. Professors might forget how to do a calculus problem, or make a mistake in a calculation, but you wouldn’t see two professors defending incompatible solutions.

Intuitive discussions of probability are very likely to be wrong. Experts know this. They’ll say things like “I imagine the answer is around this, but I’d have to go through the calculations to be sure.” Probability is not like physics where you can usually get within an order of magnitude of a correct answer without formal calculation. Probabilistic intuition doesn’t take you as far as physical intuition.

* * *

For daily posts on probability, follow @ProbFact on Twitter.


What key has 30 sharps?

Musical keys typically have 0 to 7 sharps or flats, but we can imagine adding any number of sharps or flats.

When you go up a fifth (seven half steps) you add a sharp. For example, the key of C has no sharps or flats, G has one sharp, D has two, etc. Starting from C and adding 30 sharps means going up 30*7 half-steps. Musical notes operate modulo 12 since there are 12 half-steps in an octave. 30*7 is congruent to 6 modulo 12, and six half-steps up from C is F#. So the key with 30 sharps would be the same pitches as F#.
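The arithmetic above can be sketched in a few lines of Python. The note names and the `pitch_class_of_key` function are my own illustration, not from the original post:

```python
# Pitch classes by half-steps above C, using sharp spellings.
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class_of_key(num_sharps):
    """Pitch class of the key with the given number of sharps.

    Each added sharp moves the tonic up a fifth (7 half-steps),
    and pitch classes work modulo 12.
    """
    return NOTES[(7 * num_sharps) % 12]

print(pitch_class_of_key(30))  # F#
```

Sanity checks: one sharp gives G, two give D, and thirty give the same pitch class as F#.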

But the key wouldn’t be called F#. It would be D quadruple sharp! I’ll explain below.

Sharps are added in the order F, C, G, D, A, E, B, and the name of the key is a half step higher than the last sharp. For example, the key with three sharps is A, and the notes that are sharp are F#, C#, and G#.

In the key of C#, all seven notes are sharp. Now what happens if we add one more sharp? We start over and start adding more sharps in the same order. F was already sharp, and now it would be double sharp. So the key with eight sharps is G#. Everything is sharp except F, which is double sharp.

In a key with 28 sharps, we’ve cycled through F, C, G, D, A, E, and B four times. Everything is quadruple sharp. To add two more sharps, we sharpen F and C one more time, making them quintuple sharp. The note one half-step higher than C quintuple sharp is D quadruple sharp, which is enharmonic with F#.
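The naming rule above can also be automated. Here is a sketch (the function `key_name` and the lookup tables are my own, not from the post): distribute the sharps over F, C, G, D, A, E, B round-robin, find the last note sharpened, and spell the key a half step above it using the next letter name.

```python
SHARP_ORDER = ["F", "C", "G", "D", "A", "E", "B"]
LETTERS = ["C", "D", "E", "F", "G", "A", "B"]
NATURAL = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def key_name(num_sharps):
    """Spelled name of the major key with num_sharps sharps (num_sharps >= 1)."""
    # Each full cycle of 7 sharpens every note once; the remainder
    # sharpens the first few notes of SHARP_ORDER one extra time.
    q, r = divmod(num_sharps, 7)
    counts = {note: q + (1 if i < r else 0)
              for i, note in enumerate(SHARP_ORDER)}
    last = SHARP_ORDER[(num_sharps - 1) % 7]       # last sharp added
    letter = LETTERS[(LETTERS.index(last) + 1) % 7]  # next letter name
    # Pitch a half step above the (possibly multiply) sharpened last note.
    pitch = (NATURAL[last] + counts[last] + 1) % 12
    sharps_on_letter = (pitch - NATURAL[letter]) % 12
    return letter + "#" * sharps_on_letter

print(key_name(30))  # D#### (D quadruple sharp)
```

As checks: three sharps gives A, eight gives G#, and thirty gives D quadruple sharp, agreeing with the discussion above.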

You could repeat this exercise with flats. Going up a fourth (five half-steps) adds a flat. Or you could think of a flat as a negative sharp.


Braille, Unicode, and Binary

Braille characters live in a 4×2 matrix. This means there are eight positions where the surface is either flat or raised. You can naturally denote a Braille character by an 8-bit binary number: the bit for a single position is 0 for flat or 1 for raised.

This is how Braille characters are encoded in Unicode. Braille characters occupy code points U+2800 through U+28FF: hexadecimal 2800 plus the binary number corresponding to the pattern of dots. However, there’s one surprise: the dots are numbered irregularly as indicated below:

1 4
2 5
3 6
7 8

Historically a Braille cell had six dots, a 3×2 matrix, and the numbering made more sense: consecutive numbers, by column, left to right, the way Fortran stores matrices:

1 4
2 5
3 6

But when Braille was extended to a 4×2 matrix, the new positions were labeled 7 and 8 so as not to rename the previous positions.

The numbered positions above correspond to the last eight bits of the Unicode character, from right to left. That is, position 1 determines the least significant bit and position 8 determines the 8th bit from the end.

For example, here is Unicode character U+288A:

[Braille character U+288A]

The dots that are filled in correspond to positions 2, 4, and 8, so the last eight bits of the Unicode value are 10001010. The hexadecimal form of 10001010 is 8A, and the Unicode character is U+288A.
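The mapping can be sketched in Python. The function name `braille_char` is my own; the encoding rule is the one described above: dot k sets bit k−1 of the offset from U+2800.

```python
def braille_char(dots):
    """Unicode Braille character for a set of raised dot positions (1-8)."""
    offset = sum(1 << (d - 1) for d in dots)  # dot k -> bit k-1
    return chr(0x2800 + offset)

# Dots 2, 4, and 8 set bits 1, 3, and 7: 0b10001010 = 0x8A.
print(hex(ord(braille_char({2, 4, 8}))))  # 0x288a
```

An empty set of dots gives U+2800, the blank Braille pattern.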

Heterogeneous data

I have a quibble with the following paragraph from Introducing Windows Azure for IT Professionals:

The problem with big data is that it’s difficult to analyze it when the data is stored in many different ways. How do you analyze data that is distributed across relational database management systems (RDBMS), XML flat-file databases, text-based log files, and binary format storage systems?

If data are in disparate file formats, that’s a pain. And from an IT perspective that may be as far as the difficulty goes. But why would data be in multiple formats? Because it’s different kinds of data! That’s the bigger difficulty.

It’s conceivable, for example, that a scientific study would collect the exact same kinds of data at two locations, under as similar conditions as possible, but one site put their data in a relational database and the other put it in XML files. More likely the differences go deeper. Maybe you have lab results for patients stored in a relational database and their phone records stored in flat files. How do you meaningfully combine lab results and phone records in a single analysis? That’s a much harder problem than converting storage formats.

* * *

For daily tips on data science, follow @DataSciFact on Twitter.


A year of consulting

I’ve been out on my own for about a year now, and it’s been a blast. If you’ve read this blog for a while you won’t be surprised to hear that I’ve been working in math, software development, and especially the overlap of the two.

As far as areas of math, I did more probability modeling than anything else. Also some work with time series, differential equations, networks, and to my surprise, a little category theory. As for software, I mostly worked in Python, R, and C++, writing code for data analysis and numerical algorithms.

People often ask what industry I work with, but my work cuts across industries. Last year I worked for a couple of pharmaceutical companies, a couple of software companies, a search engine, etc. The most unexpected clients I had were a game developer and a wallet manufacturer.

I did a lot of small projects last year, especially when I was first getting started. It’s hard to live off small projects, but they’re fun. Micro-consulting on retainer is better. You get the variety and sense of accomplishment of small projects, but with more steady income. I have larger projects now, but I plan to keep squeezing in a few smaller projects as well as micro-consulting and mentoring.

It looks like this year will be busier than last. I have a lot more lined up than I did this time last year. I expect to do the same kind of work I did last year, and to branch out a little as well, though it’s too early to say much about that.

I also expect to travel more this year. I’ll be in Santa Barbara and Los Angeles this week and Seattle later this month. In March I’m going to The Netherlands. If you’re in one of these areas and want to get together, please let me know.