Conforming for tenure

Posted on 30 April 2012 by John

From AnnMaria De Mars’ most recent blog post:

Recently, a young person told me that I could hold to my principles about the importance of my family, honesty and equality—and any of a hundred other things because I had “made it.”

This troubled me. It troubles me when I hear the same thing from new Ph.Ds who are trying to get tenure. I don’t see how you can pretend to be someone else for 5 or 10 years until you have “made it” and then be your true self.

Emphasis added.

Open source and pride

Posted on 28 April 2012 by John

Liz Quilty explains how becoming an expert in open source software changed her life.

[I was a] high school dropout, I had no education, I was nobody. I’d made some poor choices, and I think at this point suddenly I was known for my knowledge rather than for my poor choices. And so it was quite good for me that I had some pride. I was like, this is really awesome, I’m really good at this, I really know this.

Source: FLOSS Weekly podcast, starting around 12:00.

A tip on using a French press

Posted on 27 April 2012 by John

When I first bought a French press, the instructions said to pour hot but not boiling water over the coffee. They were emphatic about what the temperature should not be, but vague about what it should be. (Boiling water extracts oils that you’d rather leave in the grounds; water a little cooler brings out just the oils you want.)

I asked someone online—sorry, I don’t remember who—and he said that after your water boils, set the kettle off the burner and check your mail. When you come back, the water should be at the right temperature. That turned out to be good advice.

I don’t know whether he meant postal mail or email, but it doesn’t make much difference, unless you get caught up replying to email and come back to room-temperature water.

Now if you really wanted to geek out on this, you could use Newton’s law of cooling along with the surface area, thickness, and material composition of your kettle to compute the time to let your water cool to 200 °F (93 °C). You could assume a kettle is a half sphere …
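For anyone who does want to geek out, here is a minimal sketch of the calculation. It skips the kettle geometry entirely and lumps surface area, thickness, and material into a single cooling constant. The value used for that constant (0.03 per minute) is made up for illustration and would have to be measured for a real kettle. The 70 °F kitchen is also an assumption.

```python
import math

def minutes_to_cool(start_f, target_f, room_f, k_per_min):
    """Solve Newton's law of cooling, T(t) = room + (start - room) * exp(-k t),
    for the time t at which T(t) equals target_f."""
    if not (room_f < target_f <= start_f):
        raise ValueError("need room_f < target_f <= start_f")
    return math.log((start_f - room_f) / (target_f - room_f)) / k_per_min

# Assumed inputs: water at 212 °F, a 70 °F kitchen, hypothetical k = 0.03 per minute.
wait = minutes_to_cool(212.0, 200.0, 70.0, 0.03)
print(f"Wait about {wait:.1f} minutes")
```

With a different cooling constant the wait scales inversely: halve k and the wait doubles.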

Related post: A childhood question about heat

How variable are percentiles?

Posted on 27 April 2012 by John

Suppose you’re trying to study the distribution of something by simulation. The average of your simulation values gives you an estimate of the mean value of the thing you’re simulating.

Next you want to have an idea how much the thing you’re simulating varies. You estimate the percentiles of your distribution by the percentiles of your samples. How variable are your estimates? For example, you estimate the 90th percentile of your distribution by finding the 90th percentile of your samples. How much is your estimate likely to change if you rerun your simulation with a new random number generator seed?

How does the variability of the percentiles depend on the percentile you’re looking at? For example, you might expect the median, i.e. the 50th percentile, to be the most stable. And as you start looking at more extreme percentiles, say the 5th or 95th, you might expect more variability. But how much more variability? A lot more or a little more?

To explore the questions above, let’s suppose we’re sampling from a standard normal distribution. (Of course if we really were sampling from something so well known as a normal distribution, we wouldn’t be doing simulation. But we need to pick something for illustration, so let’s use something simple.)

Suppose we simulate n values, sort them in increasing order, and take the kth smallest value. This is called the kth order statistic. For example, we might estimate the 90th percentile of our distribution by using n = 1000 and k = 900. The distribution of the kth order statistic from n samples is known analytically (it’s a short derivation; see here) and so we can use it to make some plots.
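For reference, the density can be written out as follows. The notation is introduced here rather than taken from the post: X_(k) is the kth order statistic counting from the smallest value, and Φ and φ are the standard normal CDF and PDF. This is the standard order statistic density. It equals a beta(k, n+1−k) density evaluated at Φ(y), multiplied by φ(y), which is the form the Python code further down uses.

```latex
f_{X_{(k)}}(y) \;=\; \frac{n!}{(k-1)!\,(n-k)!}\;
  \Phi(y)^{\,k-1}\,\bigl(1-\Phi(y)\bigr)^{\,n-k}\,\varphi(y)
```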

These plots use n = 1000. Each uses the same horizontal scale, so you can see how the distributions get wider as we move to higher percentiles. The distribution for the 95th percentile is about twice as wide as the distribution for the median.

Each plot is centered at the corresponding quantile for the standard normal distribution. For example, the plot for the distribution of the 75th percentile of the samples is centered at the 75th percentile of the standard normal.

The distribution for the sample median is unbiased, symmetric about 0. The distributions for the 75th and 90th percentiles are slightly biased to the left, i.e. they slightly underestimate the true values on average. The 95th percentile, however, is slightly biased in the opposite direction. [Update: As pointed out in the comments below, it seems the appearance of bias was a plotting artifact.]

If you’d like to see the details of how the plot was made, here is the Python code.

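import numpy as np
import scipy.stats
import matplotlib.pyplot as plt

def pdf(y, n, k):
    # Density of the kth order statistic (kth smallest of n standard normal samples).
    # Phi(X_(k)) has a beta(k, n+1-k) distribution; phi is the change-of-variables factor.
    Phi = scipy.stats.norm.cdf(y)
    phi = scipy.stats.norm.pdf(y)
    return scipy.stats.beta.pdf(Phi, k, n+1-k)*phi

def plot_percentile(p):
    # Center each plot at the true quantile and use the same width for every panel.
    center = scipy.stats.norm.ppf(0.01*p)
    t = np.linspace(center-0.25, center+0.25, 100)
    # With n = 1000 samples, the pth percentile corresponds to k = 10*p.
    plt.plot(t, pdf(t, 1000, 10*p))
    plt.title( "{0}th percentile".format(p) )

# One panel per percentile in a 2 x 2 grid.
plt.subplot(221)
plot_percentile(50)
plt.subplot(222)
plot_percentile(75)
plt.subplot(223)
plot_percentile(90)
plt.subplot(224)
plot_percentile(95)
plt.show()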
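As a sanity check on the analytic plots, one could also just rerun the simulation many times and look at how much each sample percentile moves around. The sketch below is not part of the original code. The seed and the number of repetitions (2000) are arbitrary choices, and it takes the kth smallest value directly rather than calling a percentile function, to match the order statistic used above.

```python
import numpy as np

rng = np.random.default_rng(20120427)  # arbitrary seed, chosen for illustration
n, reps = 1000, 2000

# Each row is one simulated run of n standard normal draws, sorted ascending.
samples = np.sort(rng.standard_normal((reps, n)), axis=1)

spread = {}
for p in (50, 75, 90, 95):
    k = 10 * p                      # same convention as the plots: k = 10*p when n = 1000
    estimates = samples[:, k - 1]   # kth smallest value in each run
    spread[p] = estimates.std()     # variability of the estimate across runs
    print(f"{p}th percentile: std dev across runs = {spread[p]:.4f}")

print(f"ratio of 95th to 50th: {spread[95] / spread[50]:.2f}")
```

The spread should grow steadily from the median out to the 95th percentile, in line with the plots.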

Simmer reading list

Posted on 26 April 2012 by John

One of my friends mentioned his “simmer reading” yesterday. It was a typo—he meant to say “summer”—but a simmer reading list is interesting.

Simmer reading makes me think of a book that stays on your nightstand as other books come and go, like a pot left to simmer on the back burner of a stove. It’s a book you read a little at a time, maybe out of order, not something you’re trying to finish like a project.

What are some of your simmer reading books?

Related post: A book so good I had to put it down

The cult of average

Posted on 25 April 2012 by John

Shawn Achor comments on “the cult of the average” in science.

So one of the very first things we teach people in economics and statistics and business and psychology is how, in a statistically valid way, do we eliminate the weirdos. How do we eliminate the outliers so we can find the line of best fit? Which is fantastic if I’m trying to find out how many Advil the average person should be taking — two. But if I’m interested in potential, if I’m interested in your potential, or for happiness or productivity or energy or creativity, what we’re doing is we’re creating the cult of the average with science. … If we study what is merely average, we will remain merely average.

Chaotic versus random

Posted on 24 April 2012 by John

From John D. Barrow’s chapter in Design and Disorder:

The standard folklore about chaotic systems is that they are unpredictable. They lead to out-of-control dinosaur parks and out-of-work meteorologists. …

Classical … chaotic systems are not in any sense intrinsically random or unpredictable. They merely possess extreme sensitivity to ignorance. Any initial uncertainty in our knowledge of a chaotic system’s state is rapidly amplified in time.

… although they become unpredictable when you try to determine the future from a particular uncertain starting value, there may be a particular stable statistical spread of outcomes after a long time, regardless of how you started out.

Emphasis added.
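To make the quote concrete, here is a small sketch using the logistic map x → 4x(1 − x). The map is an illustration chosen here, a standard textbook example of chaos, and is not taken from Barrow’s chapter. The starting values and step counts are arbitrary.

```python
def logistic_orbit(x0, steps, r=4.0):
    """Iterate x -> r*x*(1-x) starting from x0 and return every value visited."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = logistic_orbit(0.2, 20000)
b = logistic_orbit(0.2 + 1e-10, 20000)  # nearly identical starting value

# Sensitivity to ignorance: the tiny initial difference is amplified quickly.
max_gap = max(abs(x - y) for x, y in zip(a[:100], b[:100]))
print(f"largest gap in the first 100 steps: {max_gap:.3f}")

# Stable statistical spread: long-run fraction of time spent below 0.5.
frac_a = sum(x < 0.5 for x in a) / len(a)
frac_b = sum(x < 0.5 for x in b) / len(b)
print(f"fraction of time below 0.5: {frac_a:.3f} vs {frac_b:.3f}")
```

The two orbits stop resembling each other within a few dozen steps, yet their long-run summaries come out nearly the same.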

100x better approach to software?

Posted on 23 April 2012 by John

Alan Kay speculates in this talk that 99% or even 99.9% of the effort that goes into creating a large software system is not productive. Even if the ratio of overhead and redundancy to productive code is not as high as 99 to 1, it must be pretty high.

Note that we’re not talking here about individual programmer productivity. Discussions of 10x or 100x programmers usually assume that these are folks who can use basically the same approach and same tools to get much more work done. Here we’re thinking about better approaches more than better workers.

Say you have a typical system of 10,000,000 lines of code. How many lines of code would a system with the same desired features (not necessarily all the actual features) require?

1. If the same team of developers had worked from the beginning with a better design?
2. If the same team of developers could rewrite the system from scratch using what they learned from the original system?
3. If a thoughtful group of developers could rewrite the system without time pressure?
4. If a superhuman intelligence could rewrite the system, something approaching Kolmogorov complexity?

Where does the wasted effort in large systems go, and how much could be eliminated? Part of the effort goes into discovering requirements. Large systems never start with a complete and immutable specification. That’s addressed in the first two questions.

I believe Alan Kay is interested in the third question: How much effort is wasted by brute force mediocrity? My impression from watching his talk is that he believes this is the biggest issue, not evolving requirements. He said elsewhere:

Most software today is very much like an Egyptian pyramid with millions of bricks piled on top of each other, with no structural integrity, but just done by brute force and thousands of slaves.

There’s a rule that says bureaucracies follow a 3/2 law: to double productivity, you need three times as many employees. If a single developer could produce 10,000 lines of code in some period of time, you’d need to double that 10 times to get to 10,000,000 lines of code. That would require tripling your workforce 10 times, resulting in 3^10 = 59,049 programmers. That sounds reasonable, maybe even a little optimistic.
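The arithmetic is easy to check. This sketch simply applies the rule as stated (double the output, triple the staff) until the output reaches the target. The variable names are made up for illustration.

```python
lines_per_developer = 10_000
target_lines = 10_000_000

output = lines_per_developer
staff = 1
doublings = 0
while output < target_lines:
    output *= 2     # double productivity ...
    staff *= 3      # ... at the cost of tripling the workforce
    doublings += 1

print(doublings, output, staff)  # 10 10240000 59049
```

Put another way, under this rule headcount grows like output raised to the power log2(3), roughly 1.58.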

Is that just the way things must be, that geniuses are in short supply and ordinary intelligence doesn’t scale efficiently? How much better could we do with available talent if we took a better approach to developing software?

Random is as random does

Posted on 19 April 2012 by John

What is randomness? Nobody knows, or at least there’s no consensus. Everybody has some vague ideas what randomness is, but when you dig into it deeply enough you find all kinds of philosophical quandaries. If you’d like a taste of the subtleties, you could start by reading one of Gregory Chaitin’s books. Or chew on this tome.

What is a random variable? That’s easy. It’s a measurable function on a probability space. What’s a probability space? Easy too. It’s a measure space such that the measure of the entire space is 1.

Probability theory avoids defining randomness by working with abstractions like random variables. This is actually a very sensible approach and not mere legerdemain. Mathematicians can prove theorems about probability and leave the interpretation of the results to others.

As far as applications are concerned, it often doesn’t matter whether something is random in some metaphysical sense. The right question isn’t “is this system random?” but rather “is it useful to model this system as random?” Many systems that no one believes are random can still be profitably modeled as if they were random.

Probability models are just another class of mathematical models. Modeling deterministic systems using random variables should be no more shocking than, for example, modeling discrete things as continuous. For example, cars come in discrete units, and they certainly are not fluids. But sometimes it’s useful to model the flow of traffic as if it were a fluid. (And sometimes it’s not.)

Random phenomena are studied using computer simulations. And these simulations rely on random number generators, deterministic programs whose output is considered random for practical purposes. This bothers some people who would prefer a “true” source of randomness. Such concerns are usually misplaced. In most cases, replacing a random number generator with some physical source of randomness would not make a detectable difference. The output of the random number generator might even be higher quality since the measurement of the physical source could introduce a bias.
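A short sketch of the point about determinism, using Python’s standard `random` module. The seed (42) and sample sizes are arbitrary. Two generators started from the same seed produce exactly the same “random” values, and yet the output still behaves like uniform draws for practical purposes.

```python
import random

gen_a = random.Random(42)   # arbitrary seed
gen_b = random.Random(42)   # same seed, separate generator

seq_a = [gen_a.random() for _ in range(5)]
seq_b = [gen_b.random() for _ in range(5)]
print(seq_a == seq_b)       # True: the generator is a deterministic program

# For practical purposes the output still looks uniform on [0, 1).
draws = [gen_a.random() for _ in range(100_000)]
mean = sum(draws) / len(draws)
print(f"mean of 100,000 draws: {mean:.4f}")
```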

Just what do you mean by "number"?

Posted on 18 April 2012 by John

Tom Christiansen gave an awesome answer to the question of how to match a number with a regular expression. He begins by clarifying what the reader means by “number”, then gives answers for each.

Is −0 a number?
How do you feel about √−1?
Is ⅝ or ⅔ a number?
Is 186,282.42±0.02 miles/second one number — or is it two or three of them?
Is 6.02e23 a number?
Is 3.141_592_653_589 a number? How about π, or ℯ? And −2π⁻³ ͥ?
How many numbers in 0.083̄?
How many numbers in 128.0.0.1?
What number does ⚄ hold? How about ⚂⚃?
Does 10,5 mm have one number in it — or does it have two?
Is ∛8³ a number — or is it three of them?
What number does ↀↀⅮⅭⅭⅬⅫ AUC represent, 2762 or 2009?
Are ४५६७ and ৭৮৯৮ numbers?
What about 0377, 0xDEADBEEF, and 0b111101101?
Is Inf a number? Is NaN?
Is ④② a number? What about ⓰?
How do you feel about ㊅?
What do ℵ₀ and ℵ₁ have to do with numbers? Or ℝ, ℚ, and ℂ?

See his full response here. Thanks to Bill the Lizard for pointing this out.
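To see why the clarifying questions matter, here is a sketch of a pattern for one deliberately narrow definition of “number”: an optional ASCII sign, ASCII decimal digits with an optional fractional part, and an optional exponent. This is not Tom Christiansen’s answer, just an illustration in Python of how much such a definition leaves out. The names `DECIMAL` and `is_decimal` are made up here.

```python
import re

# Narrow definition: optional +/- sign, ASCII digits, optional fraction, optional exponent.
# re.ASCII restricts \d to 0-9, so digits from other scripts are excluded.
DECIMAL = re.compile(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?", re.ASCII)

def is_decimal(text):
    """Return True if the whole string matches the narrow definition above."""
    return DECIMAL.fullmatch(text) is not None

# "-0" uses an ASCII hyphen; "−0" uses the Unicode minus sign from the list above.
for s in ["6.02e23", "-0", "−0", "0xDEADBEEF", "10,5", "४५६७", "Inf"]:
    print(f"{s!r}: {is_decimal(s)}")
```

Only the first two strings match. Each of the others needs a different answer to “what do you mean by number?”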
