Conforming for tenure

From AnnMaria De Mars’ most recent blog post:

Recently, a young person told me that I could hold to my principles about the importance of my family, honesty and equality — and any of a hundred other things because I had “made it”.

This troubled me. It troubles me when I hear the same thing from new Ph.Ds who are trying to get tenure. I don’t see how you can pretend to be someone else for 5 or 10 years until you have “made it” and then be your true self.

Emphasis added.

Open source and pride

Liz Quilty explains how becoming an expert in open source software changed her life.

[I was a] high school dropout, I had no education, I was nobody. I’d made some poor choices, and I think at this point suddenly I was known for my knowledge rather than for my poor choices. And so it was quite good for me that I had some pride. I was like, this is really awesome, I’m really good at this, I really know this.

Source: FLOSS Weekly podcast, starting around 12:00.

A tip on using a French press

When I first bought a French press, the instructions said to pour hot but not boiling water over the coffee. They were emphatic about what the temperature should not be, but vague about what it should be. (Boiling water extracts oils that you’d rather leave in the grounds; water a little cooler brings out just the oils you want.)

I asked someone online — sorry, I don’t remember who — and he said that after your water boils, set the kettle off the burner and check your mail. When you come back, the water should be at the right temperature. That turned out to be good advice.

I don’t know whether he meant postal mail or email, but it doesn’t make much difference. Unless you get caught up replying to email and come back to room-temperature water.

Now if you really wanted to geek out on this, you could use Newton’s law of cooling along with the surface area, thickness, and material composition of your kettle to compute the time to let your water cool to 200 °F (93 °C). You could assume a kettle is a half sphere …
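If you did want to geek out, here is a minimal sketch of the calculation using Newton's law of cooling alone. It ignores the kettle's geometry entirely and lumps everything into a single cooling constant k. The value of k below is a made-up placeholder, not a measurement; you would have to estimate it for your own kettle, for instance by timing one cooling run with a thermometer.

```python
import math

def minutes_to_cool(t_start, t_target, t_room, k):
    """Time for water to cool from t_start to t_target under Newton's law
    of cooling, T(t) = t_room + (t_start - t_room) * exp(-k * t)."""
    return math.log((t_start - t_room) / (t_target - t_room)) / k

def temperature_after(minutes, t_start, t_room, k):
    """Temperature after the given number of minutes, same model."""
    return t_room + (t_start - t_room) * math.exp(-k * minutes)

# Hypothetical values: boiling water at 100 °C, a 20 °C kitchen,
# and an assumed cooling constant of 0.02 per minute.
K = 0.02
wait = minutes_to_cool(100.0, 93.0, 20.0, K)
print(round(wait, 1), "minutes")
```

The answer scales inversely with k, so the whole exercise comes down to how well you know that one number.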

Related post: A childhood question about heat


How variable are percentiles?

Suppose you’re trying to study the distribution of something by simulation. The average of your simulation values gives you an estimate of the mean value of the thing you’re simulating.

Next you want to have an idea how much the thing you’re simulating varies. You estimate the percentiles of your distribution by the percentiles of your samples. How variable are your estimates? For example, you estimate the 90th percentile of your distribution by finding the 90th percentile of your samples. How much is your estimate likely to change if you rerun your simulation with a new random number generator seed?

How does the variability of the percentiles depend on the percentile you’re looking at? For example, you might expect the median, i.e. the 50th percentile, to be the most stable. And as you start looking at more extreme percentiles, say the 5th or 95th, you might expect more variability. But how much more variability? A lot more or a little more?

To explore the questions above, let’s suppose we’re sampling from a standard normal distribution. (Of course if we really were sampling from something so well known as a normal distribution, we wouldn’t be doing simulation. But we need to pick something for illustration, so let’s use something simple.)

Suppose we simulate n values, sort them in increasing order, and take the kth smallest value. This is called the kth order statistic. For example, we might estimate the 90th percentile of our distribution by using n = 1000 and k = 900. The distribution of the kth order statistic from n samples is known analytically (it’s a short derivation; see here) and so we can use it to make some plots.

These plots use n = 1000. Each uses the same horizontal scale, so you can see how the distributions get wider as we move to higher percentiles. The distribution for the 95th percentile is about twice as wide as the distribution for the median.

Each plot is centered at the corresponding quantile for the standard normal distribution. For example, the plot for the distribution of the 75th percentile of the samples is centered at the 75th percentile of the standard normal.

The distribution for the sample median is unbiased, symmetric about 0. The distributions for the 75th and 90th percentiles are slightly biased to the left, i.e. they slightly underestimate the true values on average. The 95th percentile, however, is slightly biased in the opposite direction. [Update: As pointed out in the comments below, it seems the appearance of bias was a plotting artifact.]

If you’d like to see the details of how the plot was made, here is the Python code.

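import numpy as np
import scipy.stats
import matplotlib.pyplot as plt

def pdf(y, n, k):
    # Density of the kth order statistic (kth smallest of n samples)
    # from a standard normal: a Beta(k, n+1-k) density evaluated at
    # Phi(y), times phi(y) from the change of variables.
    Phi = scipy.stats.norm.cdf(y)
    phi = scipy.stats.norm.pdf(y)
    return scipy.stats.beta.pdf(Phi, k, n+1-k)*phi

def plot_percentile(p):
    # Center the plot at the true pth percentile of the standard normal.
    center = scipy.stats.norm.ppf(0.01*p)
    t = np.linspace(center-0.25, center+0.25, 100)
    # With n = 1000 samples, the pth percentile is the (10p)th order statistic.
    plt.plot(t, pdf(t, 1000, 10*p))
    plt.title("{0}th percentile".format(p))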
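The question of how much an estimate moves when you change the seed can also be checked by brute force. The sketch below is not part of the analytic calculation above; it simply reruns a simulation many times using only the Python standard library and reports the standard deviation of each sample percentile. The function name, the number of repetitions, and the seed are arbitrary choices for illustration.

```python
import random
import statistics

def percentile_spread(p, n=1000, reps=200, seed=0):
    """Standard deviation, across repeated simulations, of the sample
    pth percentile of n standard normal draws."""
    rng = random.Random(seed)
    k = n * p // 100                      # e.g. k = 900 for p = 90
    estimates = []
    for _ in range(reps):
        sample = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
        estimates.append(sample[k - 1])   # kth smallest value (k is 1-based)
    return statistics.stdev(estimates)

for p in (50, 75, 90, 95):
    print(p, round(percentile_spread(p), 4))
```

The spread should grow as p moves away from 50, in line with the widening plots. With only 200 repetitions the numbers are themselves noisy, so treat them as a rough check rather than a precise measurement.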




Simmer reading list

One of my friends mentioned his “simmer reading” yesterday. It was a typo—he meant to say “summer”—but a simmer reading list is interesting.

Simmer reading makes me think of a book that stays on your nightstand as other books come and go, like a pot left to simmer on the back burner of a stove. It’s a book you read a little at a time, maybe out of order, not something you’re trying to finish like a project.

What are some of your simmer reading books?

Related post: A book so good I had to put it down

The cult of average

Shawn Achor comments on “the cult of the average” in science.

So one of the very first things we teach people in economics and statistics and business and psychology is how, in a statistically valid way, do we eliminate the weirdos. How do we eliminate the outliers so we can find the line of best fit? Which is fantastic if I’m trying to find out how many Advil the average person should be taking — two. But if I’m interested in potential, if I’m interested in your potential, or for happiness or productivity or energy or creativity, what we’re doing is we’re creating the cult of the average with science. … If we study what is merely average, we will remain merely average.


Chaotic versus random

From John D. Barrow’s chapter in Design and Disorder:

The standard folklore about chaotic systems is that they are unpredictable. They lead to out-of-control dinosaur parks and out-of-work meteorologists. …

Classical … chaotic systems are not in any sense intrinsically random or unpredictable. They merely possess extreme sensitivity to ignorance. Any initial uncertainty in our knowledge of a chaotic system’s state is rapidly amplified in time.

… although they become unpredictable when you try to determine the future from a particular uncertain starting value, there may be a particular stable statistical spread of outcomes after a long time, regardless of how you started out.

Emphasis added.


100x better approach to software?

Alan Kay speculates in this talk that 99% or even 99.9% of the effort that goes into creating a large software system is not productive. Even if the ratio of overhead and redundancy to productive code is not as high as 99 to 1, it must be pretty high.

Note that we’re not talking here about individual programmer productivity. Discussions of 10x or 100x programmers usually assume that these are folks who can use basically the same approach and same tools to get much more work done. Here we’re thinking about better approaches more than better workers.

Say you have a typical system of 10,000,000 lines of code. How many lines of code would a system with the same desired features (not necessarily all the actual features) require?

  1. If the same team of developers had worked from the beginning with a better design?
  2. If the same team of developers could rewrite the system from scratch using what they learned from the original system?
  3. If a thoughtful group of developers could rewrite the system without time pressure?
  4. If a superhuman intelligence could rewrite the system, something approaching Kolmogorov complexity?

Where does the wasted effort in large systems go, and how much could be eliminated? Part of the effort goes into discovering requirements. Large systems never start with a complete and immutable specification. That’s addressed in the first two questions.

I believe Alan Kay is interested in the third question: How much effort is wasted by brute force mediocrity? My impression from watching his talk is that he believes this is the biggest issue, not evolving requirements. He said elsewhere

Most software today is very much like an Egyptian pyramid with millions of bricks piled on top of each other, with no structural integrity, but just done by brute force and thousands of slaves.

There’s a rule that says bureaucracies follow a 3/2 law: to double productivity, you need three times as many employees. If a single developer could produce 10,000 lines of code in some period of time, you’d need to double that 10 times to get to 10,000,000 lines of code. That would require tripling your workforce 10 times, resulting in over 57,000 programmers. That sounds reasonable, maybe even a little optimistic.
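The arithmetic is easy to check. The sketch below assumes the 3/2 rule applies uniformly at every scale, which is of course a caricature. The helper name is made up for illustration. Ten full doublings take 10,000 lines to 10,240,000 and require 3^10 times the staff; aiming for exactly 1000 times the output needs slightly fewer than ten doublings, which is where a figure near 57,000 comes from.

```python
import math

def staff_needed(output_ratio):
    """Headcount multiplier under the 3/2 rule: every doubling of output
    requires tripling the staff."""
    doublings = math.log2(output_ratio)
    return 3 ** doublings

print(3 ** 10)                     # ten full doublings: 59049
print(round(staff_needed(1000)))   # exactly 1000 times the output
```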

Is that just the way things must be, that geniuses are in short supply and ordinary intelligence doesn’t scale efficiently? How much better could we do with available talent if we took a better approach to developing software?


For a daily dose of computer science and related topics, follow @CompSciFact on Twitter.


Random is as random does

What is randomness? Nobody knows, or at least there’s no consensus. Everybody has some vague ideas what randomness is, but when you dig into it deeply enough you find all kinds of philosophical quandaries. If you’d like a taste of the subtleties, you could start by reading one of Gregory Chaitin’s books. Or chew on this tome.

What is a random variable? That’s easy. It’s a measurable function on a probability space. What’s a probability space? Easy too. It’s a measure space such that the measure of the entire space is 1.

Probability theory avoids defining randomness by working with abstractions like random variables. This is actually a very sensible approach and not mere legerdemain. Mathematicians can prove theorems about probability and leave the interpretation of the results to others.

As far as applications are concerned, it often doesn’t matter whether something is random in some metaphysical sense. The right question isn’t “is this system random?” but rather “is it useful to model this system as random?” Many systems that no one believes are random can still be profitably modeled as if they were random.

Probability models are just another class of mathematical models. Modeling deterministic systems using random variables should be no more shocking than, for example, modeling discrete things as continuous. For example, cars come in discrete units, and they certainly are not fluids. But sometimes it’s useful to model the flow of traffic as if it were a fluid. (And sometimes it’s not.)

Random phenomena are studied using computer simulations. And these simulations rely on random number generators, deterministic programs whose output is considered random for practical purposes. This bothers some people who would prefer a “true” source of randomness. Such concerns are usually misplaced. In most cases, replacing a random number generator with some physical source of randomness would not make a detectable difference. The output of the random number generator might even be higher quality since the measurement of the physical source could introduce a bias.
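To make the point about determinism concrete, here is a small sketch using Python's standard random module, whose default generator is the Mersenne Twister. The helper name and seeds are arbitrary. The same seed reproduces the same sequence exactly, which is what lets you rerun a simulation and get identical results.

```python
import random

def draws(seed, count=5):
    """Return the first few outputs of a generator started from seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]

print(draws(42) == draws(42))   # same seed, same "random" sequence
print(draws(42) == draws(43))   # different seed, different sequence
```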


Just what do you mean by "number"?

Tom Christiansen gave an awesome answer to the question of how to match a number with a regular expression. He begins by clarifying what the reader means by “number”, then gives answers for each.

  • Is −0 a number?
  • How do you feel about √−1?
  • Is or a number?
  • Is 186,282.42±0.02 miles/second one number — or is it two or three of them?
  • Is 6.02e23 a number?
  • Is 3.141_592_653_589 a number? How about π, or ? And −2π⁻³ ͥ?
  • How many numbers in 0.083̄?
  • How many numbers in
  • What number does hold? How about ⚂⚃?
  • Does 10,5 mm have one number in it — or does it have two?
  • Is ∛8³ a number — or is it three of them?
  • What number does ↀↀⅮⅭⅭⅬⅫ AUC represent, 2762 or 2009?
  • Are ४५६७ and ৭৮৯৮ numbers?
  • What about 0377, 0xDEADBEEF, and 0b111101101?
  • Is Inf a number? Is NaN?
  • Is ④② a number? What about ?
  • How do you feel about ?
  • What do ℵ₀ and ℵ₁ have to do with numbers? Or , , and ?

See his full response here. Thanks to Bill the Lizard for pointing this out.
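As a small illustration of why the definition matters, here is a sketch of one narrow answer: an optionally signed decimal with an optional exponent. This pattern is my own example, not one taken from Christiansen's response, and the helper name is made up. Note that in Python the \d class matches any Unicode decimal digit unless you pass re.ASCII, so even this simple pattern has to take a position on one of the questions above.

```python
import re

# One possible definition of "number": optional sign, digits with an
# optional decimal point, optional exponent.
PATTERN = r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?"

def is_number(text, ascii_only=False):
    """True if the whole string matches PATTERN.
    With ascii_only=True, \\d is restricted to the digits 0-9."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(PATTERN, text, flags) is not None

for s in ["-0", "6.02e23", "186,282.42", "0xDEADBEEF", "Inf", "४५६७"]:
    print(repr(s), is_number(s), is_number(s, ascii_only=True))
```

Under this definition the first two strings are numbers, the next three are not, and the Devanagari digits are a number only when Unicode digits are allowed.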

For daily tips on regular expressions, follow @RegexTip on Twitter.


Eat, drink, and be merry

Almost every bit of health advice I’ve heard has been contradicted. Should you eat more carbs or fewer carbs? More fat or less fat? Take vitamin supplements or not? It reminds me of this clip from Sleeper in which Woody Allen wakes up after 200 years of suspended animation.

Offhand I can only think of a couple things on which there seems to be near unanimous agreement: smoking is bad for you, and moderate exercise is good for you.

Here are a couple suggestions for evaluating health studies.

Be suspicious of linear extrapolation. It does not follow that because moderate exercise is good for you, extreme exercise is extremely good for you. Nor does it follow that because extreme alcohol consumption is harmful, moderate alcohol consumption is moderately harmful.

Start from a default assumption that something natural or traditional is probably OK. This should not be dogmatic, only a starting point. In statistical terms, it’s a prior distribution informed by historical experience. The more a claim is at odds with nature and tradition, the more evidence it requires. If someone says fresh fruit is bad for you, for example, they need to present more evidence than someone who says a newly synthesized chemical compound is harmful. Extraordinary claims require extraordinary evidence.


Read history and fly an airplane

The “About the Author” page at the end of Programming in Emacs Lisp says

Robert J. Chassell … has an abiding interest in social and economic history and flies his own airplane.

I love the child-like element of that bio. I could just imagine a kid saying “When I grow up, I want to read about history and fly my own airplane!” The bio is more about what the author enjoys than about how he makes his money. Maybe more bios should be like that.

The bio starts out by saying that Chassell speaks about Emacs and software freedom. I thought that was just to establish his bona fides for writing about Emacs Lisp, but his Wikipedia page says he’s a full-time speaker, so perhaps this is how he supports himself. I would not have thought that was possible, but good for him. Apparently he earns his living by talking about something he values.

Update: As suggested in the comments, perhaps Chassell’s livelihood does not come from his speaking. Maybe he has (or had) another career and chose not to include it in his bio. Or maybe he doesn’t need to earn a living from his speaking. In any case, it sounds like he’s doing something he loves and his bio focuses on that.

No Silver Bullet

Mike Swaim gave a presentation today entitled No Silver Bullet, an allusion to Fred Brooks’ classic essay of the same title.

Mike discusses the pros and cons of the following software development techniques:

  • High level languages
  • Object oriented programming
  • Declarative languages
  • Functional programming
  • Data oriented design
  • Metaprogramming
  • Static typing
  • Duck typing
  • Garbage collection
  • Allocating on the stack
  • Tail calls
  • Resource acquisition is initialization
  • Symmetric multitasking
  • Heterogeneous multitasking

There’s a nice collection of links at the end of each section.

Update: Unfortunately the slides are no longer available.

Superheroes of the Round Table

The other day I was browsing the Rice library and ran across a little book called “Superheroes of the Round Table: Comics Connections to Medieval and Renaissance Literature” (ISBN 0786460687). It’s about how literature has influenced comic books, and how comic books shed light on literature.

I don’t know much about comic books, or about medieval and renaissance literature, but it’s fun to see someone draw them together, especially since the former is considered low culture and the latter high culture. It reminds me of Scott McCloud’s book Understanding Comics: The Invisible Art, a serious book about an art form that isn’t often taken seriously.

I’ve only skimmed Superheroes of the Round Table, but it looks like a fun book. It draws connections, for example, between The Faerie Queene and Iron Man and between The Tempest and X-Men.

Just to give a flavor of the book’s analytical style, here is a classification the book gives for Arthurian legend in comic books.

  1. Traditional Tale. Arthur in comic book form with minimal superhero elements.
  2. Arthurian Toybox. Elements of Arthur sprinkled into other stories with no regard for literary context.
  3. Arthur as Translator. A modern superhero is dropped into Arthur’s Britain, like Mark Twain’s Connecticut Yankee.
  4. Arthur as Collaborator. Using Arthurian symbols and themes such as the sword in the stone or the round table.
  5. Arthur Transformed. Arthur placed into a new context.

Related post: Manga guides to science

Random number sequence overlap

Mike Croucher asked the following question on his blog. Suppose you draw M sequences of random numbers of length N from a random number generator. What is the probability that they will overlap?

Assume your random number generator is a cyclical list of p unique integers. Each draw picks a random starting point in the cycle and chooses N consecutive values. We are particularly interested in the case M = 10⁴, N = 10⁹, and p = 2¹⁹⁹³⁷ – 1 (the period of the Mersenne Twister).

We will simplify the problem by over-estimating the probability of overlap. The over-estimated probability will still be extremely small.

Assume all the draws are independent. This greatly increases the probability of repetition because now there is some chance that there are repetitions within each draw. That wasn’t possible before, provided N < p.

The probability of no repetitions in n = MN draws is

\prod_{i=0}^{n-1} \frac{p - i}{p} = \frac{p!}{p^{n}(p - n)!}

We can under-estimate this probability by replacing every term in the product with the smallest term. (By under-estimating the probability of no repetitions, we’re over-estimating the probability of at least one repetition.)

\prod_{i=0}^{n-1} \frac{p - i}{p} > \left(\frac{p - n + 1}{p}\right)^n = \left(1 - \frac{n-1}{p}\right)^n \approx 1 - \frac{n(n-1)}{p}

When n = MN = 10¹³ and p = 2¹⁹⁹³⁷ – 1, the last expression above is approximately 1 – 10²⁶/2¹⁹⁹³⁷. That says the (over-estimated) probability of repetition is 10²⁶/2¹⁹⁹³⁷ or about 2×10⁻⁵⁹⁷⁶, thousands of orders of magnitude smaller than the chances of, say, drawing a royal flush.
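The probability is far too small to represent as a floating point number, but the bound n(n – 1)/p can be evaluated with logarithms. The sketch below does that in Python. It approximates the base-10 log of p by 19937 log 2, ignoring the "– 1", which makes no visible difference at this scale. The variable names are mine.

```python
import math

M = 10**4                  # number of sequences
N = 10**9                  # length of each sequence
n = M * N                  # total number of draws

# log10 of the Mersenne Twister period, ignoring the "- 1"
log10_p = 19937 * math.log10(2)

# log10 of the over-estimated probability of overlap, n(n - 1)/p
log10_bound = math.log10(n) + math.log10(n - 1) - log10_p

# Split into mantissa and exponent for display
exponent = math.floor(log10_bound)
mantissa = 10 ** (log10_bound - exponent)
print(f"{mantissa:.1f}e{exponent}")
```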

Whenever you have such mind-bogglingly small probabilities, some probability outside your model has to be more important than the probability computed with the model. For example, we could question the assumption that the starting points are random and independent. Or we could question whether the random number generator was written correctly, or that the compiler correctly compiled the program, or that the operating system correctly ran the program, etc.
