Offended by conditional probability

It’s a simple rule of probability that if A makes B more likely, B makes A more likely. That is, if the conditional probability of A given B is larger than the probability of A alone, the conditional probability of B given A is larger than the probability of B alone. In symbols,

Prob( A | B ) > Prob( A ) ⇒ Prob( B | A ) > Prob( B ).

The proof is trivial: Apply the definition of conditional probability and observe that if Prob( AB ) / Prob( B ) > Prob( A ), then Prob( AB ) / Prob( A ) > Prob( B ).

Let A be the event that someone was born in Arkansas and let B be the event that this person has been president of the United States. There are five living current and former US presidents, and one of them, Bill Clinton, was born in Arkansas, a state with about 1% of the US population. Knowing that someone has been president increases your estimation of the probability that this person is from Arkansas. Similarly, knowing that someone is from Arkansas should increase your estimation of the chances that this person has been president.

The chances that an American selected at random has been president are very small, but as small as this probability is, it goes up if you know the person is from Arkansas. In fact, it goes up by the same factor as the probability in the other direction. Knowing that someone has been president increases the probability of their being from Arkansas by a factor of 20, so knowing that someone is from Arkansas increases the probability that they have been president by a factor of 20 as well. This is because

Prob( A | B ) / Prob( A ) = Prob( B | A ) / Prob( B ).
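This symmetry is easy to check numerically. Here is a quick Python sketch using round, made-up numbers for illustration (300 million Americans, 1% of them from Arkansas, five living presidents, one from Arkansas), not precise census figures:

```python
# Illustrative round numbers, not actual census data
pop = 300_000_000
p_A = 3_000_000 / pop    # Prob(A): born in Arkansas, about 1%
p_B = 5 / pop            # Prob(B): has been president
p_AB = 1 / pop           # Prob(AB): president born in Arkansas

p_A_given_B = p_AB / p_B   # 1/5
p_B_given_A = p_AB / p_A   # 1/3,000,000

ratio_forward  = p_A_given_B / p_A   # factor of 20
ratio_backward = p_B_given_A / p_B   # also a factor of 20
```

Both ratios come out to 20, as the identity above says they must.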

This isn’t controversial when we’re talking about presidents and where they were born. But it becomes more controversial when we apply the same reasoning, for example, to deciding who should be screened at airports.

When I jokingly said that being an Emacs user makes you a better programmer, it appears a few Vim users got upset. Whether they were serious or not, it does seem that they thought “Hey, what does that say about me? I use Vim. Does that mean I’m a bad programmer?”

Assume for the sake of argument that Emacs users are better programmers, i.e.

Prob( good programmer | Emacs user )  >  Prob( good programmer ).

We’re not assuming that Emacs users are necessarily better programmers, only that a larger proportion of Emacs users are good programmers. And we’re not saying anything about causality, only probability.

Does this imply that being a Vim user lowers your chance of being a good programmer? i.e.

Prob( good programmer | Vim user )  <  Prob( good programmer )?

No, because being a Vim user is a specific alternative to being an Emacs user, and there are programmers who use neither Emacs nor Vim. What the above statement about Emacs would imply is that

Prob( good programmer | not an Emacs user )  <  Prob( good programmer ).

That is, if knowing that someone uses Emacs increases the chances that they are a good programmer, then knowing that they are not an Emacs user does indeed lower the chances that they are a good programmer, if we have no other information. In general

Prob( A | B ) > Prob( A ) ⇒ Prob( A | not B ) < Prob( A ).
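This follows from the law of total probability: Prob(A) is a weighted average of Prob(A | B) and Prob(A | not B), so if one conditional probability is above the average, the other must be below it. A small numerical sketch, with arbitrary numbers chosen only for illustration:

```python
# Arbitrary illustrative numbers
p_B = 0.3              # Prob(B)
p_A_given_B = 0.8      # Prob(A | B), above the average
p_A_given_notB = 0.4   # Prob(A | not B), necessarily below it

# Law of total probability: Prob(A) is a weighted average of the two
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)   # 0.52
```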

To take a more plausible example, suppose that spending four years at MIT obtaining a computer science degree makes you a better programmer. Then knowing that someone has a CS degree from MIT increases the probability that this person is a good programmer. But if that’s true, it must also be true that absent any other information, knowing that someone does not have a CS degree from MIT decreases the probability that this person is a good programmer. If a larger proportion of good programmers come from MIT, then a smaller proportion must not come from MIT.

* * *

This post uses the ideas of information and conditional probability interchangeably. If you’d like to read more on that perspective, I recommend Probability Theory: The Logic of Science by E. T. Jaynes.

Lighten up and be logical

I had a little fun on Twitter this morning. From @UnixToolTip I said

Some of the best programmers use Emacs. Therefore, if you use Emacs, you’ll be a great programmer. #cargocultlogic

and from @CompSciFact I said

Some of the best programmers have beards. Therefore, growing a beard will make you a better programmer. #cargocultlogic

The serious implication behind the joke is that mimicking the superficial characteristics of a good programmer will not make you a good programmer.

Apparently most people thought these were funny, but as usual, some people got bent out of shape. They didn’t realize these were meant to be funny, or at least intentionally illogical, despite the cargo cult hashtag. They thought I was slamming vi(m) or being sexist.

Those who were offended by my humorous logic were not being logical.

Pretend for a moment that the statements above were meant seriously. If I said that using Emacs makes you a great programmer, that doesn’t mean that you can’t be a good programmer unless you use Emacs. Maybe using vi(m) makes you a better programmer too. And if I really believed that growing a beard makes you a better programmer, that doesn’t imply that people who do not grow beards are doomed to mediocrity. Maybe childbirth also makes you a better programmer, even though that option is not available to some. In logic symbols, the statement p ⇒ q does not imply !p ⇒ !q.
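To see the non-implication concretely, here is a tiny truth-table counterexample in Python:

```python
# Material implication: p => q is true unless p is true and q is false
def implies(p, q):
    return (not p) or q

# Counterexample: take p false and q true
p, q = False, True
forward  = implies(p, q)           # p => q holds
contrary = implies(not p, not q)   # but !p => !q fails
```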

I have two suggestions for the Twittersphere:

  1. Lighten up. Don’t take everything so seriously.
  2. If you’re going to play the logic card, be consistent.

Endeavour Selections on Facebook

I started Endeavour Selections on Facebook a little over a year ago for people who want to read the non-technical posts here but who are not so interested in math or computing. The page didn’t take off, so I stopped posting to it. But now there seems to be more interest in it, so I’m giving it another go.

I will post a few other things on the Facebook page besides blog articles, but I’ll keep the math over here.

* * *

Some people have asked about getting blog posts via email. Yes, you can do that. On the right side of the blog, there is a little box where you can enter your email address. Then each morning you’ll get an email message containing the post(s) from the previous day.

Generalized Fourier transforms

How do you take the Fourier transform of a function when the integral that would define its transform doesn’t converge? The answer is similar to how you can differentiate a non-differentiable function: you take a theorem from ordinary functions and make it a definition for generalized functions. I’ll outline how this works below.

Generalized functions are linear functionals on smooth (ordinary) functions. Given an ordinary function f, you can create a generalized function that maps a smooth test function φ to the integral of fφ.

There are other kinds of generalized functions. For example, the Dirac delta “function” is really the generalized function δ that maps φ to φ(0). This is the formalism behind the hand-wavy nonsense about a function infinitely concentrated at 0 and integrating to 1. “Integrating” the product δφ is really applying the linear functional δ to φ.

Now for absolutely integrable functions f and g, we have

\int_{-\infty}^\infty \hat{f} g = \int_{-\infty}^\infty f \hat{g}

In words, the integral of the Fourier transform of f times g equals the integral of f times the Fourier transform of g. This is the theorem we use as motivation for our definition.
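Here is a numerical sanity check of that identity in Python. It assumes the convention f̂(ω) = ∫ f(x) exp(−iωx) dx, under which the transform of exp(−ax²) is √(π/a) exp(−ω²/4a); the particular Gaussians and grid are just convenient choices.

```python
import numpy as np

x = np.linspace(-10, 10, 10001)  # fine grid; Gaussian tails are negligible by |x| = 10
dx = x[1] - x[0]

# f(x) = exp(-x^2) has transform sqrt(pi) exp(-w^2/4)
f    = np.exp(-x**2)
fhat = np.sqrt(np.pi) * np.exp(-x**2 / 4)

# g(x) = exp(-x^2/4) has transform 2 sqrt(pi) exp(-w^2)
g    = np.exp(-x**2 / 4)
ghat = 2 * np.sqrt(np.pi) * np.exp(-x**2)

# Riemann sums approximating the two integrals; both equal pi*sqrt(2)
lhs = np.sum(fhat * g) * dx
rhs = np.sum(f * ghat) * dx
```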

Now suppose f is a function that doesn’t have a classical Fourier transform. We make f into a generalized function and define its Fourier transform as the linear functional that maps a test function φ to the integral of f times the Fourier transform of φ.

More generally, the Fourier transform of a generalized function f is the linear functional that maps a test function φ to the action of f on the Fourier transform of φ.

This allows us to say, for example, that the Fourier transform of the constant function f(x) = 1 is 2πδ, an exercise left for the reader.
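Here is a sketch of that exercise, assuming the convention φ̂(ω) = ∫ φ(x) exp(−iωx) dx; with a different sign or scaling convention the constant 2π changes accordingly. The middle step is Fourier inversion evaluated at 0.

```latex
\langle \widehat{1}, \varphi \rangle
  = \langle 1, \hat{\varphi} \rangle
  = \int_{-\infty}^\infty \hat{\varphi}(\omega) \, d\omega
  = 2\pi \varphi(0)
  = \langle 2\pi \delta, \varphi \rangle
```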

The Heisenberg uncertainty principle for ordinary functions says that the flatter a function is, the more concentrated its Fourier transform, and vice versa. Generalized Fourier transforms take this to an extreme. The Fourier transforms of the flattest functions, i.e. constant functions, are multiples of the most concentrated generalized function, the delta function.

Automatic delimiter sizes in LaTeX

I recently read a math book in which delimiters never adjusted to the size of their content or the level of nesting. This isn’t unusual in articles, but books usually pay more attention to typography.

Here’s a part of an equation from the book:

\varphi^{-1} (\int \varphi(f+g) \,d\mu)

Larger outer parentheses make the equation much easier to read, especially as part of a complex equation. It’s clear at a glance that the function φ⁻¹ applies to the result of the integral.

\varphi^{-1} \left(\int \varphi(f+g) \,d\mu\right)

The first equation was typeset using

    \varphi^{-1} ( \int \varphi(f+g) \,d\mu )

The latter used \left and \right to tell LaTeX that the parentheses should grow to match the size of the content between them.

    \varphi^{-1} \left( \int \varphi(f+g) \,d\mu \right)

You can use \left and \right with more delimiters than just parentheses: braces, brackets, ceiling, floor, etc. And the left and right delimiters do not need to match. You could make a half-open interval, for example, with \left( on one side and \right] on the other.
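For example, a half-open interval could be typeset as

```latex
\left( \frac{1}{2}, \frac{3}{2} \right]
```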

For every \left delimiter there must be a corresponding \right delimiter. However, you can make one of the pair empty by using a period as its mate. For example, you could start an expression with \left[ and end it with \right. which would create a left bracket as tall as the tallest thing between that bracket and the corresponding \right. command. Note that \right. causes nothing to be displayed, not even a period.

The most common example of a delimiter with no mate may be a curly brace on the left with no matching brace on the right. In that case you’d need to open with \left\{. The backslash in front of the brace is necessary to tell LaTeX that you want a literal brace and that you’re not just using the brace for grouping.
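For example, here is a typical piecewise definition typeset with a lone left brace and an invisible right delimiter:

```latex
|x| = \left\{
  \begin{array}{rl}
     x & \mbox{if } x \geq 0 \\
    -x & \mbox{if } x < 0
  \end{array}
\right.
```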

Differentiating bananas and co-bananas

I saw a tweet this morning from Patrick Honner pointing to a blog post asking how you might teach derivatives of sines and cosines differently.

One thing I think deserves more emphasis is that “co” in cosine etc. stands for “complement” as in complementary angles. The cosine of an angle is the sine of the complementary angle. For any function f(x), its complement is the function f(π/2 – x).

When memorizing a table of trig functions and their derivatives, students notice a pattern. You can turn one formula into another by replacing every function with its co-function and adding a negative sign on one side. For example,

(d/dx) tan(x) = sec²(x)

and so

(d/dx) cot(x) = –csc²(x)

In words, the derivative of tangent is secant squared, and the derivative of cotangent is negative cosecant squared.

The explanation of this pattern has nothing to do with trig functions per se. It’s just the chain rule applied to f(π/2 – x).

(d/dx) f(π/2 – x) = –f′(π/2 – x).
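The pattern is easy to verify symbolically. This SymPy sketch checks the tangent/cotangent pair:

```python
import sympy as sp

x = sp.symbols('x')

# cot is the co-function of tan: cot(x) = tan(pi/2 - x)
cofunction_check = sp.simplify(sp.cot(x) - sp.tan(sp.pi / 2 - x))

# d/dx tan(x) = sec(x)^2
d_tan = sp.simplify(sp.diff(sp.tan(x), x) - sp.sec(x)**2)

# d/dx cot(x) = -csc(x)^2: the same formula with co-functions and a sign flip
d_cot = sp.simplify(sp.diff(sp.cot(x), x) + sp.csc(x)**2)
```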

Suppose you have some function banana(x) and its derivative is kiwi(x). Then the cobanana function is banana(π/2 – x), the cokiwi function [1] is kiwi(π/2 – x), and the derivative of cobanana(x) is –cokiwi(x). In trig-like notation

(d/dx) ban(x) = kiw(x)

implies

(d/dx) cob(x) = – cok(x).

Now what is unique to sines and cosines is that the second derivative gives you the negative of what you started with. That is, the sine and cosine functions satisfy the differential equation y″ = –y. That doesn’t necessarily happen with bananas and kiwis. If the derivative of banana is kiwi, that doesn’t imply that the derivative of kiwi is negative banana. If the derivative of kiwi is negative banana, then kiwis and bananas must be linear combinations of sines and cosines because all solutions to y″ = –y have the form a sin(x) + b cos(x).
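As a quick check, SymPy’s ODE solver confirms the form of the general solution:

```python
import sympy as sp

x = sp.symbols('x')
y = sp.Function('y')

# General solution of y'' = -y is C1*sin(x) + C2*cos(x)
sol = sp.dsolve(sp.Eq(y(x).diff(x, 2), -y(x)), y(x))
```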


[1] Authors are divided over whether the cokiwi function should be abbreviated cok or ckw.

Pretty squiggles

Here’s an image that came out of something I was working on this morning. I thought it might make an interesting border somewhere.

The blue line is sin(x), the green line 0.7 sin(φ x), and the red line is their sum. Here φ is the golden ratio (1 + √5)/2. Even though the blue and green curves are both periodic, their sum is not because the ratio of their frequencies is irrational. So you could make this image as long as you’d like and the red curve would never exactly repeat.
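Here is a short Python sketch of the curves behind the image; the plotting itself is left to whatever library you prefer.

```python
import numpy as np

phi = (1 + np.sqrt(5)) / 2      # golden ratio, about 1.618

x = np.linspace(0, 40, 4000)
blue  = np.sin(x)               # period 2*pi
green = 0.7 * np.sin(phi * x)   # period 2*pi/phi
red   = blue + green            # never exactly repeats: phi is irrational

# To plot, e.g.:
#   import matplotlib.pyplot as plt
#   for curve in (blue, green, red):
#       plt.plot(x, curve)
#   plt.show()
```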

Update: See Almost periodic functions

Visualization, modeling, and surprises

This afternoon Hadley Wickham gave a great talk on data analysis. Here’s a paraphrase of something profound he said.

Visualization can surprise you, but it doesn’t scale well.
Modeling scales well, but it can’t surprise you.

Visualization can show you something in your data that you didn’t expect. But some things are hard to see, and visualization is a slow, human process.

Modeling might tell you something slightly unexpected, but your choice of model restricts what you’re going to find once you’ve fit it.

So you iterate. Visualization suggests a model, and then you use your model to factor out some feature of the data. Then you visualize again.


Overconfidence pays

From Thinking, Fast and Slow:

Experts who acknowledge the full extent of their ignorance may expect to be replaced by more confident competitors who are better able to gain the trust of clients.

I believe Hanlon’s razor applies here: ignorance is a better explanation than dishonesty. I imagine most overconfident predictions are sincere. Unfortunately, sincere ignorance is often rewarded.


Randomized studies of productivity

A couple days ago I wrote a blog post quoting Cal Newport suggesting that four hours of intense concentration a day is as much as anyone can sustain. That post struck a chord and has gotten a lot of buzz on Hacker News and Twitter. Most of the feedback has been agreement, but a few people have complained that this four-hour limit is based only on anecdotes, not scientific data.

Realistic scientific studies of productivity are often not feasible. For example, people often claim that programming language X makes them more productive than language Y. How could you conduct a study where you randomly assign someone a programming language to use for a career? You could do some rinky-dink study where you have 30 CS students do an artificial assignment using language X and 30 using Y. But that’s not the same thing, not by a long shot.

If someone, for example Rich Hickey, says that he’s three times more productive using one language than another, you can’t test that assertion scientifically. But what you can do is ask whether you think you are similar to that person and whether you work on similar problems. If so, maybe you should give their recommended language a try.

Suppose you wanted to test whether people are more productive when they concentrate intensely for two hours in the morning and two hours in the afternoon. You couldn’t just randomize people to such a schedule. That would be like randomizing some people to run a four-minute mile. Many people are not capable of such concentration. They either lack the mental stamina or the opportunity to choose how they work. So you’d have to start with people who have the stamina and opportunity to work the way you want to test. Then you’d randomize some of these people to working longer, fractured work days. Is that even possible? How would you keep people from concentrating? Harrison Bergeron anyone? And if it’s possible, would it be ethical?

Real anecdotal evidence is sometimes more valuable than artificial scientific data. As Tukey said, it’s better to solve the right problem the wrong way than to solve the wrong problem the right way.
