Second languages and selection bias

Posted on 7 October 2015 by John

When I was growing up, I was told that you could never become fluent in a second language, and I believed it. I had no reason not to. I didn’t know anybody who had become fluent at a second language, and I could think of plenty of people who had learned English as a second language but who weren’t fluent.

But how would I know if someone had learned English fluently? If they were fluent, I’d assume they were native speakers. I knew people had learned English as a second language because they weren’t fluent. This is selection bias, where the selection of data you see is influenced by the very thing you’re interested in.

A famous example of selection bias is that of British bombers returning from missions over Germany in World War II. Statistician Abraham Wald advised the RAF to add armor precisely where these bombers were not shot, reasoning that he was only able to inspect bombers that survived their missions. More on this story here.

When I was in college, I had a roommate who had learned Spanish fluently as a second language. I thought “That’s not possible!” though of course it is possible. I immediately began to wonder what else that I thought was impossible was merely difficulty.

Number of digits in n!

Posted on 6 October 2015 by John

The other day I ran across the fact that 23! has 23 digits. That made me wonder how often n! has n digits.

There can only be a finite number of cases, because n! grows faster than 10ⁿ for n > 10, and it’s reasonable to guess that 23 might be the largest case. Turns out it’s not, but it’s close. The only cases where n! has n digits are 1, 22, 23, and 24. Once you’ve found these by brute force, it’s not hard to show that they must be the only ones because of the growth rate of n!.

Is there a convenient way to find the number of digits in n! without having to compute n! itself? Sure. For starters, the number of digits in the base 10 representation of a number x is

⌊ log₁₀ x ⌋ + 1.

where ⌊ z ⌋ is the floor of z, the largest integer less than or equal to z. The log of the factorial function is easier to compute than the factorial itself because it won’t overflow. You’re more likely to find a function to compute the log of the gamma function than the log of factorial, and more likely to find software that uses natural logs than logs base 10. So in Python, for example, you could compute the number of digits with this:

from scipy.special import gammaln
from math import log, floor

def digits_in_factorial(n):
    return floor( gammaln(n+1)/log(10.0) ) + 1

What about a more elementary formula, one that doesn’t use the gamma function? If you use Stirling’s approximation for factorial and take log of that you should at least get a good approximation. Here it is again in Python:

from math import log, floor, pi

def stirling(n):
    return floor( ((n+0.5)*log(n) - n + 0.5*log(2*pi))/log(10) ) + 1

The code above is exact for every n > 2 as far as I’ve tested, up to n = 1,000,000. (Note that one million factorial is an extremely large number. It has 5,565,709 digits. And yet we can easily say something about this number, namely how many digits it has!)

The code may break down somewhere because the error in Stirling’s approximation or the limitations of floating point arithmetic. Stirling’s approximation gets more accurate as n increases, but it’s conceivable that a factorial value could be so close to a power of 10 that the approximation error pushes it from one side of the power of 10 to the other. Maybe that’s not possible and someone could prove that it’s not possible.

You could extend the code above to optionally take another base besides 10.

def digits_in_factorial(n, b=10):
    return floor( gammaln(n+1)/log(b) ) + 1

def stirling(n, b=10):
    return floor( ((n+0.5)*log(n) - n + 0.5*log(2*pi))/log(b) ) + 1

The code using Stirling’s approximation still works for all n > 2, even for b as small as 2. This is slightly surprising since the number of bits in a number is more detailed information than the number of decimal digits.

Data analysis vs statistics

Posted on 3 October 2015 by John

John Tukey preferred the term “data analysis” over “statistics.” In his paper Data Analysis, Computation and Mathematics, he explains why.

My title speaks of “data analysis” not “statistics”, and of “computation” not “computing science”; it does not speak of “mathematics”, but only last. Why? …

My brother-in-squared-law, Francis J. Anscombe has commented on my use of “data analysis” in the following words:

Whereas the content of Tukey’s remarks is always worth pondering, some of his terminology is hard to take. He seems to identify “statistics” with the grotesque phenomenon generally known as “mathematical statistics”, and finds it necessary to replace “statistical analysis” with “data analysis.”

(Tukey calls Anscombe his “brother-in-squared-law” because Anscombe was a fellow statistician as well as his brother-in-law. At first I thought Tukey had said “brother-in-law-squared”, which could mean his brother-in-law’s brother-in-law, but I suppose it was a pun on the role of least-square methods in statistics.)

Tukey later says

I … shall stick to this attitude today, and shall continue to use the words “data analysis”, in part to indicate that we can take probability seriously, or leave it alone, as may from time to time be appropriate or necessary.

It seems Tukey was reserving the term “statistics” for that portion of data analysis which is rigorously based on probability.

Technical arbitrage

Posted on 1 October 2015 by John

There are huge opportunities to take technology that is well-known and undervalued in one context and apply it in another where it is unknown but valuable. You could call this technical arbitrage, analogous to financial arbitrage, taking advantage of the price difference of something in two markets.

As with financial arbitrage, the hard part is spotting opportunities, not necessarily acting on them. If you want to be a hero with regular expressions, as in the xkcd cartoon below, the key isn’t knowing regular expressions well. The key is knowing about regular expressions in a context where no one else does.

To spot a technical arbitrage opportunity, you need to know what that technology can and cannot (easily) do. You also need to recognize situations where the technology can help. You don’t need to be able to stand up to a quiz show style oral exam.

Related post: You can be a hero with a simple idea