Simulating seashells

In 1838, Rev. Henry Moseley discovered that a large number of mollusk shells and other shells can be described using three parameters: k, T, and D.

simulated seashell

First imagine a thin wire running through the coil of the shell. In cylindrical coordinates, this wire follows the parameterization

r = exp(kθ)
z = Tr

If T = 0 this is a logarithmic spiral in the (r, θ) plane. For positive T, the spiral is stretched so that its vertical position is proportional to its radius.

Next we build a shell by putting a tube around this imaginary wire. The radius R of the tube at each point is proportional to the r coordinate: R = Dr.

The image above was created with k = 0.1, T = 2.791, and D = 0.8845, using Øyvind Hammer’s seashell generating software. You can download Hammer’s software for Windows and experiment with your own shell simulations by adjusting the parameters.
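
Here is a minimal sketch (my own, not Hammer’s program) that plots only the wire running through the shell, using the parameterization above and the same k and T values. The number of turns and the plotting details are arbitrary choices.

      import numpy as np
      import matplotlib.pyplot as plt

      k, T = 0.1, 2.791                 # parameters from the example above
      theta = np.linspace(0, 12*np.pi, 2000)

      r = np.exp(k*theta)               # logarithmic spiral in the (r, θ) plane
      x = r*np.cos(theta)
      y = r*np.sin(theta)
      z = T*r                           # height proportional to radius

      ax = plt.figure().add_subplot(projection="3d")
      ax.plot(x, y, z)
      plt.show()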

See also Hammer’s book and YouTube video.

Recreating the Vertigo poster

In his new book The Perfect Shape, Øyvind Hammer shows how to create a graph something like the poster for Alfred Hitchcock’s movie Vertigo.

Poster from Hitchcock's Vertigo

Hammer’s code uses a statistical language called Past that I’d never heard of. Here’s my interpretation of his code using Python.

      import matplotlib.pyplot as plt
      from numpy import arange, sin, cos, exp

      # An exponentially decaying oscillation in x and y...
      i  = arange(5000)
      x1 = 1.0*cos(i/10.0)*exp(-i/2500.0)
      y1 = 1.4*sin(i/10.0)*exp(-i/2500.0)

      # ...slowly rotated to produce the overlapping spiral pattern
      d  = 450.0
      vx = cos(i/d)*x1 - sin(i/d)*y1
      vy = sin(i/d)*x1 + cos(i/d)*y1

      plt.plot(vx, vy, "k")

      # Stretch the aspect ratio so the figure is not too narrow
      h = max(vy) - min(vy)
      w = max(vx) - min(vx)
      plt.gca().set_aspect(w/h)
      plt.show()

This code produces what’s called a harmonograph, related to the motion of a pendulum free to move in x and y directions:

Harmonograph

It’s not exactly the same as the movie poster, but it’s definitely similar. If you find a way to modify the code to make it closer to the poster, leave a comment below.

Inverse Fibonacci numbers

As with the previous post, this post is a spinoff of a blog post by Brian Hayes. He considers the problem of determining whether a number n is a Fibonacci number and links to a paper by Gessel that gives a very simple solution: A positive integer n is a Fibonacci number if and only if either 5n² – 4 or 5n² + 4 is a perfect square.

If we know n is a Fibonacci number, how can we tell which one it is? That is, if n = F_m, how can we find m?

For large m, F_m is approximately φ^m / √5, and the error decreases exponentially with m. Taking logs and solving for m gives m ≈ (log F_m + ½ log 5) / log φ, which we can round to the nearest integer.

We can illustrate this with SymPy. First, let’s get a Fibonacci number.

      >>> from sympy import *
      >>> F = fibonacci(100)
      >>> F
      354224848179261915075

Now let’s forget that we know F is a Fibonacci number and test whether it is one.

      >>> sqrt(5*F**2 - 4)
      sqrt(627376215338105766356982006981782561278121)

Apparently 5F² – 4 is not a perfect square. Now let’s try 5F² + 4.

      >>> sqrt(5*F**2 + 4)
      792070839848372253127

Bingo! Now that we know it’s a Fibonacci number, which one is it?

      >>> N((0.5*log(5) + log(F))/log(GoldenRatio), 10)
      100.0000000

So F must be the 100th Fibonacci number.

It looks like our approximation gave an exact result, but if we ask for more precision we see that it did not.

      >>> N((0.5*log(5) + log(F))/log(GoldenRatio), 50)
      99.999999999999999999999999999999999999999996687654
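
Putting the pieces together, here is a small helper function, my own packaging of the ideas above rather than anything from the original source, that combines Gessel’s test with the log formula.

      from sympy import sqrt, log, floor, GoldenRatio, Integer, N

      def which_fibonacci(n):
          """Return m such that n = F_m, or None if n is not a Fibonacci number."""
          n = Integer(n)
          # Gessel's test: n is Fibonacci iff 5n^2 - 4 or 5n^2 + 4 is a perfect square
          if not (sqrt(5*n**2 - 4).is_integer or sqrt(5*n**2 + 4).is_integer):
              return None
          # Invert the approximation F_m ~ phi^m / sqrt(5), then round to nearest integer
          m = (log(n) + log(5)/2) / log(GoldenRatio)
          return int(floor(N(m, 50) + Integer(1)/2))

      print(which_fibonacci(354224848179261915075))   # 100
      print(which_fibonacci(10))                      # None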


Approximate inverse of the gamma function

The other day I ran across a blog post by Brian Hayes that linked to an article by David Cantrell on how to compute the inverse of the gamma function. Cantrell gives an approximation in terms of the Lambert W function.

In this post we’ll write a little Python code to kick the tires on Cantrell’s approximation. The post also illustrates how to do some common tasks using SciPy and matplotlib.

Here are the imports we’ll need.

      import matplotlib.pyplot as plt
      from numpy import pi, e, sqrt, log, linspace   # older SciPy re-exported these from NumPy
      from scipy.special import lambertw, gamma, psi
      from scipy.optimize import root

First of all, the gamma function has a local minimum at a point k somewhere between 1 and 2, and so it only makes sense to speak of its inverse to the left or right of this point. Gamma is strictly increasing for real values larger than k.

To find k we look for where the derivative of gamma is zero. It’s more common to work with the derivative of the logarithm of the gamma function than the derivative of the gamma function itself. That works just as well because gamma has a minimum where its log has a minimum. The derivative of the log of the gamma function is called ψ and is implemented in SciPy as scipy.special.psi. We use the function scipy.optimize.root to find where ψ is zero.

The root function returns more information than just the root we’re after. The root(s) are returned in the array x, and in our case there’s only one root, so we take the first element of the array:

      k = root(psi, 1.46).x[0]

Now here is Cantrell’s algorithm:

      # Cantrell's approximation to the inverse of the gamma function
      c = sqrt(2*pi)/e - gamma(k)

      def L(x):
          return log((x + c)/sqrt(2*pi))

      def W(x):
          # lambertw returns a complex value; its imaginary part is zero here
          return lambertw(x).real

      def AIG(x):
          return L(x) / W( L(x) / e) + 0.5

Cantrell uses AIG for Approximate Inverse Gamma.

How well does this algorithm work? For starters, we’ll see how well it does on a round trip, following the exact gamma with the approximate inverse.

      x = linspace(5, 30, 100)
      plt.plot(x, AIG(gamma(x)))
      plt.show()

This produces the following plot:

We get a straight line, as we should, so next we do a more demanding test. We’ll look at the absolute error in the approximate inverse. We’ll use a log scale on the x-axis since gamma values get large quickly.

      y = gamma(x)
      plt.plot(y, x - AIG(y))   # absolute error in the approximate inverse
      plt.xscale("log")
      plt.show()

This shows the approximation error is small, and gets smaller as its argument increases.

Cantrell’s algorithm is based on an asymptotic approximation, so it’s not surprising that it improves for large arguments.


How efficient is Morse code?

telegraph

Morse code was designed so that the most frequently used letters have the shortest codes. In general, code length increases as frequency decreases.

How efficient is Morse code? We’ll compare letter frequencies based on Google’s research with the length of each code, and make the standard assumption that a dash is three times as long as a dot.

|--------+------+--------+-----------|
| Letter | Code | Length | Frequency |
|--------+------+--------+-----------|
| E      | .    |      1 |    12.49% |
| T      | -    |      3 |     9.28% |
| A      | .-   |      4 |     8.04% |
| O      | ---  |      9 |     7.64% |
| I      | ..   |      2 |     7.57% |
| N      | -.   |      4 |     7.23% |
| S      | ...  |      3 |     6.51% |
| R      | .-.  |      5 |     6.28% |
| H      | .... |      4 |     5.05% |
| L      | .-.. |      6 |     4.07% |
| D      | -..  |      5 |     3.82% |
| C      | -.-. |      8 |     3.34% |
| U      | ..-  |      5 |     2.73% |
| M      | --   |      6 |     2.51% |
| F      | ..-. |      6 |     2.40% |
| P      | .--. |      8 |     2.14% |
| G      | --.  |      7 |     1.87% |
| W      | .--  |      7 |     1.68% |
| Y      | -.-- |     10 |     1.66% |
| B      | -... |      6 |     1.48% |
| V      | ...- |      6 |     1.05% |
| K      | -.-  |      7 |     0.54% |
| X      | -..- |      8 |     0.23% |
| J      | .--- |     10 |     0.16% |
| Q      | --.- |     10 |     0.12% |
| Z      | --.. |      8 |     0.09% |
|--------+------+--------+-----------|

There’s room for improvement. Assigning the letter O such a long code, for example, was clearly not optimal.

But how much difference does it make? If we were to rearrange the codes so that they corresponded to letter frequency, how much shorter would a typical text transmission be?

Multiplying the code lengths by their frequency, we find that an average letter, weighted by frequency, has code length 4.5268.

What if we rearranged the codes? Then we would get 4.1257 which would be about 9% more efficient. To put it another way, Morse code achieved 91% of the efficiency that it could have achieved with the same codes. This is relative to Google’s English corpus. A different corpus would give slightly different results.
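
Here is a minimal sketch of that calculation, using the lengths and frequencies from the table above. It computes the frequency-weighted average code length and the best possible value obtained by giving the shortest codes to the most frequent letters.

      # Code lengths and letter frequencies (percent) from the table above
      lengths = {'E': 1, 'T': 3, 'A': 4, 'O': 9, 'I': 2, 'N': 4, 'S': 3,
                 'R': 5, 'H': 4, 'L': 6, 'D': 5, 'C': 8, 'U': 5, 'M': 6,
                 'F': 6, 'P': 8, 'G': 7, 'W': 7, 'Y': 10, 'B': 6, 'V': 6,
                 'K': 7, 'X': 8, 'J': 10, 'Q': 10, 'Z': 8}
      freqs = {'E': 12.49, 'T': 9.28, 'A': 8.04, 'O': 7.64, 'I': 7.57,
               'N': 7.23, 'S': 6.51, 'R': 6.28, 'H': 5.05, 'L': 4.07,
               'D': 3.82, 'C': 3.34, 'U': 2.73, 'M': 2.51, 'F': 2.40,
               'P': 2.14, 'G': 1.87, 'W': 1.68, 'Y': 1.66, 'B': 1.48,
               'V': 1.05, 'K': 0.54, 'X': 0.23, 'J': 0.16, 'Q': 0.12,
               'Z': 0.09}

      # Average code length weighted by letter frequency
      actual = sum(lengths[c]*freqs[c] for c in lengths) / 100

      # Best possible assignment: shortest codes go to the most frequent letters
      best = sum(l*f for l, f in zip(sorted(lengths.values()),
                                     sorted(freqs.values(), reverse=True))) / 100

      print(actual, best)   # 4.5268 and 4.1257, as above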

Toward the bottom of the table above, letter frequencies correspond poorly to code lengths, though this hardly matters for efficiency. But some of the choices near the top of the table are puzzling. The relative frequency of the first few letters has remained stable over time and was well known long before Google. (See ETAOIN SHRDLU.) Maybe there were factors other than efficiency that influenced how the most frequently used characters were encoded.

Update: Some sources I looked at said that a dash is three times as long as a dot, including the space between dots or dashes. Others said there is a pause as long as a dot between elements. If you use the latter timing, it takes an average time equal to 6.0054 dots to transmit an English letter, and this could be improved to 5.6616. By that measure Morse code is about 93.5% efficient. (I only added time for space inside the code for a letter because the space between letters is the same no matter how they are coded.)

Tidying up trivial details

The following quote gives a good description of the value of abstract mathematics. The quote speaks specifically of “universal algebra,” but consistent with the spirit of the quote you could generalize it to other areas of mathematics, especially areas such as category theory.

Universal algebra is the study of features common to familiar algebraic systems … [It] places the algebraic notions in their proper setting; it often reveals connexions between seemingly different concepts and helps to systemize one’s thoughts. … [T]his approach does not usually solve the whole problem for us, but only tidies up a mass of rather trivial detail, allowing us to concentrate our powers on the hard core of the problem.

Emphasis added. Source: Universal Algebra by P. M. Cohn

Related: Applied category theory

Measuring graph robustness

There are a couple ways to measure how well a graph remains connected when nodes are removed. One way is to consider nodes dropping out randomly. Another way, the one we look at here, assumes an adversary is trying to remove the most well-connected nodes. This approach was defined by Schneider et al [1]. It is also described, a little more clearly, in [2].

The robustness of a graph is defined as

R = \frac{1}{N} \sum_{q=1}^{N-1} S(q)

Then [2] explains that

N is the total number of nodes in the initial network and S(q) is the relative size of the largest connected component after removing q nodes with the highest degrees. The normalization factor 1/N ensures 1/N ≤ R ≤ 0.5. Two special cases exist: R = 1/N corresponds to star-like networks while R = 0.5 represents the case of the fully connected network.

Apparently “relative size” means relative to the size of the original network, not to the network with q nodes removed.

There is one difference between [1] and [2]. The former defines the sum up to N and the latter to N-1. But this doesn’t matter because S(N) = 0: when you remove N nodes from a graph with N nodes, there’s nothing left!

I have a couple reservations. One is that it’s not clear whether R is well defined.

Consider the phrase “removing q nodes with the highest degrees.” What if you have a choice of nodes with the highest degree? Maybe all nodes have degree no more than n and several have degree exactly n. It seems your choice of node to remove matters. Removing one node of degree n may give you a very different network than removing another node of the same degree.

Along the same lines, it’s not clear whether it matters how one removes more than one node. For S(2), for example, does it matter whether I remove the two most connected nodes from the original graph at once, or remove the most connected node first, then the most connected node from what remains?

My second reservation is that I don’t think the formula above gives exactly the extreme values it says. Start with a star graph. Once you remove the center node, there are only isolated points left, and each point is 1/N of the original graph. So all the S(q) are equal to 1/N. Then R equals (N – 1)/N², which is less than 1/N.

With the complete graph, removing a point leaves a complete graph of lower order. So S(q) = (N – q)/N. Then R equals (N – 1)/2N, which is less than 1/2.
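
Here is a minimal sketch of how one might compute R with networkx, under one particular convention (my assumption, given the ambiguities above): remove one highest-degree node at a time, breaking ties arbitrarily, and recompute degrees after each removal.

      import networkx as nx

      def robustness(G):
          """R = (1/N) * sum of S(q), q = 1, ..., N-1, removing one
          highest-degree node at a time and recomputing degrees."""
          H = G.copy()
          N = G.number_of_nodes()
          total = 0.0
          for q in range(1, N):
              node = max(H.degree, key=lambda t: t[1])[0]    # a highest-degree node
              H.remove_node(node)
              largest = max(nx.connected_components(H), key=len)
              total += len(largest) / N    # size relative to the original network
          return total / N

      print(robustness(nx.star_graph(9)))       # (N-1)/N^2 = 0.09 for N = 10
      print(robustness(nx.complete_graph(10)))  # (N-1)/2N  = 0.45 for N = 10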

* * *

[1] C. M. Schneider et al., Mitigation of malicious attacks on networks. Proc. Natl. Acad. Sci., vol. 108, no. 10, March 8, 2011.

[2] Big Data of Complex Networks. Matthias Dehmer et al., editors. CRC Press.

Data-driven charity

In this post I interview GiveDirectly co-founder Paul Niehaus about charitable direct cash transfers and their empirical approach to charity.

Paul Niehaus of GiveDirectly

JC: Can you start off by telling us a little bit about GiveDirectly, and what you do?

PN: GiveDirectly is the first nonprofit that lets individual donors like you and me send money directly to the extreme poor. And that’s it—we don’t buy them things we think they need, or tell them what they should be doing, or how they should be doing it. Michael Faye and I co-founded GD, along with Jeremy Shapiro and Rohit Wanchoo, because on net we felt (and still feel) the poor have a stronger track record putting money to use than most of the intermediaries and experts who want to spend it for them.

JC: What are common objections you brush up against, and how do you respond?

PN: We’ve all heard and to some extent internalized a lot of negative stereotypes about the extreme poor—you can’t just give them money, they’ll blow it on alcohol, they won’t work as hard, etc. And it’s only in the last decade or so with the advent of experimental testing that we’ve built a broad evidence base showing that in fact quite the opposite is the case—in study after study the poor have used money sensibly, and if anything drank less and worked more. So to us it’s simply a question of catching folks up on the data.

JC: Why do you think randomized controlled trials are emerging in development economics just in the past decade or so, when they have been the gold standard in other areas for much longer?

PN: I agree that experimental testing in development is long overdue. And to be blunt, I think it came late because we worry more about getting real results when we’re helping ourselves than we do when we’re helping others. When it comes to helping others, we get our serotonin from believing we’re making a difference, not the actual difference we make (which we may never find out, for example when we give to charities overseas). And so it’s tempting to succumb to wishful thinking rather than rigorous testing.

JC: What considerations went into the design of your pending basic income trial? What would you have loved to do differently methodologically if you had 10X the budget? 100X?

PN: This experiment is all about scale, in a couple of ways. First, there have been some great basic income pilots in the past, but they haven’t committed to supporting people for more than a few years. That’s important because a big argument the “pro” camp makes is that guaranteeing long-term economic security will free people up to take risks, be more creative, etc.—and a big worry the “con” camp raises is that it will cause people to stop trying. So it was important to commit to support over a long period. We’re doing over a decade—12 years—and with more funding we’d go even longer.

Second, it’s important to test this by randomizing at the community level, not just the individual level. That’s because a lot of the debate over basic income is about how community interactions will change (vs purely individual behavior). So we’re enrolling entire villages—and with more funding, we could make that entire counties, etc. That lets you start to understand impacts on community cohesion, social capital, the macroeconomy, etc.

JC: In what ways do you think math has served as a good or poor guide for development economics over the years?

PN: I think the far more important question is why has math—and in particular statistics—played such a small role in development decision-making, while “success stories” and “theories of change” have played such large ones.

JC: Can you say something about the efficiency of GiveDirectly?

PN: What we’ve tried to do at GD is, first, be very clear about our marginal cost structure—typically around 90% in the hands of the poor, 10% on costs of enrolling them and delivering funds; and second, provide evidence on how these transfers affect a wide range of outcomes and let donors judge for themselves how valuable those outcomes are.

JC: What is your vision for a methodologically sound poverty reduction research program? What are the main pitfalls and challenges you see?

PN: First, we need to run experiments at larger scales. Testing new ideas in a few villages, run by an NGO, is a great start, but it’s not always an accurate guide to how an intervention will perform when a government tries to deliver it nation-wide, or how doing something at that scale will affect the broader economy (what we call “general equilibrium effects”). I’ve written about this recently with Karthik Muralidharan based on some of our recent experiences running large-scale evaluations in India.

Second, we need to measure value created for the poor. RCTs tell us how an intervention changes “outcomes,” but not how valuable those outcomes are. That’s fine if you want to assign your own values to outcomes—I could be an education guy, say, and care only about years of formal schooling. But if we care at all about the values and priorities of the poor themselves, we need a different approach. One simple step is to ask people how much money an intervention is worth to them—what economists call their “willingness to pay.” If we’re spending $100 on a program, we’d hope it’s worth at least that much to the beneficiary. If not, that begs the question of why we don’t just give them the money.

JC: What can people do to help?

PN: Lots of things. Here are a few:

  1. Set up a recurring donation, preferably to the basic income project. Worst case scenario your money will make life much better for someone in extreme poverty; best case, it will also generate evidence that redefines anti-poverty policy.
  2. Follow ten recipients on GDLive. Share things they say that you find interesting. Give us feedback on the experience (which is very beta).
  3. Ask five friends whether they give money to poor people. Find out what they think and why. Share the evidence and information we’ve published and then give us feedback—what was helpful? What was missing?
  4. Ask other charities to publish the experimental evidence on their interventions prominently on their websites, and to explain why they are confident that they can add more value for the poor by spending money on their behalf than the poor could create for themselves if they had the money. Some do! But we need to create a world where simply publishing a few “success stories” doesn’t cut it any more.

Related post: Interview with Food for the Hungry CIO

Big data and the law

computer law

Excerpt from the new book Big Data of Complex Networks:

Big Data and data protection law provide for a number of mutual conflicts: from the perspective of Big Data analytics, a strict application of data protection law as we know it today would set an immediate end to most Big Data applications. From the perspective of the law, Big Data is either a big threat … or a major challenge for international and national lawmakers to adapt today’s data protection laws to the latest technological and economic developments.

Emphasis added.

The author of the chapter on legal matters is Swiss and writes primarily in a European context, though all countries face similar problems.

I’m not a lawyer, though I sometimes work with lawyers, and sometimes help companies with the statistical aspects of HIPAA law. But as a layman the observation above sounds reasonable to me, that strict application of the law could bring many applications to a halt, for better and for worse.

In my opinion the regulations around HIPAA and de-identification are mostly reasonable. The things it prohibits mostly should be prohibited. And it has a common sense provision in the form of expert determination. If your data uses fall outside the regulation’s specific recommendations but don’t endanger privacy, you can have an expert certify that this is the case.

Particle filter and unscented Kalman filter in a nutshell

Suppose you have a linear dynamic system. That is, the function that predicts the next state from the current state is linear. Suppose also that the states in your system are not known precisely but have some uncertainty modeled by a (multivariate) normal distribution. Then the uncertainty in the state at the next step also has a normal distribution, because a linear transformation of a normal distribution remains normal. This is a very high-level description of the classic Kalman filter.

When the transition from one state to another is nonlinear, the probability distribution around future states is not normal. There are many variations on the Kalman filter that amount to various approximations to get around this core difficulty: extended Kalman filters, unscented Kalman filters, particle filters, etc. Here I’ll just discuss unscented Kalman filters and particle filters. This will be a very hand-wavy discussion, but it will give the basic ideas.

It’s easy to push discrete random points through a nonlinear transformation. Calculating the effect of a nonlinear transformation on continuous random variables is more work. The idea of an unscented Kalman filter is to create a normal distribution that approximates the result of a nonlinear transformation by seeing what happens to a few carefully chosen points. The idea of particle filtering is to randomly generate a cloud of points and push these points through the nonlinear transformation.
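
To make the hand-waving slightly more concrete, here is a toy one-dimensional sketch, not a working filter. The transition function f and the sigma-point weights (the classic κ = 2 choice in one dimension) are illustrative assumptions.

      import numpy as np

      def f(x):
          return np.sin(x) + 0.1*x**2          # a made-up nonlinear transition

      mu, sigma = 1.0, 0.3                     # current state estimate: N(mu, sigma^2)

      # Particle idea: push a random cloud of points through f
      particles = np.random.normal(mu, sigma, 10_000)
      pushed = f(particles)
      print(pushed.mean(), pushed.std())

      # Unscented idea: push a few carefully chosen sigma points through f
      points  = np.array([mu, mu + sigma*np.sqrt(3), mu - sigma*np.sqrt(3)])
      weights = np.array([2/3, 1/6, 1/6])
      fx = f(points)
      m = weights @ fx
      s = np.sqrt(weights @ (fx - m)**2)
      print(m, s)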

This may make it sound like (unscented) Kalman filters and particle filters are trivial to implement. Far from it. The basic framework is easy to understand, but getting good performance on a particular problem can take a lot of work.


Heavy-tailed random matrices

Suppose you fill the components of a matrix with random values. How are the eigenvalues distributed?

We limit our attention to large, symmetric matrices. We fill the entries of the matrix on the diagonal and above the diagonal with random elements, then fill in the elements below the diagonal by symmetry.

If we choose our random values from a thin-tailed distribution, then Wigner’s semicircle law tells us what to expect from our distribution. If our matrices are large enough, then we expect the probability density of eigenvalues to be semicircular. To illustrate this, we’ll fill a matrix with samples from a standard normal distribution and compute its eigenvalues with the following Python code.

      import numpy as np
      import matplotlib.pyplot as plt

      N = 5000
      A = np.random.normal(0, 1, (N, N))    
      B = (A + A.T)/np.sqrt(2)
      eigenvalues = np.linalg.eigvalsh(B) 
      print(max(eigenvalues), min(eigenvalues))

      plt.hist(eigenvalues, bins=70)
      plt.show()

We first create an N by N non-symmetric matrix, then make it symmetric by adding it to its transpose. (That’s easier than only creating the upper-triangular elements.) We divide by the square root of 2 to return the variance of each component to its original value, in this case 1. The eigenvalues in this particular experiment ran from -141.095 to 141.257 and their histogram shows the expected semi-circular shape.

eigenvalue distribution with normally distributed matrix entries

Wigner’s semicircle law does not require the samples to come from a normal distribution. Any distribution with finite variance will do. We illustrate this by replacing the normal distribution with a Laplace distribution with the same variance and rerunning the code. That is, we change the definition of the matrix A to

      A = np.random.laplace(0, np.sqrt(0.5), (N, N))

and get very similar results. The eigenvalues ran from -140.886 to 141.514 and again we see a semicircular distribution.

eigenvalue distribution for matrix with entries drawn from Laplace distribution

But what happens when we draw samples from a heavy-tailed distribution, i.e. one without finite variance? We’ll use a Student-t distribution with ν = 1.8 degrees of freedom. When ν > 2 the t-distribution has variance ν/(ν – 2), but for smaller values of ν it has no finite variance. We change the definition of the matrix A to the following:

      A = np.random.standard_t(1.8, (N, N))

and now the distribution is quite different.

eigenvalue distribution for matrix with entries drawn from Student t distribution with 1.8 degrees of freedom

In this case the minimum eigenvalue was -9631.558 and the largest was 9633.853. When the matrix components are selected from a heavy-tailed distribution, the eigenvalues also have a heavier-tailed distribution. In this case, nearly all the eigenvalues are between -1000 and 1000, but the largest and smallest were 10 times larger. The eigenvalues are more variable, and their distribution looks more like a normal distribution and less like a semicircle.

The distributions for all thin-tailed symmetric matrices are the same. They have a universal property. But each heavy-tailed distribution gives rise to a different distribution on eigenvalues. With apologies to Tolstoy, thin-tailed matrices are all alike; every thick-tailed matrix is thick-tailed in its own way.

Update: As the first comment below rightfully points out, the diagonal entries should be divided by 2, not sqrt(2). Each of the off-diagonal elements of A + Aᵀ is the sum of two independent random variables, but the diagonal elements are twice what they were in A. To put it another way, the diagonal elements are the sum of perfectly correlated random variables, not independent random variables.
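
In code, one way to apply that correction (my rendering of it, reusing the matrix A from above) is the following.

      B = (A + A.T)/np.sqrt(2)
      # (A + A.T) doubles the diagonal of A, so dividing the diagonal by 2
      # restores the original diagonal entries
      np.fill_diagonal(B, np.diagonal(A))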

I reran the simulations with the corrected code and it made no noticeable difference, either to the plots or to the range of the eigenvalues.

Simplify integration with complex variables

Last night I was helping my daughter with her calculus homework. One of the problems was the following integral:

\int \exp(-x) \sin(4x) \, dx

This is an interesting problem for two reasons. First, it illustrates a clever variation on integration by parts; that’s why it was assigned. But it can also be computed using complex variables. As is often the case, the “complex” approach is simpler. Below I’ll show the solution the students were expected to find, then one they wouldn’t be expected to find.

Integration by parts

The traditional approach to this integral is to integrate by parts. Letting u = sin(4x), the integral becomes

- \exp(-x) \sin(4x) + 4 \int \exp(-x) \cos(4x) \, dx

Next we integrate by parts one more time, this time letting u = cos(4x). This gives us

- \exp(-x) \sin(4x) - 4 \exp(-x) \cos(4x) -16\int \exp(-x) \sin(4x) \, dx

At this point it looks like we’re getting nowhere. We could keep on integrating by parts forever. Not only that, we’re going in circles: we have an integral that’s just like the one we started with. But then the clever step is to realize that this is a good thing. Let I be our original integral. Then

I = -\exp(-x) \left(\sin(4x) + 4 \cos(4x)\right) - 16 I

Now we can solve for I:

I = - \frac{1}{17} \exp(-x) (\sin(4x) + 4 \cos(4x))

Complex variables

Here’s another approach. Recognize that sin(4x) is the imaginary part of exp(4ix) and so our integral is the imaginary part of

\int \exp(-x) \exp(4ix) \,dx = \int \exp\left((-1+4i)x\right) \,dx

which we can integrate immediately:

\frac{1}{-1+4i} \exp\left((-1+4i)x\right).

There’s still algebra to do, but the calculus is over. And while the algebra will take a few steps, it’s routine.

First, let’s take care of the fraction.

\frac{1}{-1+4i} = \frac{1}{-1+4i} \frac{-1 - 4i}{-1 - 4i} = - \frac{1+4i}{17}.

Next,

\exp\left((-1+4i)x\right) = \exp(-x)\left( \cos(4x) + i \sin(4x) \right)

and so our integral is the imaginary part of

-\frac{1}{17} \exp(-x) (1 + 4i)\left( \cos(4x) + i \sin(4x) \right)

which gives us the same result as before.

The complex variable approach requires one insight: recognizing a sine as the imaginary part of an exponential. The traditional approach requires several insights: two integrations by parts and solving for the original integral.
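
As a quick sanity check, not part of either derivation, SymPy confirms that the two approaches give the same antiderivative.

      from sympy import symbols, exp, sin, I, im, integrate, simplify

      x = symbols('x', real=True)

      direct = integrate(exp(-x)*sin(4*x), x)

      # Integrate exp((-1+4i)x) and take the imaginary part
      complex_form = integrate(exp((-1 + 4*I)*x), x)
      via_complex  = im(complex_form.expand(complex=True))

      print(simplify(direct - via_complex))   # 0, so the antiderivatives agree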


Sticky cards

Suppose you shuffle a deck of cards. How likely is it that two cards that were next to each other before the shuffle are still next to each other after the shuffle?

It depends on how well you shuffle the cards. If you do what’s called a “faro shuffle” then the probability of a pair of cards remaining neighbors is 0. In a faro shuffle, if the cards were originally numbered 1 through 52, the new arrangement will be 1, 27, 2, 28, 3, 29, …, 26, 52. All pairs are split up.

But if you shuffle the cards so well that the new arrangement is a random permutation of the original, there’s about an 86% chance that at least one pair of neighbors remains neighbors. To be more precise, consider permutations of n items. As n increases, the probability that no two originally adjacent items remain adjacent converges to exp(-2), and so the probability that at least one pair remains adjacent converges to 1 – exp(-2).

By the way, this considers pairs staying next to each other in either order. For example, if the cards were numbered consecutively, then either a permutation with 7 followed by 8 or 8 followed by 7 would count.

More generally, for large n, the probability of k pairs remaining consecutive after a shuffle is 2^k exp(-2) / k!.
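
Here is a quick Monte Carlo sketch, my own check rather than part of the original argument, of the 1 – exp(-2) figure for a 52-card deck.

      import numpy as np

      rng = np.random.default_rng()
      n, trials = 52, 100_000
      hits = 0
      for _ in range(trials):
          p = rng.permutation(n)      # a uniformly random shuffle of cards 0..51
          # some originally adjacent pair stays together exactly when two
          # neighboring positions hold cards whose labels differ by 1
          if np.any(np.abs(np.diff(p)) == 1):
              hits += 1

      print(hits/trials, 1 - np.exp(-2))   # both roughly 0.86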

One application of this result would be testing. If you randomly permute a list of things to break up consecutive pairs, it probably won’t work. Instead, you might want to split your list in half, randomize each half separately, then interleave the two half lists as in the faro shuffle.

Another application would be fraud detection. If you’re suspicious that a supposedly random process isn’t splitting up neighbors, you could use the result presented here to calibrate your expectations. Maybe what you’re seeing is consistent with good randomization. Or maybe not. Note that the result here looks at any pair remaining together. If a particular pair remains together consistently, that’s more suspicious.


Reference: David P. Robbins, The Probability that Neighbors Remain Neighbors After Random Rearrangements. The American Mathematical Monthly, Vol. 87, No. 2 (1980), pp. 122-124.

Category Theory and Facebook

From Drew Armstrong’s notes on adjoint functors:

Once upon a time, my opinion of category theory was the same as my opinion of Facebook: if I ignore it for long enough, hopefully it will go away. It is now my educated opinion that category theory will not go away, and in fact the language of category theory will continue to spread until it becomes the default foundation of mathematics.


Infinite primes via Fibonacci numbers

A well-known result about Fibonacci numbers says

\gcd(F_m, F_n) = F_{\gcd(m, n)}

In words, the greatest common divisor of the mth and nth Fibonacci numbers is the gth Fibonacci number where g is the greatest common divisor of m and n. You can find a proof here.
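
Here is a quick numerical spot check of the identity, not a proof, using SymPy’s fibonacci function.

      from sympy import fibonacci, gcd

      # Spot-check gcd(F_m, F_n) = F_gcd(m, n) for a few index pairs
      for m, n in [(12, 18), (25, 40), (100, 36)]:
          assert gcd(fibonacci(m), fibonacci(n)) == fibonacci(gcd(m, n))
      print("identity holds for the sample pairs")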

M. Wunderlich used this identity to create a short, clever proof that there are infinitely many primes.

Suppose on the contrary that there are only k prime numbers, and consider the set of Fibonacci numbers with prime indices: F_{p_1}, F_{p_2}, …, F_{p_k}. The Fibonacci theorem above tells us that these Fibonacci numbers are pairwise relatively prime: each pair has greatest common divisor F_1 = 1.

Each of the Fibonacci numbers in our list must have only one prime factor. Why? Because we have assumed there are only k prime numbers, and no two of our Fibonacci numbers share a prime factor. But 19 is prime, and F_19 = 4181 = 37 × 113 has two distinct prime factors. We’ve reached a contradiction, and so there must be infinitely many primes.

Source: M. Wunderlich, Another proof of the infinite primes theorem. American Mathematical Monthly, Vol. 72, No. 3 (March 1965), p. 305.

Here are a couple more unusual proofs that there are infinitely many primes. The first uses a product of sines. The second is from Paul Erdős. It also gives a lower bound on π(N), the number of primes less than N.