Mercury and the bandwagon effect


The study of the planet Mercury provides two examples of the bandwagon effect. In her new book Worlds Fantastic, Worlds Familiar, planetary astronomer Bonnie Buratti writes

The study of Mercury … illustrates one of the most confounding bugaboos of the scientific method: the bandwagon effect. Scientists are only human, and they impose their own prejudices and foregone conclusions on their experiments.

Around 1800, Johann Schroeter determined that Mercury had a rotational period of 24 hours. This view held for eight decades.

In the 1880s, Giovanni Schiaparelli determined that Mercury was tidally locked, making one rotation on its axis for every orbit around the sun. This view also held for eight decades.

In 1965, radar measurements of Mercury showed that Mercury completes 3 rotations in every 2 orbits around the sun.

Studying Mercury is difficult since it is only visible near the horizon and around sunrise and sunset, i.e. when the sun’s light interferes. And it is understandable that someone would confuse a 3:2 resonance with tidal locking. Still, for two periods of eight decades each, astronomers looked at Mercury and concluded what they expected.

The difficulty of seeing Mercury objectively was compounded by two incorrect but satisfying metaphors. First that Mercury was like Earth, rotating every 24 hours, then that Mercury was like the moon, orbiting the sun the same way the moon orbits Earth.

Buratti mentions the famous Millikan oil drop experiment as another example of the bandwagon effect.

… Millikan’s value for the electron’s charge was slightly in error—he had used a wrong value for the viscosity of air. But future experimenters all seemed to get Millikan’s number. Having done the experiment myself I can see that they just picked those values that agreed with previous results.

Buratti explains that Millikan’s experiment is hard to do and “it is impossible to successfully do it without abandoning most data.” This is what I like to call acceptance-rejection modeling.

The name comes from the acceptance-rejection method of random number generation. For example, the obvious way to generate truncated normal random values is to generate (unrestricted) normal random values and simply throw out the ones that lie outside the interval we’d like to truncate to. This is inefficient if we’re truncating to a small interval, but it always works. We’re conforming our samples to a pre-determined distribution, which is OK when we do it intentionally. The problem comes when we do it unintentionally.
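To make the idea concrete, here's a minimal Python sketch of acceptance-rejection sampling for a truncated normal. The function name and the interval [-1, 1] are just illustrative choices.

      import numpy as np

      def truncated_normal(a, b, size):
          # Acceptance-rejection: draw unrestricted standard normals and
          # keep only the draws that land inside [a, b].
          rng = np.random.default_rng()
          kept = np.empty(0)
          while kept.size < size:
              draws = rng.standard_normal(size)
              kept = np.concatenate([kept, draws[(draws >= a) & (draws <= b)]])
          return kept[:size]

      samples = truncated_normal(-1, 1, 10000)   # every sample lies in [-1, 1]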

Photo of Mercury above via NASA

Quantile-quantile plots and powers of 3/2

This post serves two purposes: it empirically explores a question in number theory raised in the previous post, and it demonstrates quantile-quantile (q-q) plots. If you're not familiar with q-q plots, it will serve as an introduction to them.

The previous post said that for almost all x > 1, the fractional parts of the powers of x are uniformly distributed. Although this is true for almost all x, it can be hard to establish for any particular x. The previous post ended with the question of whether the fractional parts of the powers of 3/2 are uniformly distributed.

First, let's just plot the sequence (3/2)^n mod 1.

powers of 3/2 mod 1

Looks kinda random. But is it uniformly distributed? One way to tell would be to look at the empirical cumulative distribution function (ECDF) and see how it compares to a uniform cumulative distribution function. This is what a quantile-quantile plot does. In our case we’re looking to see whether something has a uniform distribution, but you could use a q-q plot for any distribution. It may be most often used to test normality by looking at whether the ECDF looks like a normal CDF.

If a sequence is uniformly distributed, we would expect 10% of the values to be less than 0.1. We would expect 20% of the values to be less than 0.2. Etc. In other words, we’d expect the quantiles to line up with their theoretical values, hence the name “quantile-quantile” plot. On the horizontal axis we plot uniform values between 0 and 1. On the vertical axis we plot the sorted values of (3/2)^n mod 1.
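Here's a sketch in Python of how such a plot could be made. The powers are computed with exact rational arithmetic, since floating point loses the fractional part of (3/2)^n once the integer part gets large; the number of points is an arbitrary choice.

      from fractions import Fraction
      import matplotlib.pyplot as plt

      N = 1000
      # Fractional parts of (3/2)^k, computed exactly, then sorted
      fracs = sorted(float(Fraction(3, 2)**k % 1) for k in range(1, N + 1))
      quantiles = [(i + 0.5) / N for i in range(N)]   # uniform quantiles

      plt.plot(quantiles, fracs, ".")
      plt.plot([0, 1], [0, 1])   # diagonal reference line
      plt.show()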

qq plot of powers of 3/2 mod 1

A q-q plot indicates a good fit when values line up near the diagonal, as they do here.

For contrast, let’s look at a q-q plot for the powers of the plastic constant mod 1.

qq plot of powers of the plastic constant

Here we get something very far from the diagonal line. The plot is flat on the left because many of the values are near 0, and it’s flat on the right because many values are near 1.

Incidentally, the Kolmogorov-Smirnov goodness of fit test is basically an attempt to quantify the impression you get from looking at a q-q plot. It’s based on a statistic that measures how far apart the empirical CDF and theoretical CDF are.
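For example, here's a sketch of applying SciPy's Kolmogorov-Smirnov test to the same sequence; kstest compares the empirical CDF of the data to the uniform CDF on [0, 1].

      from fractions import Fraction
      from scipy.stats import kstest

      N = 1000
      fracs = [float(Fraction(3, 2)**k % 1) for k in range(1, N + 1)]
      # A small test statistic and a large p-value indicate consistency with uniformity
      print(kstest(fracs, "uniform"))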

Freudian hypothesis testing


In his paper Mindless statistics, Gerd Gigerenzer uses a Freudian analogy to describe the mental conflict researchers experience over statistical hypothesis testing. He says that the “statistical ritual” of NHST (null hypothesis significance testing) “is a form of conflict resolution, like compulsive hand washing.”

In Gigerenzer’s analogy, the id represents Bayesian analysis. Deep down, a researcher wants to know the probabilities of hypotheses being true. This is something that Bayesian statistics makes possible, but more conventional frequentist statistics does not.

The ego represents R. A. Fisher’s significance testing: specify a null hypothesis only, not an alternative, and report a p-value. Significance is calculated after collecting the data. This makes it easy to publish papers. The researcher never clearly states his hypothesis, and yet takes credit for having established it after rejecting the null. This leads to feelings of guilt and shame.

The superego represents the Neyman-Pearson version of hypothesis testing: pre-specified alternative hypotheses, power and sample size calculations, etc. Neyman and Pearson insist that hypothesis testing is about what to do, not what to believe. [1]

* * *

I assume Gigerenzer doesn’t take this analogy too seriously. In context, it’s a humorous interlude in his polemic against rote statistical ritual.

But there really is a conflict in hypothesis testing. Researchers naturally think in Bayesian terms, and interpret frequentist results as if they were Bayesian. They really do want probabilities associated with hypotheses, and will imagine they have them even though frequentist theory explicitly forbids this. The rest of the analogy, comparing the ego and superego to Fisher and Neyman-Pearson respectively, seems weaker to me. But I suppose you could imagine Neyman and Pearson playing the role of your conscience, making you feel guilty about the pragmatic but unprincipled use of p-values.

* * *

[1] “No test based upon a theory of probability can by itself provide any valuable evidence of the truth or falsehood of a hypothesis. But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern behaviour in regard to them, in following which we insure that, in the long run of experience, we shall not often be wrong.”

Neyman J, Pearson E. On the problem of the most efficient tests of statistical hypotheses. Philos Trans Roy Soc A, 1933;231:289, 337.

Big data and the law


Excerpt from the new book Big Data of Complex Networks:

Big Data and data protection law provide for a number of mutual conflicts: from the perspective of Big Data analytics, a strict application of data protection law as we know it today would set an immediate end to most Big Data applications. From the perspective of the law, Big Data is either a big threat … or a major challenge for international and national lawmakers to adopt today’s data protection laws to the latest technological and economic developments.

Emphasis added.

The author of the chapter on legal matters is Swiss and writes primarily in a European context, though all countries face similar problems.

I’m not a lawyer, though I sometimes work with lawyers, and sometimes help companies with the statistical aspects of HIPAA law. But as a layman the observation above sounds reasonable to me, that strict application of the law could bring many applications to a halt, for better and for worse.

In my opinion the regulations around HIPAA and de-identification are mostly reasonable. The things it prohibits mostly should be prohibited. And it has a common sense provision in the form of expert determination. If your data uses fall outside the regulation’s specific recommendations but don’t endanger privacy, you can have an expert certify that this is the case.

Sticky cards

Suppose you shuffle a deck of cards. How likely is it that two cards that were next to each other before the shuffle are still next to each other after the shuffle?

It depends on how well you shuffle the cards. If you do what’s called a “faro shuffle” then the probability of a pair of cards remaining neighbors is 0. In a faro shuffle, if the cards were originally numbered 1 through 52, the new arrangement will be 1, 27, 2, 28, 3, 29, …, 26, 52. All pairs are split up.

But if you shuffle the cards so well that the new arrangement is a random permutation of the original, there’s about an 86% chance that at least one pair of neighbors remain neighbors. To be more precise, consider permutations of n items. As n increases, the probability of no two cards remaining consecutive converges to exp(-2), and so the probability of at least one pair of cards remaining next to each other converges to 1 – exp(-2).

By the way, this considers pairs staying next to each other in either order. For example, if the cards were numbered consecutively, then either a permutation with 7 followed by 8 or 8 followed by 7 would count.

More generally, for large n, the probability of k pairs remaining consecutive after a shuffle is 2^k exp(-2) / k!.
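As a sanity check, here's a quick Monte Carlo sketch of the 52-card case; the trial count is arbitrary.

      import math
      import random

      def has_sticky_pair(n=52):
          # Shuffle labels 0..n-1 and check whether any two adjacent positions
          # hold cards that were originally neighbors (in either order).
          deck = list(range(n))
          random.shuffle(deck)
          return any(abs(deck[i] - deck[i + 1]) == 1 for i in range(n - 1))

      trials = 100000
      estimate = sum(has_sticky_pair() for _ in range(trials)) / trials
      print(estimate, 1 - math.exp(-2))   # both should be near 0.86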

One application of this result would be testing. If you randomly permute a list of things to break up consecutive pairs, it probably won’t work. Instead, you might want to split your list in half, randomize each half separately, then interleave the two half lists as in the faro shuffle.

Another application would be fraud detection. If you’re suspicious that a supposedly random process isn’t splitting up neighbors, you could use the result presented here to calibrate your expectations. Maybe what you’re seeing is consistent with good randomization. Or maybe not. Note that the result here looks at any pair remaining together. If a particular pair remains together consistently, that’s more suspicious.


Reference: David P. Robbins, The Probability that Neighbors Remain Neighbors After Random Rearrangements. The American Mathematical Monthly, Vol. 87, No. 2, pp. 122-124

Probability of secure hash collisions

A hash function maps arbitrarily long input strings to fixed-length outputs. For example, SHA-256 maps its input to a string of 256 bits. A cryptographically secure hash function h is a one-way function, i.e. given a message m it’s easy to compute h(m) but it’s not practical to go the other way, to learn anything about m from h(m). Secure hash functions are useful for message authentication codes because it is practically impossible to modify m without changing h(m).

Ideally, a secure hash is “indistinguishable from a random mapping.” [1]  So if a hash function has a range of size N, how many items can we send through the hash function before we can expect two items to have same hash value? By the pigeonhole principle, we know that if we hash N+1 items, two of them are certain to have the same hash value. But it’s likely that a much smaller number of inputs will lead to a collision, two items with the same hash value.

The famous birthday problem illustrates this. You could think of birthdays as a random mapping of people into 366 possible values [2]. In a room of 366 or fewer people, it’s possible that everyone has a different birthday. But in a group of only 23 people, there are even odds that two people have the same birthday.
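Here's a short Python calculation of the classic version of the birthday problem, assuming 365 equally likely birthdays and ignoring the leap-year wrinkle mentioned in the footnote.

      import numpy as np

      def prob_shared_birthday(k, days=365):
          # 1 minus the probability that all k birthdays are distinct
          return 1 - np.prod(1 - np.arange(k) / days)

      print(prob_shared_birthday(23))   # roughly 0.507, i.e. even odds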

Variations on the birthday problem come up frequently. For example, in seeding random number generators. And importantly for this post, the birthday problem comes up in attacking hash functions.

When N is large, it is likely that hashing √N values will lead to a collision. We prove this below.

Proof

The proof below is a little informal. It could be made more formal by replacing the approximate equalities with equalities and adding the necessary little-o terms.

Suppose we’re hashing n items to a range of size N = n^2. The exact probability that all n items have unique hash values is given here. Taking the log of both sides gives us the first line of the proof below.

\begin{aligned}
\log(\mbox{Prob(unique)}) &= \log(\Gamma(n^2)) - \log(\Gamma(n^2 - n)) - n \log(n^2) \\
&\approx (n^2 - n) \log(n^2) - (n^2 - n) \log(n^2 - n) - n \\
&= (n^2 - n) \log\left( \frac{n^2}{n^2 - n} \right) - n \\
&= (n^2 - n) \log\left( 1 + \frac{1}{n - 1} \right) - n \\
&\approx (n^2 - n) \left( \frac{1}{n-1} - \frac{1}{2}\frac{1}{(n-1)^2} \right) - n \\
&= -\frac{n}{2(n-1)} \\
&\approx -\frac{1}{2}
\end{aligned}

The first approximation is based on the first three terms in the asymptotic expansion for log Γ given here, applied to both log gamma expressions. (The third terms from the two asymptotic series are the same so they cancel out.) The second line isn’t exactly what you’d get by applying the asymptotic expansion. It’s been simplified a little. The neglected terms are not a mistake but terms that can be left out because they go to zero.

The second approximation comes from using the first two terms in the power series for log(1 + x). One term isn’t enough since that would reduce to zero. The final approximation is simply taking the limit as n goes to infinity. Concretely, we’re willing to say that a billion and one divided by a billion is essentially 1.

Conclusions

So the probability of no collisions is exp(-1/2) or about 60%, which means there’s a 40% chance of at least one collision. As a rule of thumb, a hash function with range of size N can hash on the order of √N values before running into collisions.

This means that with a 64-bit hash function, there’s about a 40% chance of collisions when hashing 2^32 or about 4 billion items. If the output of the hash function is discernibly different from random, the probability of collisions may be higher. A 64-bit hash function cannot be secure since an attacker could easily hash 4 billion items. A 256-bit or 512-bit hash could in principle be secure since one could expect to hash far more items before collisions are likely. Whether a particular algorithm like SHA3-512 is actually secure is a matter for cryptologists, but it is at least feasible that a hash with a 512-bit range could be secure, based on the size of its range, while a 64-bit hash cannot be.

Numerical calculation

We used an asymptotic argument above rather than numerically evaluating the probabilities because this way we get a more general result. But even if we were only interested in a fixed but large n, we’d run into numerical problems. This is one of those not uncommon cases where a pencil-and-paper approximation is actually more accurate than direct calculation with no (explicit) approximations.

There are numerous numerical problems with direct calculation of the collision probability. First, without taking logarithms we’d run into overflow and underflow. Second, for large enough n, n^2 - n = n^2 in floating point representation. IEEE 754 doubles have 53 bits of precision. This isn’t enough to distinguish values that differ, say, in the 128th bit. Finally, the two log gamma terms are large, nearly equal numbers. The cardinal rule of numerical analysis is to avoid subtracting nearly equal numbers. If two numbers agree to k bits, you could lose k bits of precision in carrying out their difference. See Avoiding overflow, underflow, and loss of precision for more along these lines.
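For moderate n, though, the direct calculation can be done on the log scale. Here's a sketch using scipy.special.gammaln; note that it still subtracts two large, nearly equal log-gamma values, which is exactly the loss of precision described above once n gets very large.

      import numpy as np
      from scipy.special import gammaln

      def log_prob_unique(n):
          # log P(no collision) when hashing n items into N = n^2 buckets
          N = float(n)**2
          return gammaln(N + 1) - gammaln(N - n + 1) - n * np.log(N)

      for n in [10, 100, 1000]:
          print(n, np.exp(log_prob_unique(n)))   # approaches exp(-1/2) ≈ 0.6065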

Notes

[1] Cryptography Engineering by Ferguson, Schneier, and Kohno

[2] Leap year of course complicates things since February 29 birthdays are less common than other birthdays. Another complication is that birthdays are not entirely evenly distributed for the other days of the year. But these complications don’t ruin the party trick: In a room of 30 people, two people usually share a birthday.

Subjectivity in statistics

Andrew Gelman on subjectivity in statistics:

Bayesian methods are often characterized as “subjective” because the user must choose a prior distribution, that is, a mathematical expression of prior information. The prior distribution requires information and user input, that’s for sure, but I don’t see this as being any more “subjective” than other aspects of a statistical procedure, such as the choice of model for the data (for example, logistic regression) or the choice of which variables to include in a prediction, the choice of which coefficients should vary over time or across situations, the choice of statistical test, and so forth. Indeed, Bayesian methods can in many ways be more “objective” than conventional approaches in that Bayesian inference, with its smoothing and partial pooling, is well adapted to including diverse sources of information and thus can reduce the number of data coding or data exclusion choice points in an analysis.

People worry about prior distributions, not because they’re subjective, but because they’re explicitly subjective. There are many other subjective factors, common to Bayesian and Frequentist statistics, but these are implicitly subjective.

In practice, prior distributions often don’t make much difference. For example, you might show that an optimistic prior and a pessimistic prior lead to the same conclusion.

If you have so little data that the choice of prior does make a substantial difference, being able to specify the prior is a benefit. Suppose you only have a little data but have to make a decision anyway. A frequentist might say there’s too little data for a statistical analysis to be meaningful. So what do you do? Make a decision entirely subjectively! But with a Bayesian approach, you capture what is known outside of the data at hand in the form of a prior distribution, and update the prior with the little data you have. In this scenario, a Bayesian analysis is less subjective and more informed by the data than a frequentist approach.

Random squares

In geometry, you’d say that if a square has side x, then it has area x^2.

In calculus, you’d say more. First you’d say that if a square has side near x, then it has area near x^2. That is, area is a continuous function of the length of a side. As the length of the side changes, there’s never an abrupt jump in area. Next you could be more specific and say that a small change Δx to a side of length x corresponds to approximately a change of 2x Δx in the area.

In probability, you ask what the area of a square is like if you pick the length of its side at random. If you pick the length of the side from a distribution with mean μ, does the distribution of the area have mean μ^2? No, but if the probability distribution on side length is tightly concentrated around μ, then the distribution on area will be concentrated near μ^2. And you can approximate just how near the area is to μ^2 using the delta method, analogous to the calculus discussion above.
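Here's a small simulation sketch of that delta-method claim, with an arbitrary choice of a side length tightly concentrated around μ = 10.

      import numpy as np

      rng = np.random.default_rng(0)
      mu, sigma = 10.0, 0.1
      side = rng.normal(mu, sigma, 1000000)
      area = side**2

      print(area.mean(), mu**2)          # mean area is close to mu^2
      print(area.std(), 2 * mu * sigma)  # delta method: sd(area) is about 2 mu sigma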

If the distribution on side lengths is not particularly concentrated, finding the distribution on the area is more interesting. It will depend on the specific distribution on side length, and the mean area might not be particularly close to the square of the mean side length. The function to compute area is trivial, and yet the question of what happens when you stick a random variable into that function is not trivial. Random variables behave as you might expect when you stick them into linear functions, but offer surprises when you stick them into nonlinear functions.

Suppose you pick the length of the side of a square uniformly from the interval [0, 1]. Then the average side is 1/2, and so you might expect the average area to be 1/4. But the expected area is actually 1/3. You could see this a couple ways, analytically and empirically.

First an analytical derivation. If X has a uniform [0, 1] distribution and Z = X^2, then the CDF of Z is

Prob(Z ≤ z) = Prob(X ≤ √z) = √z

and so the PDF for Z, the derivative of the CDF, is 1/(2√z). From there you can compute the expected value by integrating z times the PDF: the integral of z/(2√z) = √z/2 over [0, 1] is 1/3.

You could check your calculations by seeing whether simulation gives you similar results. Here’s a little Python code to do that.

      # Monte Carlo estimate of E[X^2] where X is uniform on [0, 1]
      from random import random
      N = 1000000
      print(sum(random()**2 for _ in range(N)) / N)

When I run this, I get 0.33386, close to 1/3.

Now let’s look at an exponential distribution on side length with mean 1. Then a calculation similar to the one above shows that the expected value of the square of the side is 2. You can also check this with simulation. This time we’ll be a little fancier and let SciPy generate our random values for us.

      from scipy.stats import expon
      print(sum(expon.rvs(size=N)**2)/N)

When I ran this, I got 1.99934, close to the expected value of 2.

You’ll notice that in both examples, the expected value of the area is more than the square of the expected value of the side. This is not a coincidence but a consequence of Jensen’s inequality. Squaring is a convex function, so the expected value of the square is larger than the square of the expected value for any non-constant random variable.

Normal hazard continued fraction

The hazard function of a probability distribution is the instantaneous probability density of an event given that it hasn’t happened yet. This works out to be the ratio of the PDF (probability density function) to the CCDF (complementary cumulative distribution function).

For the standard normal distribution, the hazard function is

h(x) = \frac{\exp(-x^2/2)}{\int_x^\infty \exp(-t^2/2)\,dt}

and has a surprisingly simple continued fraction representation:

h(x) = x + \cfrac{1}{x+\cfrac{2}{x+\cfrac{3}{x+\cfrac{4}{x+\cdots}}}}

Aside from being an elegant curiosity, this gives an efficient way to compute the hazard function for large x. (It’s valid for any positive x, but most efficient for large x.)

Source: A&S equation 26.2.14
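Here's a sketch of evaluating the continued fraction from the inside out and checking it against a direct computation of the hazard; the truncation depth is an arbitrary choice.

      from scipy.stats import norm

      def normal_hazard_cf(x, depth=60):
          # h(x) = x + 1/(x + 2/(x + 3/(x + ...))), truncated at the given depth
          t = x
          for k in range(depth, 0, -1):
              t = x + k / t
          return t

      x = 3.0
      print(normal_hazard_cf(x), norm.pdf(x) / norm.sf(x))   # should agree closely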


Interim analysis, futility monitoring, and predictive probability

An interim analysis of a clinical trial is an unusual analysis. At the end of the trial you want to estimate how well some treatment X works. For example, you want to know how likely it is that treatment X works better than the control treatment Y. But in the middle of the trial you want to know something more subtle.

It’s possible that treatment X is doing so poorly that you want to end the trial without going any further. It’s also possible that X is doing so well that you want to end the trial early. Both of these are rare. Most of the time an interim analysis is more concerned with futility. You might want to stop the trial early not because the results are really good, or really bad, but because the results are really mediocre! That is, treatments X and Y are performing so similarly that you’re afraid that you won’t be able to declare one or the other better.

Maybe treatment X is doing a little better than Y, but not so much better that you can declare with confidence that X is better. You might want to stop for futility if you project that not only do you not have enough evidence now, you don’t believe you will have enough evidence by the end of the trial.

Futility analysis is more about resources than ethics. If X is doing poorly, ethics might dictate that you stop giving X to patients so you stop early. If X is doing spectacularly well, ethics might dictate that you stop giving the control treatment, if there is an active control. But if X is doing so-so, there’s usually not an ethical reason to stop, unless X is worse than Y on some secondary criteria, such as having worse side effects. You want to end futile studies so you can save resources and get on with the next study, and you could argue that’s an ethical consideration, though less direct.

Futility analysis isn’t about your current estimate of effectiveness. It’s about what you expect your estimate of effectiveness to be in the future. That is, it’s a second order prediction. You’re trying to understand the effectiveness of the trial, not of the treatment per se. You’re not trying to estimate a parameter, for example, but trying to estimate what range of estimates you’re likely to make.

This is why predictive probability is natural for interim analysis. You’re trying to predict outcomes, not parameters. (This is subtle: you’re trying to estimate the probability of outcomes that lead to certain estimates of parameters, namely those that allow you to reach a conclusion with pre-specified significance.)

Predictive probability is a Bayesian concept, but it is useful in analyzing frequentist trial designs. You may have frequentist conclusion criteria, such as a p-value threshold or some requirements on a confidence interval, but you want to know how likely it is that if the trial continues, you’ll see data that lead to meeting your criteria. In that case you want to compute the (Bayesian) predictive probability of meeting your frequentist criteria!


Mathematical modeling for medical devices

We’re about to see a lot of new, powerful, inexpensive medical devices come out. And to my surprise, I’ve contributed to a few of them.

Growing compute power and shrinking sensors open up possibilities we’re only beginning to explore. Even when the things we want to observe elude direct measurement, we may be able to infer them from other things that we can now measure accurately, inexpensively, and in high volume.

In order to infer what you’d like to measure from what you can measure, you need a mathematical model. Or if you’d like to make predictions about the future from data collected in the past, you need a model. And that’s where I come in. Several companies have hired me to help them create medical devices by working on mathematical models. These might be statistical models, differential equations, or a combination of the two. I can’t say much about the projects I’ve worked on, at least not yet. I hope that I’ll be able to say more once the products come to market.

I started my career doing mathematical modeling (partial differential equations) but wasn’t that interested in statistics or medical applications. Then through an unexpected turn of events, I ended up spending a dozen years working in the biostatistics department of the world’s largest cancer center.

Since leaving MD Anderson and starting my consultancy, I’ve been approached by several companies for help with mathematical problems associated with their ideas for a medical device. These are ideal projects because they combine my earlier experience in mathematical modeling with my more recent experience with medical applications.

If you have an idea for a medical device, or know someone who does, let’s talk. I’d like to help.


Uncertainty in a probability

Suppose you did a pilot study with 10 subjects and found a treatment was effective in 7 out of the 10 subjects.

With no more information than this, what would you estimate the probability to be that the treatment is effective in the next subject? Easy: 0.7.

Now what would you estimate the probability to be that the treatment is effective in the next two subjects? You might say 0.49, and that would be correct if we knew that the probability of response is 0.7. But there’s uncertainty in our estimate. We don’t know that the response rate is 70%, only that we saw a 70% response rate in our small sample.

If the probability of success is p, then the probability of s successes and f failures in the next s + f subjects is given by

{s+f \choose s} p^s (1-p)^f

But if our probability of success has some uncertainty and we assume it has a beta(a, b) distribution, then the predictive probability of s successes and f failures is given by

{s+f \choose s} \frac{B(a+s, b+f)}{B(a,b)}

where

B(x, y) = \frac{\Gamma(x)\, \Gamma(y)}{\Gamma(x+y)}

In our example, after seeing 7 successes out of 10 subjects, we estimate the probability of success by a beta(7, 3) distribution. Then this says the predictive probability of two successes is approximately 0.51, a little higher than the naive estimate of 0.49. Why is this?

We’re not assuming the probability of success is 0.7, only that the mean of our estimate of the probability is 0.7. The actual probability might be higher or lower. The predictive probability averages the probability of the outcome over all possible values of the success probability, weighting each value by how likely it is. The contributions from values above and below 0.7 roughly balance out, but the former carry a little more weight, and so the result is a bit larger than the naive estimate.

If this doesn’t seem right, note that mean and median aren’t the same thing for asymmetric distributions. A beta(7,3) distribution has mean 0.7, but it has a probability of 0.537 of being larger than 0.7.

If our initial experiment had shown 70 successes out of 100 instead of 7 out of 10, the predictive probability of two successes would have been 0.492, closer to the value based on the point estimate, but still different.
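Here's a sketch of those two calculations using log-beta functions from SciPy; the helper name is mine.

      import math
      from scipy.special import betaln, comb

      def predictive_prob(s, f, a, b):
          # Probability of s successes and f failures when the success
          # probability itself has a beta(a, b) distribution
          return comb(s + f, s) * math.exp(betaln(a + s, b + f) - betaln(a, b))

      print(predictive_prob(2, 0, 7, 3))     # about 0.51
      print(predictive_prob(2, 0, 70, 30))   # about 0.492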

The further we look ahead, the more difference there is between using a point estimate and using a distribution that incorporates our uncertainty. Here are the probabilities for the number of successes out of the next 100 outcomes, using the point estimate 0.7 and using predictive probability with a beta(7,3) distribution.

So if we’re sure that the probability of success is 0.7, we’re pretty confident that out of 100 trials we’ll see between 60 and 80 successes. But if we model our uncertainty in the probability of response, we get quite a bit of uncertainty when we look ahead to the next 100 subjects. Now we can say that the number of responses is likely to be between 30 and 100.

Click to learn more about Bayesian statistics consulting

Less likely to get half, more likely to get near half

I was catching up on Engines of our Ingenuity episodes this evening when the following line jumped out at me:

If I flip a coin a million times, I’m virtually certain to get 50 percent heads and 50 percent tails.

Depending on how you understand that line, it’s either imprecise or false. The more times you flip the coin, the more likely you are to get nearly half heads and half tails, but the less likely you are to get exactly half of each. I assume Dr. Lienhard knows this and that by “50 percent” he meant “nearly half.”

Let’s make the fuzzy statements above more quantitative. Suppose we flip a coin 2n times for some large number n. Then a calculation using Stirling’s approximation shows that the probability of n heads and n tails is approximately

1/√(πn)

which goes to zero as n goes to infinity. If you flip a coin a million times, there’s less than one chance in a thousand that you’d get exactly half heads.
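Here's a quick check of that claim with SciPy's binomial PMF, compared against the Stirling approximation above.

      import math
      from scipy.stats import binom

      flips = 1000000
      n = flips // 2
      exact = binom.pmf(n, flips, 0.5)       # P(exactly half heads)
      approx = 1 / math.sqrt(math.pi * n)    # Stirling approximation 1/sqrt(pi n)
      print(exact, approx)                   # both are about 0.0008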

Next, let’s quantify the statement that nearly half the tosses are likely to be heads. The normal approximation to the binomial tells us that for large n, the number of heads out of 2n tosses is approximately distributed like a normal distribution with the same mean and variance, i.e. mean n and variance n/2. The proportion of heads is thus approximately normal with mean 1/2 and variance 1/8n. This means the standard deviation is 1/√(8n). So, for example, about 95% of the time the proportion of heads will be 1/2 plus or minus 2/√(8n). As n goes to infinity, the width of this interval goes to 0. Alternatively, we could pick some fixed interval around 1/2 and show that the probability of the proportion of heads being outside that interval goes to 0.

Insufficient statistics

Experience with the normal distribution makes people think all distributions have (useful) sufficient statistics [1]. If you have data from a normal distribution, then the sufficient statistics are the sample mean and sample variance. These statistics are “sufficient” in that the entire data set isn’t any more informative than those two statistics. They effectively condense the data for you. (This is conditional on knowing the data come from a normal. More on that shortly.)

With data from other distributions, the mean and variance may not be sufficient statistics, and in fact there may be no (useful) sufficient statistics. The full data set is more informative than any summary of the data. But out of habit people may think that the mean and variance are enough.

Probability distributions are an idealization, of course, and so data never exactly “come from” a distribution. But if you’re satisfied with a distributional idealization of your data, there may be useful sufficient statistics.

Suppose you have data with such large outliers that you seriously doubt that it could be coming from anything appropriately modeled as a normal distribution. You might say the definition of sufficient statistics is wrong, that the full data set tells you something you couldn’t know from the summary statistics. But the sample mean and variance are still sufficient statistics in this case. They really are sufficient, conditional on the normality assumption, which you don’t believe! The cognitive dissonance doesn’t come from the definition of sufficient statistics but from acting on an assumption you believe to be false.

***

[1] Technically every distribution has sufficient statistics, though the sufficient statistic might be the same size as the original data set, in which case the sufficient statistic hasn’t contributed anything useful. Roughly speaking, distributions have useful sufficient statistics if they come from an “exponential family,” a set of distributions whose densities factor a certain way.

Mittag-Leffler function and probability distribution

The Mittag-Leffler function is a generalization of the exponential function. Since k! = Γ(k + 1), we can write the exponential function’s power series as

\exp(x) = \sum_{k=0}^\infty \frac{x^k}{\Gamma(k+1)}

and we can generalize this to the Mittag-Leffler function

E_{\alpha, \beta}(x) = \sum_{k=0}^\infty \frac{x^k}{\Gamma(\alpha k+\beta)}

which reduces to the exponential function when α = β = 1. There are a few other values of α and β for which the Mittag-Leffler function reduces to more familiar functions. For example,

E_{2,1}(x) = \cosh(\sqrt{x})

and

E_{1/2, 1}(x) = \exp(x^2) \, \mbox{erfc}(-x)

where erfc(x) is the complementary error function.
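Here's a sketch of evaluating the Mittag-Leffler function by truncating its power series and checking it against the special cases above. Straight summation like this is only reasonable for modest arguments; the number of terms is an arbitrary choice.

      import math
      from scipy.special import gamma

      def mittag_leffler(x, alpha, beta, terms=100):
          # Truncated power series: sum of x^k / Gamma(alpha k + beta)
          return sum(x**k / gamma(alpha * k + beta) for k in range(terms))

      x = 2.0
      print(mittag_leffler(x, 1, 1), math.exp(x))                # E_{1,1}(x) = exp(x)
      print(mittag_leffler(x, 2, 1), math.cosh(math.sqrt(x)))    # E_{2,1}(x) = cosh(sqrt(x))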

History

Mittag-Leffler was one person, not two. When I first saw the Mittag-Leffler theorem in complex analysis, I assumed it was named after two people, Mittag and Leffler. But the theorem and the function discussed here are named after one man, the Swedish mathematician Magnus Gustaf (Gösta) Mittag-Leffler (1846–1927).

The function that Mr. Mittag-Leffler originally introduced did not have a β parameter; that generalization came later. The one-parameter function E_α is E_{α, 1}.

Mittag-Leffler probability distributions

Just as you can make a couple probability distributions out of the exponential function, you can make a couple probability distributions out of the Mittag-Leffler function.

Continuous Mittag-Leffler distribution

The exponential function exp(-x) is positive over [0, ∞) and integrates to 1, so we can define a probability distribution whose density (PDF) function is f(x) = exp(-x) and whose distribution function (CDF) is F(x) = 1 – exp(-x). The Mittag-Leffler distribution has CDF 1 – E_α(-x^α) and so reduces to the exponential distribution when α = 1. For 0 < α < 1, the Mittag-Leffler distribution is a heavy-tailed generalization of the exponential. [1]

Discrete Mittag-Leffler distribution

The Poisson distribution comes from taking the power series for exp(λ), normalizing it to 1, and using the kth term as the probability mass for k. That is,

P(X = k) = \frac{1}{\exp(\lambda)} \frac{\lambda^k}{k!}

The analogous discrete Mittag-Leffler distribution [2] has probability mass function

P(X = k) = \frac{1}{E_{\alpha, \beta}(\lambda)} \frac{\lambda^k}{\Gamma(\alpha k + \beta)}.
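Here's a sketch of that probability mass function, again approximating the Mittag-Leffler normalizing constant by a truncated series; with α = β = 1 it reduces to the Poisson distribution.

      from scipy.special import gamma

      def discrete_ml_pmf(k, lam, alpha, beta, terms=200):
          # Normalize the series term lam^k / Gamma(alpha k + beta)
          # by a truncated series for E_{alpha, beta}(lam)
          E = sum(lam**j / gamma(alpha * j + beta) for j in range(terms))
          return (lam**k / gamma(alpha * k + beta)) / E

      print(discrete_ml_pmf(3, 2.0, 1, 1))   # matches the Poisson(2) mass at 3, about 0.18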

Fractional differential equations

In addition to probability and statistics, the Mittag-Leffler function comes up in fractional calculus. It plays a role analogous to that of the exponential function in classical calculus. Just as the solution to the simple differential equation

\frac{d}{dx} f(x) = a\, f(x)

is exp(ax), for 0 < μ < 1, the solution to the fractional differential equation

\frac{d^\mu}{dx^\mu} f(x) = a\, f(x)

is x^{μ-1} E_{μ, μ}(a x^μ). Note that this reduces to exp(ax) when μ = 1. [3]

References

[1] Gwo Dong Lin. On the Mittag-Leffler distributions. Journal of Statistical Planning and Inference 74 (1998) 1–9.

[2] Subrata Chakraborty, S. H. Ong. Mittag-Leffler function distribution: A new generalization of hyper-Poisson distribution. arXiv:1411.0980v1

[3] Keith Oldham, Jan Myland, Jerome Spanier. An Atlas of Functions. Springer.