The journal article announcing PCG gives the results of testing it with the TestU01 test suite. I wanted to try it out by testing it with the DIEHARDER test suite (Robert G. Brown’s extension of George Marsaglia’s DIEHARD test suite) and the NIST Statistical Test Suite. I used what the generator’s website calls the “minimal C implementation.”

(The preprint of the journal article is dated 2015 but apparently hasn’t been published yet.)

For the NIST test suite, I generated 10,000,000 bits and divided them into 10 streams.

For the DIEHARDER test suite, I generated 800,000,000 unsigned 32-bit integers. (DIEHARDER requires a **lot** of random numbers as input.)

For both test suites I used the seed (`state`

) 20170707105851 and sequence constant (`inc`

) 42.

The PCG generator did well on all the NIST tests. For every test, at least least 9 out of 10 streams passed. The test authors say you should expect at least 8 out of 10 streams to pass.

Here’s an excerpt from the results. You can find the full results here.

-------------------------------------------------------- C1 C2 C3 ... C10 P-VALUE PROPORTION STATISTICAL TEST -------------------------------------------------------- 2 0 2 0 0.213309 10/10 Frequency 0 0 1 3 0.534146 10/10 BlockFrequency 3 0 0 0 0.350485 10/10 CumulativeSums 1 1 0 2 0.350485 10/10 CumulativeSums 0 2 2 1 0.911413 10/10 Runs 0 0 1 1 0.534146 10/10 LongestRun 0 1 2 0 0.739918 10/10 Rank 0 4 0 0 0.122325 10/10 FFT 1 0 0 1 0.000439 10/10 NonOverlappingTemplate ... 2 1 0 0 0.350485 9/10 NonOverlappingTemplate 0 2 1 0 0.739918 10/10 OverlappingTemplate 1 1 0 2 0.911413 10/10 Universal 1 1 0 0 0.017912 10/10 ApproximateEntropy 1 0 1 1 ---- 3/4 RandomExcursions ... 0 0 0 1 ---- 4/4 RandomExcursions 2 0 0 0 ---- 4/4 RandomExcursionsVariant ... 0 0 3 0 ---- 4/4 RandomExcursionsVariant 1 2 3 0 0.350485 9/10 Serial 1 1 1 0 0.739918 10/10 Serial 1 2 0 0 0.911413 10/10 LinearComplexity ...

The DIEHARDER suite has 31 kinds tests, some of which are run many times, making a total of 114 tests. Out of the 114 tests, two returned a weak pass for the PCG input and all the rest passed. A few weak passes are to be expected from running so many tests and so this isn’t a strike against the generator. In fact, it might be suspicious if no tests returned a weak pass.

Here’s an edited version of the results. The full results are here.

#=============================================================================# test_name |ntup| tsamples |psamples| p-value |Assessment #=============================================================================# diehard_birthdays| 0| 100| 100|0.46682782| PASSED diehard_operm5| 0| 1000000| 100|0.83602120| PASSED diehard_rank_32x32| 0| 40000| 100|0.11092547| PASSED diehard_rank_6x8| 0| 100000| 100|0.78938803| PASSED diehard_bitstream| 0| 2097152| 100|0.81624396| PASSED diehard_opso| 0| 2097152| 100|0.95589325| PASSED diehard_oqso| 0| 2097152| 100|0.86171368| PASSED diehard_dna| 0| 2097152| 100|0.24812341| PASSED diehard_count_1s_str| 0| 256000| 100|0.75417270| PASSED diehard_count_1s_byt| 0| 256000| 100|0.25725000| PASSED diehard_parking_lot| 0| 12000| 100|0.59288414| PASSED diehard_2dsphere| 2| 8000| 100|0.79652706| PASSED diehard_3dsphere| 3| 4000| 100|0.14978100| PASSED diehard_squeeze| 0| 100000| 100|0.35356584| PASSED diehard_sums| 0| 100| 100|0.04522121| PASSED diehard_runs| 0| 100000| 100|0.39739835| PASSED diehard_runs| 0| 100000| 100|0.99128296| PASSED diehard_craps| 0| 200000| 100|0.64934221| PASSED diehard_craps| 0| 200000| 100|0.27352733| PASSED marsaglia_tsang_gcd| 0| 10000000| 100|0.10570816| PASSED marsaglia_tsang_gcd| 0| 10000000| 100|0.00267789| WEAK sts_monobit| 1| 100000| 100|0.98166534| PASSED sts_runs| 2| 100000| 100|0.05017630| PASSED sts_serial| 1| 100000| 100|0.95153782| PASSED ... sts_serial| 16| 100000| 100|0.59342390| PASSED rgb_bitdist| 1| 100000| 100|0.50763759| PASSED ... rgb_bitdist| 12| 100000| 100|0.98576422| PASSED rgb_minimum_distance| 2| 10000| 1000|0.23378443| PASSED ... rgb_minimum_distance| 5| 10000| 1000|0.13215367| PASSED rgb_permutations| 2| 100000| 100|0.54142546| PASSED ... rgb_permutations| 5| 100000| 100|0.96040216| PASSED rgb_lagged_sum| 0| 1000000| 100|0.66587166| PASSED ... rgb_lagged_sum| 31| 1000000| 100|0.00183752| WEAK rgb_lagged_sum| 32| 1000000| 100|0.13582393| PASSED rgb_kstest_test| 0| 10000| 1000|0.74708548| PASSED dab_bytedistrib| 0| 51200000| 1|0.30789191| PASSED dab_dct| 256| 50000| 1|0.89665788| PASSED dab_filltree| 32| 15000000| 1|0.67278231| PASSED dab_filltree| 32| 15000000| 1|0.35348003| PASSED dab_filltree2| 0| 5000000| 1|0.18749029| PASSED dab_filltree2| 1| 5000000| 1|0.92600020| PASSED]]>

MCMC (Markov Chain Monte Carlo) gives us a way around this impasse. It lets us draw samples from practically any probability distribution. But there’s a catch: the samples are not independent. This lack of independence means that all the familiar theory on convergence of sums of random variables goes out the window.

There’s not much theory to guide assessing the convergence of sums of MCMC samples, but there are heuristics. One of these is **effective sample size **(ESS). The idea is to have a sort of “exchange rate” between dependent and independent samples. You might want to say, for example, that 1,000 samples from a certain Markov chain are worth about as much as 80 independent samples because the MCMC samples are highly correlated. Or you might want to say that 1,000 samples from a different Markov chain are worth about as much as 300 independent samples because although the MCMC samples are dependent, they’re weakly correlated.

Here’s the definition of ESS:

where *n* is the number of samples and ρ(*k*) is the correlation at lag *k*.

This behaves well in the extremes. If your samples are independent, your effective samples size equals the actual sample size. If the correlation at lag *k* decreases extremely slowly, so slowly that the sum in the denominator diverges, your effective sample size is zero.

Any reasonable Markov chain is between the extremes. Zero lag correlation is too much to hope for, but ideally the correlations die off fast enough that the sum in the denominator not only converges but also isn’t a terribly large value.

I’m not sure who first proposed this definition of ESS. There’s a reference to it in Handbook of Markov Chain Monte Carlo where the authors cite a paper [1] in which Radford Neal mentions it. Neal cites B. D. Ripley [2].

[1] Markov Chain Monte Carlo in Practice: A Roundtable Discussion. Robert E. Kass, Bradley P. Carlin, Andrew Gelman and Radford M. Neal. The American Statistician. Vol. 52, No. 2 (May, 1998), pp. 93-100

[2] Stochlastic Simulation, B. D. Ripley, 1987.

]]>The confidence region for the object flares out over time, something like the bell of a trumpet.

**Why does the region get larger**? Because there’s uncertainty in the velocity, and the velocity gets multiplied by elapsed time.

**Why isn’t the confidence region a cone**? Because that would ignore the uncertainty in the initial position. The result would be too small.

**Why isn’t the confidence region a truncated cone**? That’s not a bad approximation, though it’s a bit too large. If we ignore probability for a moment and treat confidence intervals as deterministic limits, then we get a truncated cone. For example, suppose assume position and velocity are each within two standard deviations of their estimates. Then we’d estimate position to be between *x*_{0} – 2σ_{x} + (*v_{0}* – 2σ

**So what is the confidence region**? It’s some where between the cone and the truncated cone.

The position *x* + *t* *v* is the sum of two random variables. The first has variance σ_{x}² and the second has variance *t*² σ_{v}². Variances of independent random variables add, so the standard deviation for the sum is

√(σ_{x}² + *t*² σ_{v}²) = *t* √(σ_{x}² / *t*² + σ_{v}²)

Note that as *t* increases, the latter approaches *t* σ_{v} from above. Ignoring the uncertainty in initial position underestimates standard deviation, but the relative error decreases as *t* increases.

For large *t*, a confidence interval for position at time *t* is approximately proportional to *t*, so the width of the confidence intervals over time look like a cone. But from small *t*, the dependence on *t* is less linear and more curved.

If *a* and *b* are close, they don’t have to be very large for the beta PDF to be approximately normal. (In all the plots below, the solid blue line is the beta distribution and the dashed orange line is the normal distribution with the same mean and variance.)

On the other hand, when *a* and *b* are very different, the beta distribution can be skewed and far from normal. Note that *a* + *b* is the same in the example above and below.

Why the sharp corner above? The beta distribution is only defined on the interval [0, 1] and so the PDF is zero for negative values.

An application came up today that raised an interesting question: What if *a* + *b* is very large, but *a* and *b* are very different? The former works in favor of the normal approximation but the latter works against it.

The application had a low probability of success but a very large number of trials. Specifically, *a* + *b* would be on the order of a million, but *a* would be less than 500. Does the normal approximation hold for these kinds of numbers? Here are some plots to see.

When *a* = 500 the normal approximation is very good. It’s still pretty good when *a* = 50, but not good at all when *a* = 5.

**Update**: Mike Anderson suggested using the skewness to quantify how far a beta is from normal. Good idea.

The skewness of a beta(*a*, *b*) distribution is

2(*b* – *a*)√(*a* +* b* + 1) / (*a* + *b* + 2) √(*ab*)

Let *N* = *a* + *b* and assume *N* is large and *a* is small, so that *N*, *N* + 2, *b – **a*, and *N* – *a* are all approximately equal in ratios. Then the skewness is approximately 2 /√*a*. In other words, once the number of trials is sufficiently large, sknewness hardly depends on the number of trials and only depends on the number of successes.

The previous post said that for almost all *x* > 1, the fractional parts of the powers of *x* are uniformly distributed. Although this is true for almost all *x*, it can be hard to establish for any particular *x*. The previous post ended with the question of whether the fractional parts of the powers of 3/2 are uniformly distributed.

First, lets just plot the sequence (3/2)^{n} mod 1.

Looks kinda random. But is it uniformly distributed? One way to tell would be to look at the empirical cumulative distribution function (ECDF) and see how it compares to a uniform cumulative distribution function. This is what a quantile-quantile plot does. In our case we’re looking to see whether something has a uniform distribution, but you could use a q-q plot for any distribution. It may be most often used to test normality by looking at whether the ECDF looks like a normal CDF.

If a sequence is uniformly distributed, we would expect 10% of the values to be less than 0.1. We would expect 20% of the values to be less than 0.2. Etc. In other words, we’d expect the *quantiles* to line up with their theoretical values, hence the name “quantile-quantile” plot. On the horizontal axis we plot uniform values between 0 and 1. On the vertical axis we plot the sorted values of (3/2)^{n} mod 1.

A qq-plot indicates a good fit when values line up near the diagonal, as they do here.

For contrast, let’s look at a qq-plot for the powers of the plastic constant mod 1.

Here we get something very far from the diagonal line. The plot is flat on the left because many of the values are near 0, and it’s flat on the right because many values are near 1.

Incidentally, the Kolmogorov-Smirnov goodness of fit test is basically an attempt to quantify the impression you get from looking at a q-q plot. It’s based on a statistic that measures how far apart the empirical CDF and theoretical CDF are.

]]>In his paper Mindless statistics, Gerd Gigerenzer uses a Freudian analogy to describe the mental conflict researchers experience over statistical hypothesis testing. He says that the “statistical ritual” of NHST (null hypothesis significance testing) “is a form of conflict resolution, like compulsive hand washing.”

In Gigerenzer’s analogy, the **id** represents Bayesian analysis. Deep down, a researcher wants to know the probabilities of hypotheses being true. This is something that Bayesian statistics makes possible, but more conventional frequentist statistics does not.

The **ego** represents R. A. Fisher’s significance testing: specify a null hypothesis only, not an alternative, and report a *p*-value. Significance is calculated after collecting the data. This makes it easy to publish papers. The researcher never clearly states his hypothesis, and yet takes credit for having established it after rejecting the null. This leads to feelings of guilt and shame.

The **superego** represents the Neyman-Pearson version of hypothesis testing: pre-specified alternative hypotheses, power and sample size calculations, etc. Neyman and Pearson insist that hypothesis testing is about what to *do*, not what to *believe*. [1]

I assume Gigerenzer doesn’t take this analogy too seriously. In context, it’s a humorous interlude in his polemic against rote statistical ritual.

But there really is a conflict in hypothesis testing. Researchers naturally think in Bayesian terms, and interpret frequentist results as if they were Bayesian. They really do want probabilities associated with hypotheses, and will imagine they have them even though frequentist theory explicitly forbids this. The rest of the analogy, comparing the ego and superego to Fisher and Neyman-Pearson respectively, seems weaker to me. But I suppose you could imagine Neyman and Pearson playing the role of your conscience, making you feel guilty about the pragmatic but unprincipled use of *p*-values.

* * *

[1] “No test based upon a theory of probability can by itself provide any valuable evidence of the truth or falsehood of a hypothesis. But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern behaviour in regard to them, in following which we insure that, in the long run of experience, we shall not often be wrong.”

Neyman J, Pearson E. On the problem of the most efficient tests of statistical hypotheses. *Philos Trans Roy Soc A*, 1933;231:289, 337.

Excerpt from the new book Big Data of Complex Networks:

Big Data and data protection law provide for a number of mutual conflicts: from the perspective of Big Data analytics,

. From the perspective of the law, Big Data is either a big threat … or a major challenge for international and national lawmakers to adopt today’s data protection laws to the latest technological and economic developments.a strict application of data protection law as we know it today would set an immediate end to most Big Data applications

Emphasis added.

The author of the chapter on legal matters is Swiss and writes primarily in a European context, though all countries face similar problems.

I’m not a lawyer, though I sometimes work with lawyers as a technical expert, and sometimes help companies with the statistical aspects of HIPAA law. But as a layman the observation above sounds reasonable to me, that strict application of the law could bring many applications to a halt, for better and for worse.

In my opinion the regulations around HIPAA and de-identification are mostly reasonable. The things it prohibits mostly should be prohibited. And it has a common sense provision in the form of expert determination. If your data uses fall outside the regulation’s specific recommendations but don’t endanger privacy, you can have an expert can certify that this is the case.

**Related**:

Bayesian methods are often characterized as “subjective” because the user must choose a prior distribution, that is, a mathematical expression of prior information. The prior distribution requires information and user input, that’s for sure, but I don’t see this as being any more “subjective” than other aspects of a statistical procedure, such as the choice of model for the data (for example, logistic regression) or the choice of which variables to include in a prediction, the choice of which coefficients should vary over time or across situations, the choice of statistical test, and so forth. Indeed, Bayesian methods can in many ways be more “objective” than conventional approaches in that Bayesian inference, with its smoothing and partial pooling, is well adapted to including diverse sources of information and thus can reduce the number of data coding or data exclusion choice points in an analysis.

People worry about prior distributions, not because they’re subjective, but because they’re *explicitly* subjective. There are many other subjective factors, common to Bayesian and Frequentist statistics, but these are *implicitly* subjective.

In practice, prior distributions often don’t make much difference. For example, you might show that an optimistic prior and a pessimistic prior lead to the same conclusion.

If you have so little data that the choice of prior does make a substantial difference, being able to specify the prior is a benefit. Suppose you only have a little data but have to make a decision anyway. A frequentist might say there’s too little data for a statistical analysis to be meaningful. So what do you do? Make a decision entirely subjectively! But with a Bayesian approach, you capture what is known outside of the data at hand in the form of a prior distribution, and update the prior with the little data you have. In this scenario, a Bayesian analysis is less subjective and more informed by the data than a frequentist approach.

]]>It’s possible that treatment *X* is doing so poorly that you want to end the trial without going any further. It’s also possible that *X* is doing so well that you want to end the trial early. Both of these are rare. Most of the time an interim analysis is more concerned with **futility**. You might want to stop the trial early not because the results are really good, or really bad, but because the results are really mediocre! That is, treatments *X *and *Y* are performing so similarly that you’re afraid that you won’t be able to declare one or the other better.

Maybe treatment *X* is doing a little better than *Y*, but not so much better that you can declare with confidence that *X* is better. You might want to stop for futility if you project that not only do you not have enough evidence now, you don’t believe you will have enough evidence by the end of the trial.

Futility analysis is more about resources than ethics. If *X* is doing poorly, ethics might dictate that you stop giving *X* to patients so you stop early. If *X* is doing spectacularly well, ethics might dictate that you stop giving the control treatment, if there is an active control. But if *X* is doing so-so, there’s usually not an ethical reason to stop, unless *X* is worse than *Y* on some secondary criteria, such as having worse side effects. You want to end futile studies so you can save resources and get on with the next study, and you could argue that’s an ethical consideration, though less direct.

Futility analysis isn’t about your current estimate of effectiveness. It’s about what you think you’re estimate regard effectiveness in the future. That is, it’s a second order prediction. You’re trying to understand the effectiveness of the **trial**, not of the **treatment** per se. You’re not trying to estimate a parameter, for example, but trying to estimate what range of estimates you’re likely to make.

This is why **predictive probability** is natural for interim analysis. You’re trying to predict **outcomes**, not **parameters**. (This is subtle: you’re trying to estimate the probability of **outcomes **that lead to certain estimates of **parameters**, namely those that allow you to reach a conclusion with pre-specified significance.)

Predictive probability is a Bayesian concept, but it is useful in analyzing frequentist trial designs. You may have frequentist conclusion criteria, such as a *p*-value threshhold or some requirements on a confidence interval, but you want to know how likely it is that if the trial continues, you’ll see data that lead to meeting your criteria. In that case you want to compute the (**Bayesian**) predictive probability of meeting your **frequentist** criteria!

Most people learn R as they learn statistics: Here’s a statistical concept, and here’s how you can compute it in R. Statisticians aren’t that interested in the R language itself but see it as connective tissue between commands that are their primary interest.

This works for statisticians, but it makes the language hard for non-statisticians to approach. Years ago I managed a group of programmers who supported statisticians. At the time, there were no books for learning R without concurrently learning statistics. This created quite a barrier to entry for programmers whose immediate concern was not the statistical content of an R program.

Now there are more books on R, and some are more approachable to non-statisticians. The most accessible one I’ve seen so far is Learning Base R by Lawrence Leemis. It gets into statistical applications of R—that is ultimately why anyone is interested in R—but it doesn’t *start* there. The first 40% or so of the book is devoted to basic language features, things you’re supposed to pick up by osmosis from a book focused more on statistics than on R *per se*. This is the book I wish I could have handed my programmers who had to pick up R.

With no more information than this, what would you estimate the probability to be that the treatment is effective in the next subject? Easy: 0.7.

Now what would you estimate the probability to be that the treatment is effective in the next **two** subjects? You might say 0.49, and that would be correct if we **knew*** *that the probability of response is 0.7. But there’s uncertainty in our estimate. We don’t know that the response rate is 70%, only that we saw a 70% response rate in our small sample.

If the probability of success is *p*, then the probability of *s* successes and *f* failures in the next *s* + *f* subjects is given by

But if our probability of success has some uncertainty and we assume it has a beta(*a*, *b*) distribution, then the **predictive probability** of *s* successes and *f* failures is given by

where

In our example, after seeing 7 successes out of 10 subjects, we estimate the probability of success by a beta(7, 3) distribution. Then this says the predictive probability of two successes is approximately 0.51, a little higher than the naive estimate of 0.49. Why is this?

We’re not assuming the probability of success is 0.7, only that the mean of our estimate of the probability is 0.7. The actual probability might be higher or lower. The predictive probability calculates the probability of outcomes under all possible values of the probability, then creates a weighted average, weighing each probability of success by the probability of that value. The differences corresponding to probability above and below 0.7 approximately balance out, but the former carry a little more weight and so we get roughly what we did before.

If this doesn’t seem right, note that mean and median aren’t the same thing for asymmetric distributions. A beta(7,3) distribution has mean 0.7, but it has a probability of 0.537 of being larger than 0.7.

If our initial experiment has shown 70 successes out of 100 instead of 7 out of 10, the predictive probability of two successes would have been 0.492, closer to the value based on point estimate, but still different.

The further we look ahead, the more difference there is between using a point estimate and using a distribution that incorporates our uncertainty. Here are the probabilities for the number of successes out of the next 100 outcomes, using the point estimate 0.3 and using predictive probability with a beta(7,3) distribution.

So if we’re sure that the probability of success is 0.7, we’re pretty confident that out of 100 trials we’ll see between 60 and 80 successes. But if we model our uncertainty in the probability of response, we get quite a bit of uncertainty when we look ahead to the next 100 subjects. Now we can say that the number of responses is likely to be between 30 and 100.

]]>With data from other distributions, the mean and variance may not be sufficient statistics, and in fact there may be no (useful) sufficient statistics. The full data set is more informative than any summary of the data. But out of habit people may think that the mean and variance are enough.

Probability distributions are an idealization, of course, and so data never exactly “come from” a distribution. But if you’re satisfied with a distributional idealization of your data, there may be useful sufficient statistics.

Suppose you have data with such large outliers that you seriously doubt that it could be coming from anything appropriately modeled as a normal distribution. You might say the definition of sufficient statistics is wrong, that the full data set tells you something you couldn’t know from the summary statistics. But the sample mean and variance are still sufficient statistics in this case. They really are sufficient, *conditional on the normality assumption*, which you don’t believe! The cognitive dissonance doesn’t come from the definition of sufficient statistics but from acting on an assumption you believe to be false.

***

[1] Technically every distribution has sufficient statistics, though the sufficient statistic might be the same size as the original data set, in which case the sufficient statistic hasn’t contributed anything useful. Roughly speaking, distributions have useful sufficient statistics if they come from an “exponential family,” a set of distributions whose densities factor a certain way.

]]>For example, take English letter frequencies. These frequencies are fairly well known. E is the most common letter, followed by T, then A, etc. The string of letters “ETAOIN SHRDLU” comes from the days of Linotype when letters were arranged in that order, in decreasing order of frequency. Sometimes you’d see ETAOIN SHRDLU in print, just as you might see “QWERTY” today.

Morse code is also based on English letter frequencies. The length of a letter in Morse code varies approximately inversely with its frequency, a sort of precursor to Huffman encoding. The most common letter, E, is a single dot, while the rarer letters like J and Q have a dot and three dashes. (So does Y, even though it occurs more often than some letters with shorter codes.)

So how frequently does the letter E, for example, appear in English? That depends on what you mean by English. You can count how many times it appears, for example, in a particular edition of A Tale of Two Cities, but that isn’t the same as it’s frequency in English. And if you’d picked the novel Gadsby instead of A Tale of Two Cities you’d get very different results since that book was written without using a single letter E.

Peter Norvig reports that E accounted for 12.49% of English letters in his analysis of the Google corpus. That’s a better answer than just looking at Gadsby, or even A Tale of Two Cities, but it’s still not *English*.

What might we mean by “English” when discussing letter frequency? Written or spoken English? Since when? American, British, or worldwide? If you mean blog articles, I’ve altered the statistics from what they were a moment ago by publishing this. Introductory statistics books avoid this kind of subtlety by distinguishing between samples and populations, but in this case the population isn’t a fixed thing. When we say “English” as a whole we have in mind some idealization that strictly speaking doesn’t exist.

If we want to say, for example, what the frequency of the letter E is in English as a whole, not some particular English corpus, we can’t answer that to too many decimal places. Nor can we say, for example, which letter is the 18th most frequent. Context could easily change the second decimal place in a letter’s frequency or, among the less common letters, its frequency rank.

And yet, for practical purposes we can say E is the most common letter, then T, etc. We can design better Linotype machines and telegraphy codes using our understanding of letter frequency. At the same time, we can’t expect too much of this information. Anyone who has worked a cryptogram puzzle knows that you can’t say with certainty that the most common letter in a particular sample *must* correspond to E, the next to T, etc.

By the way, Peter Norvig’s analysis suggests that ETAOIN SHRDLU should be updated to ETAOIN SRHLDCU.

]]>

and we can generalize this to the Mittag=Leffler function

which reduces to the exponential function when α = β = 1. There are a few other values of α and β for which the Mittag-Leffler function reduces to more familiar functions. For example,

and

where erfc(*x*) is the complementary error function.

Mittag-Leffler was one person, not two. When I first saw the Mittag-Leffler theorem in complex analysis, I assumed it was named after two people, Mittag and Leffler. But the theorem and the function discussed here are named after one man, the Swedish mathematician Magnus Gustaf (Gösta) Mittag-Leffler (1846–1927).

The function that Mr. Mittag-Leffler originally introduced did not have a β parameter; that generalization came later. The function *E*_{α} is *E*_{α, 1}.

Just as you can make a couple probability distributions out of the exponential function, you can make a couple probability distributions out of the Mittag-Leffler function.

The exponential function exp(-*x*) is positive over [0, ∞) and integrates to 1, so we can define a probability distribution whose density (PDF) function is *f*(*x*) = exp(-*x*) and whose distribution function (CDF) is *F*(*x*) = 1 – exp(-*x*). The Mittag-Leffler distribution has CDF is 1 – *E*_{α}(-*x*^{α}) and so reduces to the exponential distribution when α = 1. For 0 < α < 1, the Mittag-Leffler distribution is a heavy-tailed generalization of the exponential. [1]

The Poisson distribution comes from taking the power series for exp(λ), normalizing it to 1, and using the *k*th term as the probability mass for *k*. That is,

The analogous discrete Mittag-Leffler distribution [2] has probability mass function

In addition to probability and statistics, the the Mittag-Leffler function comes up in fractional calculus. It plays a role analogous to that of the exponential distribution in classical calculus. Just as the solution to the simply differential equation

is exp(*ax*), for 0 < μ < 1, the soltuion to the **fractional differential equation**

is *a**x*^{μ-1} *E*_{μ, μ}(*a* *x*^{μ}). Note that this reduces to exp(*ax*) when μ = 1. [3]

[1] Gwo Dong Lin. Journal of Statistical Planning and Inference 74 (1998) 1–9, On the Mittag–Leffler distributions

[2] Subrata Chakraborty, S. H. Ong. Mittag-Leffler function distribution: A new generalization of hyper-Poisson distribution. arXiv:1411.0980v1

[3] Keith Oldham, Jan Myland, Jerome Spanier. An Atlas of Functions. Springer.

]]>Some zip code are so sparsely populated that people living in these areas are relatively easy to identify if you have other data. The so-called Safe Harbor provision of HIPAA (Health Insurance Portability and Accountability Act) says that it’s usually OK to include the first three digits of someone’s zip code in de-identified data. But there are 17 areas so thinly populated that even listing the first three digits of their zip code is considered too much of an identification risk. These are areas such that the first three digits of the zip code are:

- 036
- 059
- 063
- 102
- 203
- 556
- 692
- 790
- 821
- 823
- 830
- 831
- 878
- 879
- 884
- 890
- 893

This list could change over time. These are the regions that currently contain fewer than 20,000 people, the criterion given in the HIPAA regulations.

Knowing that someone is part of an area containing 20,000 people hardly identifies them. The concern is that in combination with other information, zip code data is more informative in these areas.

**Related post**: Bayesian clinical trials in one zip code

John Tukey coined many terms that have passed into common use, such as **bit **(a shortening of **binary digit**) and software. Other terms he coined are well known within their niche: **boxplot**, **ANOVA**, **rootogram**, etc. Some of his terms, such as **jackknife** and **vacuum cleaner**, were not new words *per se* but common words he gave a technical meaning to.

**Cepstrum** is an anagram of spectrum. It involves an unusual use of power spectra, and is roughly analogous to making anagrams of a word. A related term, one we will get to shortly, is **quefrency**, an anagram of **frequency**. Some people pronounce the ‘c’ in cepstrum hard (like ‘k’) and some pronounce it soft (like ‘s’).

Let’s go back to an example from my post on guitar distortion. Here’s a note played with a fairly large amount of distortion:

And here is its power spectrum:

There’s a lot going on in the spectrum, but the peaks are very regularly spaced. As I mentioned in the post on the sound of a leaf blower, this is the fingerprint of a sound with a definite pitch. Spikes in the spectrum alone don’t indicate a definite pitch if they are irregularly spaced.

The peaks are fairly periodic. How to you find periodic patterns in a signal? Fourier transform! But if you simply take the Fourier transform of a Fourier transform, you essentially get the original signal back. The key to the cepstrum is to do something else between the two Fourier transforms.

The cepstrum starts by taking the Fourier transform, then the magnitude, then the logarithm, and then the inverse Fourier transform.

When we take the magnitude, we throw away phase information, which we don’t need in this context. Taking the log of the magnitude is essentially what you do when you compute sound pressure level. Some define the cepstrum using the magnitude of the Fourier transform and some the magnitude squared. Squaring only introduces a multiple of 2 once we take logs, so it doesn’t effect the location of peaks, only their amplitude.

Taking the logarithm compresses the peaks, bringing them all into roughly the same range, making the sequence of peaks roughly periodic.

When we take the inverse Fourier transform, we now have something like a frequency, but inverted. This is what Tukey called **quefrency**.

Looking at the guitar power spectrum above, we see a sequence of peaks spaced 440 Hz apart. When we take the inverse Fourier transform of this, we’re looking at a sort of frequency of a frequency, what Tukey calls quefrency. The quefrency scale is inverted: sounds with a high frequency fundamental have overtones that are far apart on the frequency domain, so the sequence of the overtone peaks has *low* frequency.

Here’s the plot of the cepstrum for the guitar sample.

There’s a big peak at 109 on the quefrency scale. The audio clip was recorded at 48000 samples per second, so the 109 on the quefrency scale corresponds to a frequency of 48000/109 = 440 Hz. The second peak is at quefrency 215, which corresponds to 48000/215 = 223 Hz. The second peak corresponds to the perceived pitch of the note, A3, and the first peak corresponds to its first harmonic, A4. (Remember the quefrency scale is inverted relative to the frequency scale.)

I cheated a little bit in the plot above. The very highest peaks are at 0. They are so large that they make it hard to see the peaks we’re most interested in. These low quefrency peaks correspond to very high frequency noise, near the edge of the audible spectrum or beyond.

]]>

I believe rib eye steaks are better for you than rat poison. My basis for that belief is anecdotal evidence. People who have eaten rib eye steaks have fared better than people who have eaten rat poison. I don’t have exact numbers on that, but I’m pretty sure it’s true. I have more confidence in that than in any clinical trial conclusion.

Hearsay evidence about food isn’t very valuable, per observation, but since millions of people have eaten steak for thousands of years, the cumulative weight of evidence is pretty good that steak is harmless if not good for you. The number of people who have eaten rat poison is much smaller, but given the large effect size, there’s ample reason to suspect that eating rat poison is a bad idea.

Now suppose you want to get more specific and determine whether rib eye steaks are good for *you* in particular. (I wouldn’t suggest trying rat poison.) Suppose you’ve noticed that you feel better after eating a steak. Is that an anecdote or data? What if you look back through your diary and noticed that every mention of eating steak lately has been followed by some remark about feeling better than usual. Is that data? What if you decide to flip a coin each day for the next month and eat steak if the coin comes up heads and tofu otherwise. Each of these steps is an improvement, but there’s no magical line you cross between anecdote and data.

Suppose you’re destructively testing the strength of concrete samples. There are better and worse ways to conduct such experiments, but each sample gives you valuable data. If you test 10 samples and they all withstand two tons of force per square inch, you have good reason to believe the concrete the samples were taken from can withstand such force. But if you test a drug on 10 patients, you can’t have the same confidence that the drug is effective. Human subjects are more complicated than concrete samples. Concrete samples aren’t subject to placebo effects. Also, cause and effect are more clear for concrete. If you apply a load and the sample breaks, you can assume the load caused the failure. If you treat a human for a disease and they recover, you can’t be as sure that the treatment caused the recovery. That doesn’t mean medical observations aren’t data.

Carefully collected observations in one area may be less statistically valuable than anecdotal observations in another. Observations are never ideal. There’s always some degree of bias, effects that can’t be controlled, etc. There’s no quantum leap between useless anecdotes and perfectly informative data. Some data are easy to draw inference from, but data that’s harder to understand doesn’t fail to be data.

]]>People are not completely described by a handful of numbers. We’re much more complicated than that. But even in systems that are well described by a few numbers, the region around the average can be nearly empty. I’ll explain why that’s true in general, then look back at the Norma example.

Suppose you have *N* points, each described by *n* independent, standard normal random variables. That is, each point has the form (*x*_{1}, *x*_{2}, *x*_{2}, …, *x*_{n}) where each *x*_{i} is independent with a normal distribution with mean 0 and variance 1. The expected value of each coordinate is 0, so you might expect that most points are piled up near the origin (0, 0, 0, …, 0). In fact most points are in spherical shell around the origin. Specifically, as *n* becomes larger, most of the points will be in a thin shell with distance √*n* from the origin. (More details here.)

In the contest above, *n* = 9, and so we expect most contestants to be about a distance of 3 from average when we normalize each of the factors being measured, i.e. we subtract the mean so that each factor has mean 0, and we divide each by its standard deviation so the standard deviation is 1 on each factor.

We’ve made several simplifying assumptions. For example, we’ve assumed independence, though presumably some of the factors measured in the contest were correlated. There’s also a selection bias: presumably women who knew they were far from average would not have entered the contest. But we’ll run with our simplified model just to see how it behaves in a simulation.

import numpy as np # Winning critera: minimum Euclidean distance def euclidean_norm(x): return np.linalg.norm(x) # Winning criteria: min-max def max_norm(x): return max(abs(x)) n = 9 N = 3864 # Simulated normalized measurements of contestants M = np.random.normal(size=(N, n)) euclid = np.empty(N) maxdev = np.empty(N) for i in range(N): euclid[i] = euclidean_norm(M[i,:]) maxdev[i] = max_norm(M[i,:]) w1 = euclid.argmin() w2 = maxdev.argmin() print( M[w1,:] ) print( euclidean_norm(M[w1,:]) ) print( M[w2,:] ) print( max_norm(M[w2,:]) )

There are two different winners, depending on how we decide the winner. Using the Euclidean distance to the origin, the winner in this simulation was contestant 3306. Her normalized measurements were

[ 0.1807, 0.6128, -0.0532, 0.2491, -0.2634, 0.2196, 0.0068, -0.1164, -0.0740]

corresponding to a Euclidean distance of 0.7808.

If we judge the winner to be the one whose largest deviation from average is the smallest, the winner is contestant 1916. Her normalized measurements were

[-0.3757, 0.4301, -0.4510, 0.2139, 0.0130, -0.2504, -0.1190, -0.3065, -0.4593]

with the largest deviation being the last, 0.4593.

By either measure, the contestant closest to the average deviated significantly from the average in at least one dimension.

* * *

For daily posts on probability, follow @ProbFact on Twitter.

]]>You could say the same about nonlinear differential equations. Differential equations are so often nonlinear that the “nonlinear” qualifier isn’t always necessary to say explicitly. Just as a Bayesian analysis isn’t interesting just because it’s Bayesian, a differential equation isn’t necessarily interesting just because it’s nonlinear.

The analogy between Bayesian statistics and nonlinear differential equations breaks down though. Nonlinear equations are intrinsically more interesting than linear ones. But it’s no longer remarkable to solve a nonlinear differential equation numerically.

When an adjective becomes the default, it drops off and the previous default now requires an adjective. Terms like “electronic” and “digital” are fading from use. If you say you’re going to mail someone something, the default assumption is usually that you are going to email it. What used to be simply “mail” is now “snail mail.” Digital signal processing is starting to sound quaint. The abbreviation DSP is still in common use, but digital signal processing is simply signal processing. Now non-digital signal processing requires a qualifier, i.e. analog.

There was no term for Frequentist statistics when it was utterly dominant. Now of course there is. (Some people use the term “classical,” but that’s an odd term given that Bayesian analysis is older.) The term linear has been around a long time. Even when nearly all analysis was linear, people were aware that linearity was a necessary simplification.

**Related posts**:

This inequality is very general, but also very weak. It assumes very little about the random variable *X* but it also gives a loose bound. If we assume slightly more, namely that *X* has a **unimodal** distribution, then we have a tighter bound, the Vysochanskiĭ-Petunin inequality.

However, the Vysochanskiĭ-Petunin inequality does require that *k* be larger than √(8/3). In exchange for the assumption of unimodality and the restriction on *k* we get to reduce our upper bound by more than half.

While tighter than Chebyshev’s inequality, the stronger inequality is still very general. We can usually do much better if we can say more about the distribution family. For example, suppose *X* has a uniform distribution. What is the probability that *X* is more than two standard deviations from its mean? Zero, because two standard deviations puts you outside the interval the uniform is defined on!

Among familiar distributions, when is the Vysochanskiĭ-Petunin inequality most accurate? That depends, of course, on what distributions you consider familiar, and what value of *k* you use. Let’s look at normal, exponential, and Pareto. These were chosen because they have thin, medium, and thick tails. We’ll also throw in the double exponential, because it has the same tail thickness as exponential but is symmetric. We’ll let *k* be 2 and 3.

Distribution family | P(|X – E(X)| > 2σ) |
V-P estimate | P(|X – E(X)| > 3σ) |
V-P estimate |
---|---|---|---|---|

Uniform | 0.0000 | 0.1111 | 0.0000 | 0.0484 |

Normal | 0.0455 | 0.1111 | 0.0027 | 0.0484 |

Exponential | 0.0498 | 0.1111 | 0.0183 | 0.0484 |

Pareto | 0.0277 | 0.1111 | 0.0156 | 0.0484 |

Double exponential | 0.0591 | 0.1111 | 0.0144 | 0.0484 |

A normal random variable is more than 2 standard deviations away from its mean with probability 0.0455, compared to the Vysochanskiĭ-Petunin bound of 1/9 = 0.1111. A normal random variable is more than 3 standard deviations away from its mean with probability 0.0027, compared to the bound of 4/81 = 0.0484.

An exponential random variable with mean μ also has standard deviation μ, so the only way it could be more than 2μ from its mean is to be 3μ from 0. So an exponential is more that 2 standard deviations from its mean with probability exp(-3) = 0.0498, and more than 3 standard deviations with probability exp(-4) = 0.0183.

We’ll set the minimum value of our Pareto random variable to be 1. As with the exponential, the Pareto cannot be 2 standard deviations less than its mean, so we look at the probability of it being more than 2 greater than its mean. The shape parameter α must be bigger than 2 for for the variance to exist. The probability of our random variable being more than *k* standard deviations away from its mean works out to ((α-1)/((*k*-1)α))^{α} and is largest as α converges down toward 2. The limiting values for *k* equal to 2 and 3 are 1/36 = 0.0277 and 1/64 = 0.0156 respectively. Of our examples, the Pareto distribution comes closest to the Vysochanskiĭ-Petunin bounds, but doesn’t come that close.

The double exponential, also know as Laplace, has the highest probability of any of our examples of being two standard deviations from its mean, but this probability is still less than half of the Vysochanskiĭ-Petunin bound. The limit of the Pareto distribution has the highest probability of being three standard deviations from its mean, but stil less than one-third of the Vysochanskiĭ-Petunin bound.

Generic bounds are useful, especially in theoretical calculations, but it’s usually possible to do much better with specific distributions.

**More inequality posts**:

For daily posts on probability, follow @ProbFact on Twitter.

]]>My comments start at 1:30. In a nutshell, I predict that data analytics will work its way down from large companies to small companies.

]]>The hypergeometric distribution is a probability distribution with parameters *N*, *M*, and *n*. Suppose you have an urn containing *N* balls, *M* red and the rest, *N* – *M* blue and you select *n* balls at a time. The hypergeometric distribution gives the probability of selecting *k* red balls.

The **probability generating function** for a discrete distribution is the series formed by summing over the probability of an outcome *k* and *x*^{k}. So the probability generating function for a hypergeometric distribution is given by

The summation is over all integers, but the terms are only non-zero for *k* between 0 and *M* inclusive. (This may be more general than the definition of binomial coefficients you’ve seen before. If so, see these notes on the general definition of binomial coefficients.)

It turns out that *f* is a **hypergeometric function** of *x* because it is can be written as a **hypergeometric series**. (Strictly speaking, *f* is a constant multiple of a hypergeometric function. More on that in a moment.)

A hypergeometric function is defined by a pattern in its power series coefficients. The hypergeometric function *F*(*a, **b*; *c*; *x*) has a the power series

where (*n*)_{k} is the *k*th rising power of *n*. It’s a sort of opposite of factorial. Start with *n* and multiply consecutive *increasing* integers for *k* terms. (*n*)_{0} is an empty product, so it is 1. (*n*)_{1} = *n*, (*n*)_{2} = *n*(*n*+1), etc.

If the ratio of the *k*+1st term to the *k*th term in a power series is a polynomial in *k*, then the series is a (multiple of) a hypergeometric series, and you can read the parameters of the hypergeometric series off the polynomial. This ratio for our probability generating function works out to be

and so the corresponding hypergeometric function is *F*(-*M*, –*n*; *N* – *M* – *n* + 1; *x*). The constant term of a hypergeometric function is always 1, so evaluating our probability generating function at 0 tells us what the constant is multiplying *F*(-*M*, –*n*; *N* – *M* – *n* + 1; *x*). Now

and so

The hypergeometric series above gives the original hypergeometric function as defined by Gauss, and may be the most common form in application. But the definition has been extended to have any number of rising powers in the numerator and denominator of the coefficients. The classical hypergeometric function of Gauss is denoted _{2}*F*_{1} because it has two falling powers on top and one on bottom. In general, the hypergeometric function _{p}*F*_{q} has *p* rising powers in the denominator and *q* rising powers in the denominator.

The CDF of a hypergeometric distribution turns out to be a more general hypergeometric function:

where *a* = 1, *b* = *k*+1-*M*, *c* = *k*+1-*n*, *d = k*+2, and *e* = *N*+*k*+2-*M*–*n*.

Thanks to Jan Galkowski for suggesting this topic via a comment on an earlier post, Hypergeometric bootstrapping.

* * *

For daily posts on probability, follow @ProbFact on Twitter.

]]>But the combination of reproducible and random can be reconciled. Why would we want a randomized controlled trial (RCT) to be random, and why would we want it to be reproducible?

**One of the purposes** in randomized experiments is the hope of scattering complicating factors evenly between two groups. For example, one way to test two drugs on a 1000 people would be to gather 1000 people and give the first drug to all the men and the second to all the women. But maybe a person’s sex has something to do with how the drug acts. If we randomize between two groups, it’s likely that about the same number of men and women will be in each group.

The example of sex as a factor is oversimplified because there’s reason to suspect *a priori* that sex might make a difference in how a drug performs. The bigger problem is that factors we can’t anticipate or control may matter, and we’d like them scattered evenly between the two treatment groups. If we knew what the factors were, we could assure that they’re evenly split between the groups. The hope is that randomization will do that for us with things we’re unaware of. For this purpose we don’t need a process that is “truly random,” whatever that means, but a process that matches our expectations of how randomness should behave. So a pseudorandom number generator (PRNG) is fine. No need, for example, to randomize using some physical source of randomness like radioactive decay.

**Another purpose** in randomization is for the assignments to be unpredictable. We want a physician, for example, to enroll patients on a clinical trial without knowing what treatment they will receive. Otherwise there could be a bias, presumably unconscious, against assigning patients with poor prognosis if the physicians know the next treatment be the one they hope or believe is better. Note here that the randomization only has to be unpredictable from the perspective of the people participating in and conducting the trial. The assignments could be predictable, in principle, by someone *not* involved in the study.

And why would you want an randomization assignments to be **reproducible**? One reason would be to test whether randomization software is working correctly. Another might be to satisfy a regulatory agency or some other oversight group. Still another reason might be to defend your randomization in a law suit. A physical random number generator, such as using the time down to the millisecond at which the randomization is conducted would achieve random assignments and unpredictability, but not reproducibility.

Computer algorithms for generating random numbers (technically pseudo-random numbers) can achieve reproducibility, practically random allocation, and unpredictability. The randomization outcomes are predictable, and hence reproducible, to someone with access to the random number generator and its state, but unpredictable in practice to those involved in the trial. The internal state of the random number generator has to be saved between assignments and passed back into the randomization software each time.

Random number generators such as the Mersenne Twister have good statistical properties, but they also carry a large amount of state. The random number generator described here has very small state, 64 bits, and so storing and returning the state is simple. If you needed to generate a trillion random samples, Mersenne Twitster would be preferable, but since RCTs usually have less than a trillion subjects, the RNG in the article is perfectly fine. I have run the Die Harder random number generator quality tests on this generator and it performs quite well.

Need help with randomized trials? Let’s talk.

]]>I got a call one time to take a look at randomization software that wasn’t randomizing. My first thought was that the software was working as designed, and that the users were just seeing a long run. Long sequences of the same assignment are more likely than you think. You might argue, for example, that the chances of flipping five heads in a row would be (1/2)^{5} = 1/32, but that underestimates the probability because a run could start at any time. The chances that the *first* five flips are heads would indeed be 1/32. But the probability of seeing five heads in a row any time during a series of flips is much higher.

Most of the times that I’ve been asked to look at randomization software that “couldn’t be right,” the software was fine. But in this case, there wasn’t simply a long run of random results that happened to be equal. The results were truly constant. At least for some users. Some users would get different results from time to time, but others would get the same result every single time.

The problem turned out to be how the software set the seed in its random number generator. When the program started up it asked the user “Enter a number.” No indication of what kind of number or for what purpose. This number, unbeknownst to the user, was being used as the random number generator seed. Some users would enter the same number every time, and get the same randomization result, every time. Others would use more whimsy in selecting numbers and get varied output.

How do you seed a random number generator in a situation like this? A better solution would be to seed the generator with the current time, though that has drawbacks too. I write about that in another post.

A more subtle problem I’ve seen with random number generator seeding is spawning multiple processes that each generate random numbers. In a well-intentioned attempt to give each process a unique seed, the developers ended up virtually assuring that many of the processes would have exactly the same seed.

If you parallelize a huge simulation by spreading out multiple copies, but two of the processes use the same seed, then their results will be identical. Throwing out the redundant simulation would reduce your number of samples, but not noticing and keeping the redundant output would be worse because it would cause you to underestimate the amount of variation.

To avoid duplicate seeds, the developers used a random number generator to assign the RNG seeds for each process. Sounds reasonable. Randomly assigned RNG seeds give you even more random goodness. Except they don’t.

The developers had run into a variation on the famous birthday problem. In a room of 23 people, there’s a 50% chance that two people share the same birthday. And with 50 people, the chances go up to 97%. It’s not *certain* that two people will have the same birthday until you have 367 people in the room, but the chances approach 1 faster than you might think.

Applying the analog of the birthday problem to the RNG seeds explains why the project was launching processes with the same seed. Suppose you seed each process with an unsigned 16-bit integer. That means there are 65,536 possible seeds. Now suppose you launch 1,000 processes. With 65 times as many possible seeds as processes, surely every process should get its own seed, right? Not at all. There’s a 99.95% chance that two processes will have the same seed.

In this case it would have been better to seed each process with *sequential* seeds: give the first process seed 1, the second seed 2, etc. The *seeds* don’t have to be random; they just have to be unique. If you’re using a good random number generator, the outputs of 1,000 processes seeded with 1, 2, 3, …, 1000 will be independent.

]]>

A Markov chain has no memory. That’s its defining characteristic: its future behavior depends solely on where it is, not how it got there. So if you “burn in” a thousand samples, your future calculations are absolutely no different than if you had started where there first thousand samples left off. Also, any point you start at is a point you might return to, or at least return arbitrarily close to again.

So why burn in? To enter a high probability region, a place where the states of the Markov chain are more representative of the distribution you’re sampling. When someone says a chain has “burned in,” that’s fine if they mean “has entered a high probability region.” And why do you want to enter such a region? Because you’re going to average some function of your samples:

The result will be correct as *n* → ∞, but you’re going to stop after some finite *n*. When *n* is small, and your samples are in a low probability region, the average on the right might be a poor approximation to the expectation on the left.

The idea of burn-in is that you can start your MCMC procedure at some point chosen for convenience, one which might be out in the weeds, but then after a few iterations you’ll be in a high probability region. However, you don’t know that this will happen. It *probably* will happen, eventually, by definition: a random process spends most of its time where it spends most of its time! It is possible, though unlikely, that you could be in a lower probability region at the end of your burn-in period than at the beginning. Or maybe your chain is slowly moving toward a higher probability region, but you’re still not close at the end of your burn-in.

If you know where a high probability region is, just start there. Then you’ve “burned in” immediately. However, with a very complicated problem you might not know where a high probability region is. So you hope that a few steps of your chain will land you in a high probability region. And maybe it will. But if you understand your problem so poorly that you have no idea where the probability is concentrated, you’re going to have a hard time evaluating your results.

]]>

The problem is finding the expected value of *f*(*X*) where *X* is some random variable. If you can draw independent samples *x*_{i} from *X*, the solution is simple:

When it’s possible to draw these independent samples, the sum above is well understood. It’s easy to estimate the error after *n* samples, or to turn it the other way around, estimate what size *n* you need so that the error is probably below a desired threshold.

MCMC is a way of making the approximation above work even though it’s not practical to draw independent random samples from *X*. In Bayesian statistics, *X* is the posterior distribution of your parameters and you want to find the expected value of some function of these parameters. The problem is that this posterior distribution is typically complicated, high-dimensional, and unique to your problem (unless you have a simple conjugate model). You don’t know how to draw independent samples from *X*, but there are standard ways to construct a **Markov chain** whose samples make the approximation above work. The samples are **not independent** but in the limit the set of samples has the same distribution as *X*.

MCMC is either simple or mysterious, depending on your perspective. It’s simple to write code that should work, eventually, in theory. Writing efficient code is another matter. And above all, knowing when your answer is good enough is tricky. If the samples were independent, the Central Limit Theorem would tell you how the sum should behave. But since the samples are **dependent**, all bets are off. Almost all we know is that the average on the right side converges to the expectation on the left side. Aside from toy problems there’s very little theory to tell you when your average is sufficiently close to what you want to compute.

Because there’s not much theory, there’s a lot of superstition and folklore. As with other areas, there’s some truth in MCMC superstition and folklore, but also some error and nonsense.

In some ways MCMC is truly marvelous. It’s disappointing that there isn’t more solid theory around convergence, but it lets you take a stab at problems that would be utterly intractable otherwise. In theory you can’t know when it’s time to stop, and in practice you can fool yourself into thinking you’ve seen convergence when you haven’t, such as when you have a bimodal distribution and you’ve only been sampling from one mode. But it’s also common in practice to have some confidence that a calculation has converged because the results are qualitatively correct: this value should be approximately this other value, this should be less than that, etc.

Bayesian statistics is older than Frequentist statistics, but it didn’t take off until statisticians discovered MCMC in the 1980s, after physicists discovered it in the 1950s. Before that time, people might dismiss Bayesian statistics as being interesting but impossible to compute. But thanks to Markov Chain Monte Carlo, and Moore’s law, many problems are numerically tractable that were not before.

Since MCMC gives a universal approach for solving certain kinds of problems, some people equate the algorithm with the problem. That is, they see MCMC as the solution rather than an algorithm to compute the solution. They forget what they were ultimately after. It’s sometimes possible to compute posterior probabilities more quickly and more accurately by other means, such as numerical integration.

]]>

In the phrase “big *n*, little *p*” the symbol *p* means the number of measurements per subject. Traditional data sets are “big *n*, little *p*” because you have far more subjects than measurements per subject. For example, maybe you measure 10 things about 1000 patients.

Big data sets, such as those coming out of bioinformatics, are often “big *p*, little *n*.” For example, maybe you measure 20,000 biomarkers on 50 patients. This turns classical statistics sideways, literally and figuratively, literally in the sense that a “big *p*, little *n*” data set looks like the transpose of a “big *n*, little *p*” data set.

From the vantage point of a traditional statistician, “big *p*, little *n*” data sets give you very little to work with. If *n* is small, it doesn’t matter how big *p* is. In the example above, *n* = 50, not a big data set. But the biologist will say “What do you mean it’s not a big data set? I’ve given you 1,000,000 measurements!”

So how to you take advantage of large *p* even though *n* is small? That’s the big question. It summarizes the research program of many people in statistics and machine learning. There’s no general answer, at least not yet, though progress is being made in specific applications.

**Related post**: Nomenclatural abomination

For daily tips on data science, follow @DataSciFact on Twitter.

]]>Now suppose we know that a project has lasted until *t*_{0} so far. Then the expected finish time is α*t*_{0}/(α-1) and so the expected additional time is *t*_{0}/(α-1). Note that both are proportional to *t*_{0}. So the longer it has taken, the longer it will take. If the project is running late, you can expect the time remaining to be even more than the expected time before the project started. The finish line is moving away from you!

For example, suppose α = 2 (in applications of power laws, α is often between 1 and 3) and you’re measuring time in years. When the project starts at *t* = 1, it is expected to take one year, until *t* = 2. Now suppose you’re starting the second year and the project isn’t done. Now it’s expected to finish at *t* = 4, two more years. When you started, the project was supposed to take a year. One year later, it has taken a year, and should be expected to take two more years. I said “should be expected” rather than “is expected” because no one would believe such an estimate. (Ever heard of the Big Dig? Or other megaprojects?)

Note that we have computed the conditional probability given only the time it has taken so far, and *no other information*. If you know more, for example maybe you know that some specific pieces have been completed, then you should use that information.

This is related to the Lindy effect. The longer a cultural artifact has been around, the longer it is expected to last into the future.

* * *

For daily posts on probability, follow @ProbFact on Twitter.

]]>Social media data is undoubtedly big. However, when we zoom into individuals for whom, for example, we would like to make relevant recommendations, we often have little data for each specific individual. We have to exploit the characteristics of social media and use its multidimensional, multisource, and multisite data to aggregate information with sufficient statistics for effective mining.

Brad Efron said something similar:

… enormous data sets often consist of enormous numbers of small sets of data, none of which by themselves are enough to solve the thing you are interested in, and they fit together in some complicated way.

Big data doesn’t always tell us directly what we’d like to know. It may give us a gargantuan amount of slightly related data, from which we may be able to tease out what we want.

**Related post**: New data, not just bigger data

A naive approach would be to gloss over the fact that you have discrete data and use the MLE (maximum likelihood estimator) for continuous data. That does a very poor job [1]. The discrete case needs its own estimator.

To illustrate this, we start by generating 5,000 samples from a discrete power law with exponent 3.

import numpy.random alpha = 3 n = 5000 x = numpy.random.zipf(alpha, n)

The continuous MLE is very simple to implement:

alpha_hat = 1 + n / sum(log(x))

Unfortunately, it gives an estimate of 6.87 for alpha, though we know it should be around 3.

The MLE for the discrete power law distribution satisfies

Here ζ is the Riemann zeta function, and *x*_{i} are the samples. Note that the left side of the equation is the derivative of log ζ, or what is sometimes called the logarithmic derivative.

There are three minor obstacles to finding the estimator using Python. First, SciPy doesn’t implement the Riemann zeta function ζ(*x*) per se. It implements a generalization, the Hurwitz zeta function, ζ(*x*, *q*). Here we just need to set *q* to 1 to get the Riemann zeta function.

Second, SciPy doesn’t implement the derivative of zeta. We don’t need much accuracy, so it’s easy enough to implement our own. See an earlier post for an explanation of the implementation below.

Finally, we don’t have an explicit equation for our estimator. But we can easily solve for it using the bisection algorithm. (Bisect is slow but reliable. We’re not in a hurry, so we might as use something reliable.)

from scipy import log from scipy.special import zeta from scipy.optimize import bisect xmin = 1 def log_zeta(x): return log(zeta(x, 1)) def log_deriv_zeta(x): h = 1e-5 return (log_zeta(x+h) - log_zeta(x-h))/(2*h) t = -sum( log(x/xmin) )/n def objective(x): return log_deriv_zeta(x) - t a, b = 1.01, 10 alpha_hat = bisect(objective, a, b, xtol=1e-6) print(alpha_hat)

We have assumed that our data follow a power law immediately from *n* = 1. In practice, power laws generally fit better after the first few elements. The code above works for the more general case if you set `xmin`

to be the point at which power law behavior kicks in.

The bisection method above searches for a value of the power law exponent between 1.01 and 10, which is somewhat arbitrary. However, power law exponents are very often between 2 and 3 and seldom too far outside that range.

The code gives an estimate of α equal to 2.969, very near the true value of 3, and much better than the naive estimate of 6.87.

Of course in real applications you don’t know the correct result before you begin, so you use something like a confidence interval to give you an idea how much uncertainty remains in your estimate.

The following equation [2] gives a value of σ from a normal approximation to the distribution of our estimator.

So an approximate 95% confidence interval would be the point estimate +/- 2σ.

from scipy.special import zeta from scipy import sqrt def zeta_prime(x, xmin=1): h = 1e-5 return (zeta(x+h, xmin) - zeta(x-h, xmin))/(2*h) def zeta_double_prime(x, xmin=1): h = 1e-5 return (zeta(x+h, xmin) -2*zeta(x,xmin) + zeta(x-h, xmin))/h**2 def sigma(n, alpha_hat, xmin=1): z = zeta(alpha_hat, xmin) temp = zeta_double_prime(alpha_hat, xmin)/z temp -= (zeta_prime(alpha_hat, xmin)/z)**2 return 1/sqrt(n*temp) print( sigma(n, alpha_hat) )

Here we use a finite difference approximation for the second derivative of zeta, an extension of the idea used above for the first derivative. We don’t need high accuracy approximations of the derivatives since statistical error will be larger than the approximation error.

In the example above, we have α = 2.969 and σ = 0.0334, so a 95% confidence interval would be [2.902, 3.036].

[1] Using the continuous MLE with discrete data is not so bad when the minimum output *x*_{min} is moderately large. But here, where *x*_{min} = 1 it’s terrible.

[2] Equation 3.6 from Power-law distributions in empirical data by Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman.

]]>