Big Data and data protection law provide for a number of mutual conflicts: from the perspective of Big Data analytics, a strict application of data protection law as we know it today would set an immediate end to most Big Data applications. From the perspective of the law, Big Data is either a big threat … or a major challenge for international and national lawmakers to adopt today’s data protection laws to the latest technological and economic developments.
The author of the chapter on legal matters is Swiss and writes primarily in a European context, though all countries face similar problems.
I’m not a lawyer, though I sometimes work with lawyers, and sometimes help companies with the statistical aspects of HIPAA law. But as a layman the observation above sounds reasonable to me, that strict application of the law could bring many applications to a halt, for better and for worse.
In my opinion the regulations around HIPAA and de-identification are mostly reasonable. The things it prohibits mostly should be prohibited. And it has a common sense provision in the form of expert determination. If your data uses fall outside the regulation’s specific recommendations but don’t endanger privacy, you can have an expert can certify that this is the case.
Suppose you shuffle a deck of cards. How likely is it that there are two cards that were next to each other before the shuffle are still next to each other after the shuffle?
It depends on how well you shuffle the cards. If you do what’s called a “faro shuffle” then the probability of a pair of cards remaining neighbors is 0. In a faro shuffle, if the cards were originally numbered 1 through 52, the new arrangement will be 1, 27, 2, 28, 3, 29, …, 26, 52. All pairs are split up.
But if you shuffle the cards so well that the new arrangement is a random permutation of the original, there’s about an 86% chance that at least one pair of neighbors remain neighbors. To be more precise, consider permutations of n items. As n increases, the probability of no two cards remaining consecutive converges to exp(-2), and so the probability of at least two cards remaining next to each other converges to 1 – exp(-2).
By the way, this considers pairs staying next to each other in either order. For example, if the cards were numbered consecutively, then either a permutation with 7 followed by 8 or 8 followed by 7 would count.
More generally, for large n, the probability of k pairs remaining consecutive after a shuffle is 2ke-2 / k!.
One application of this result would be testing. If you randomly permute a list of things to break up consecutive pairs, it probably won’t work. Instead, you might want to split your list in half, randomize each half separately, then interleave the two half lists as in the faro shuffle.
Another application would be fraud detection. If you’re suspicious that a supposedly random process isn’t splitting up neighbors, you could use the result presented here to calibrate your expectations. Maybe what you’re seeing is consistent with good randomization. Or maybe not. Note that the result here looks at any pair remaining together. If a particular pair remains together consistently, that’s more suspicious.
A hash function maps arbitrarily long input strings to fixed-length outputs. For example, SHA-256 maps its input to a string of 256 bits. A cryptographically secure hash function h is a one-way function, i.e. given a message m it’s easy to compute h(m) but it’s not practical to go the other way, to learn anything about m from h(m). Secure hash functions are useful for message authentication codes because it is practically impossible to modify m without changing h(m).
Ideally, a secure hash is “indistinguishable from a random mapping.”  So if a hash function has a range of size N, how many items can we send through the hash function before we can expect two items to have same hash value? By the pigeon hole principle, we know that if we hash N+1 items, two of them are certain to have the same hash value. But it’s likely that a much smaller number of inputs will lead to a collision, two items with the same hash value.
The famous birthday problem illustrates this. You could think of birthdays as a random mapping of people into 366 possible values . In a room of less than 366 people, it’s possible that everyone has a different birthday. But in a group of 23 people, there are even odds that two people have the same birthday.
Variations on the birthday problem come up frequently. For example, in seeding random number generators. And importantly for this post, the birthday problem comes up in attacking hash functions.
When N is large, it is likely that hashing √N values will lead to a collision. We prove this below.
The proof below is a little informal. It could be made more formal by replacing the approximate equalities with equalities and adding the necessary little-o terms.
Suppose we’re hashing n items to a range of size N = n2. The exact probability that all n items have unique hash values is given in here. Taking the log of both sides gives us the first line of the proof below.
The first approximation is based on the first three terms in the asymptotic expansion for log Γ given here, applied to both log gamma expressions. (The third terms from the two asymptotic series are the same so they cancel out.) The second line isn’t exactly what you’d get by applying the asymptotic expansion. It’s been simplified a little. The neglected terms are not a mistake but terms that can be left out because they go to zero.
The second approximation comes from using the first two terms in the power series for log(1 + x). One term isn’t enough since that would reduce to zero. The final approximation is simply taking the limit as n goes to infinity. Concretely, we’re willing to say that a billion and one divided by a billion is essentially 1.
So the probability of no collisions is exp(-1/2) or about 60%, which means there’s a 40% chance of at least one collision. As a rule of thumb, a hash function with range of size N can hash on the order of √N values before running into collisions.
This means that with a 64-bit hash function, there’s about a 40% chance of collisions when hashing 232 or about 4 billion items. If the output of the hash function is discernibly different from random, the probability of collisions may be higher. A 64-bit hash function cannot be secure since an attacker could easily hash 4 billion items. A 256-bit or 512-bit hash could in principle be secure since one could expect to hash far more items before collisions are likely. Whether a particular algorithm like SHA3-512 is actually secure is a matter for cryptologists, but it is at least feasible that a hash with a 512-bit range could be secure, based on the size of its range, while a 64-bit hash cannot be.
We used an asymptotic argument above rather than numerically evaluating the probabilities because this way we get a more general result. But even if we were only interested in a fix but large n, we’d run into numerical problems. This is one of those not uncommon cases where a pencil-and-paper approximation is actually more accurate than direct calculation with no (explicit) approximations.
There are numerous numerical problems with direct calculation of the collision probability. First, without taking logarithms we’d run into overflow and underflow. Second, for large enough n, n2 – n = n2 in floating point representation. IEEE 754 doubles have 53 bits of precision. This isn’t enough to distinguish values that differ, say, in the 128th bit. Finally, the two log gamma terms are large, nearly equal numbers. The cardinal rule of numerical analysis is to avoid subtracting nearly equal numbers. If two numbers agree to k bits, you could lose k bits of precision in carrying out their difference. See Avoiding overflow, underflow, and loss of precision for more along these lines.
 Cryptography Engineering by Ferguson, Schneier, and Kohno
 Leap year of course complicates things since February 29 birthdays are less common than other birthdays. Another complication is that birthdays are not entirely evenly distributed for the other days of the year. But these complications don’t ruin the party trick: In a room of 30 people, two people usually share a birthday.
Bayesian methods are often characterized as “subjective” because the user must choose a prior distribution, that is, a mathematical expression of prior information. The prior distribution requires information and user input, that’s for sure, but I don’t see this as being any more “subjective” than other aspects of a statistical procedure, such as the choice of model for the data (for example, logistic regression) or the choice of which variables to include in a prediction, the choice of which coefficients should vary over time or across situations, the choice of statistical test, and so forth. Indeed, Bayesian methods can in many ways be more “objective” than conventional approaches in that Bayesian inference, with its smoothing and partial pooling, is well adapted to including diverse sources of information and thus can reduce the number of data coding or data exclusion choice points in an analysis.
People worry about prior distributions, not because they’re subjective, but because they’re explicitly subjective. There are many other subjective factors, common to Bayesian and Frequentist statistics, but these are implicitly subjective.
In practice, prior distributions often don’t make much difference. For example, you might show that an optimistic prior and a pessimistic prior lead to the same conclusion.
If you have so little data that the choice of prior does make a substantial difference, being able to specify the prior is a benefit. Suppose you only have a little data but have to make a decision anyway. A frequentist might say there’s too little data for a statistical analysis to be meaningful. So what do you do? Make a decision entirely subjectively! But with a Bayesian approach, you capture what is known outside of the data at hand in the form of a prior distribution, and update the prior with the little data you have. In this scenario, a Bayesian analysis is less subjective and more informed by the data than a frequentist approach.
In geometry, you’d say that if a square has side x, then it has area x2.
In calculus, you’d say more. First you’d say that if a square has side nearx, then it has area nearx2. That is, area is a continuous function of the length of a side. As the length of the side changes, there’s never an abrupt jump in area. Next you could be more specific and say that a small change Δx to a side of length x corresponds to approximately a change of 2x Δx in the area.
In probability, you ask what is the area of a square like if you pick the length of its side at random. If you pick the length of the side from a distribution with mean μ, does the distribution of the area have mean μ2? No, but if the probability distribution on side length is tightly concentrated around μ, then the distribution on area will be concentrated near μ2. And you can approximate just how near the area is to μ2 using the delta method, analogous to the calculus discussion above.
If the distribution on side lengths is not particularly concentrated, finding the distribution on the area is more interesting. It will depend on the specific distribution on side length, and the mean area might not be particularly close to the square of the mean side length. The function to compute area is trivial, and yet the question of what happens when you stick a random variable into that function is not trivial. Random variables behave as you might expect when you stick them into linear functions, but offer surprises when you stick them into nonlinear functions.
Suppose you pick the length of the side of a square uniformly from the interval [0, 1]. Then the average side is 1/2, and so you might expect the average area to be 1/4. But the expected area is actually 1/3. You could see this a couple ways, analytically and empirically.
First an analytical derivation. If X has a uniform [0, 1] distribution and Z = X2, then the CDF of Z is
Prob(Z ≤ z) = Prob(X ≤ √z) = √ z.
and so the PDF for Z, the derivative of the CDF, is -1/2√z. From there you can compute the expected value by integrating z times the PDF.
You could check your calculations by seeing whether simulation gives you similar results. Here’s a little Python code to do that.
from random import random
N = 1000000
print( sum([random()**2 for _ in range(N)] )/N )
When I run this, I get 0.33386, close to 1/3.
Now lets look at an exponential distribution on side length with mean 1. Then a calculation similar to the one above shows that the expected value of the product is 2. You can also check this with simulation. This time we’ll be a little fancier and let SciPy generate our random values for us.
print( sum(expon.rvs(size=N)**2)/N )
When I ran this, I got 1.99934, close to the expected value of 2.
You’ll notice that in both examples, the expected value of the area is more than the square of the expected value of the side. This is not a coincidence but consequence of Jensen’s inequality. Squaring is a convex function, so the expected value of the square is larger than the square of the expected value for any random variable.
The hazard function of a probability distribution is the instantaneous probability density of an event given that it hasn’t happened yet. This works out to be the ratio of the PDF (probability density function) to the CCDF (complementary cumulative density function).
For the standard normal distribution, the hazard function is
and has a surprisingly simple continued fraction representation:
Aside from being an elegant curiosity, this gives an efficient way to compute the hazard function for large x. (It’s valid for any positive x, but most efficient for large x.)
An interim analysis of a clinical trial is an unusual analysis. At the end of the trial you want to estimate how well some treatment X works. For example, you want to how likely is it that treatment X works better than the control treatment Y. But in the middle of the trial you want to know something more subtle.
It’s possible that treatment X is doing so poorly that you want to end the trial without going any further. It’s also possible that X is doing so well that you want to end the trial early. Both of these are rare. Most of the time an interim analysis is more concerned with futility. You might want to stop the trial early not because the results are really good, or really bad, but because the results are really mediocre! That is, treatments X and Y are performing so similarly that you’re afraid that you won’t be able to declare one or the other better.
Maybe treatment X is doing a little better than Y, but not so much better that you can declare with confidence that X is better. You might want to stop for futility if you project that not only do you not have enough evidence now, you don’t believe you will have enough evidence by the end of the trial.
Futility analysis is more about resources than ethics. If X is doing poorly, ethics might dictate that you stop giving X to patients so you stop early. If X is doing spectacularly well, ethics might dictate that you stop giving the control treatment, if there is an active control. But if X is doing so-so, there’s usually not an ethical reason to stop, unless X is worse than Y on some secondary criteria, such as having worse side effects. You want to end futile studies so you can save resources and get on with the next study, and you could argue that’s an ethical consideration, though less direct.
Futility analysis isn’t about your current estimate of effectiveness. It’s about what you think you’re estimate regard effectiveness in the future. That is, it’s a second order prediction. You’re trying to understand the effectiveness of the trial, not of the treatment per se. You’re not trying to estimate a parameter, for example, but trying to estimate what range of estimates you’re likely to make.
This is why predictive probability is natural for interim analysis. You’re trying to predict outcomes, not parameters. (This is subtle: you’re trying to estimate the probability of outcomes that lead to certain estimates of parameters, namely those that allow you to reach a conclusion with pre-specified significance.)
Predictive probability is a Bayesian concept, but it is useful in analyzing frequentist trial designs. You may have frequentist conclusion criteria, such as a p-value threshhold or some requirements on a confidence interval, but you want to know how likely it is that if the trial continues, you’ll see data that lead to meeting your criteria. In that case you want to compute the (Bayesian) predictive probability of meeting your frequentist criteria!
We’re about to see a lot of new, powerful, inexpensive medical devices come out. And to my surprise, I’ve contributed to a few of them.
Growing compute power and shrinking sensors open up possibilities we’re only beginning to explore. Even when the things we want to observe elude direct measurement, we may be able to infer them from other things that we can now measure accurately, inexpensively, and in high volume.
In order to infer what you’d like to measure from what you can measure, you need a mathematical model. Or if you’d like to make predictions about the future from data collected in the past, you need a model. And that’s where I come in. Several companies have hired me to help them create medical devices by working on mathematical models. These might be statistical models, differential equations, or a combination of the two. I can’t say much about the projects I’ve worked on, at least not yet. I hope that I’ll be able to say more once the products come to market.
I started my career doing mathematical modeling (partial differential equations) but wasn’t that interested in statistics or medical applications. Then through an unexpected turn of events, I ended up spending a dozen years working in the biostatistics department of the world’s largest cancer center.
After leaving MD Anderson and starting my consultancy, several companies have approached me for help with mathematical problems associated with their idea for a medical device. These are ideal projects because they combine my earlier experience in mathematical modeling with my more recent experience with medical applications.
If you have an idea for a medical device, or know someone who does, let’s talk. I’d like to help.
Suppose you did a pilot study with 10 subjects and found a treatment was effective in 7 out of the 10 subjects.
With no more information than this, what would you estimate the probability to be that the treatment is effective in the next subject? Easy: 0.7.
Now what would you estimate the probability to be that the treatment is effective in the next two subjects? You might say 0.49, and that would be correct if we knewthat the probability of response is 0.7. But there’s uncertainty in our estimate. We don’t know that the response rate is 70%, only that we saw a 70% response rate in our small sample.
If the probability of success is p, then the probability of s successes and f failures in the next s + f subjects is given by
But if our probability of success has some uncertainty and we assume it has a beta(a, b) distribution, then the predictive probability of s successes and f failures is given by
In our example, after seeing 7 successes out of 10 subjects, we estimate the probability of success by a beta(7, 3) distribution. Then this says the predictive probability of two successes is approximately 0.51, a little higher than the naive estimate of 0.49. Why is this?
We’re not assuming the probability of success is 0.7, only that the mean of our estimate of the probability is 0.7. The actual probability might be higher or lower. The predictive probability calculates the probability of outcomes under all possible values of the probability, then creates a weighted average, weighing each probability of success by the probability of that value. The differences corresponding to probability above and below 0.7 approximately balance out, but the former carry a little more weight and so we get roughly what we did before.
If this doesn’t seem right, note that mean and median aren’t the same thing for asymmetric distributions. A beta(7,3) distribution has mean 0.7, but it has a probability of 0.537 of being larger than 0.7.
If our initial experiment has shown 70 successes out of 100 instead of 7 out of 10, the predictive probability of two successes would have been 0.492, closer to the value based on point estimate, but still different.
The further we look ahead, the more difference there is between using a point estimate and using a distribution that incorporates our uncertainty. Here are the probabilities for the number of successes out of the next 100 outcomes, using the point estimate 0.3 and using predictive probability with a beta(7,3) distribution.
So if we’re sure that the probability of success is 0.7, we’re pretty confident that out of 100 trials we’ll see between 60 and 80 successes. But if we model our uncertainty in the probability of response, we get quite a bit of uncertainty when we look ahead to the next 100 subjects. Now we can say that the number of responses is likely to be between 30 and 100.
I was catching up on Engines of our Ingenuity episodes this evening when the following line jumped out at me:
If I flip a coin a million times, I’m virtually certain to get 50 percent heads and 50 percent tails.
Depending on how you understand that line, it’s either imprecise or false. The more times you flip the coin, the more likely you are to get nearly half heads and half tails, but the less likely you are to get exactly half of each. I assume Dr. Lienhard knows this and that by “50 percent” he meant “nearly half.”
Let’s make the fuzzy statements above more quantitative. Suppose we flip a coin 2n times for some large number n. Then a calculation using Stirling’s approximation shows that the probability of n heads and n tails is approximately
which goes to zero as n goes to infinity. If you flip a coin a million times, there’s less than one chance in a thousand that you’d get exactly half heads.
Next, let’s quantify the statement that nearly half the tosses are likely to be heads. The normal approximation to the binomial tells us that for large n, the number of heads out of 2n tosses is approximately distributed like a normal distribution with the same mean and variance, i.e. mean n and variance n/2. The proportion of heads is thus approximately normal with mean 1/2 and variance 1/8n. This means the standard deviation is 1/√(8n). So, for example, about 95% of the time the proportion of heads will be 1/2 plus or minus 2/√(8n). As n goes to infinity, the width of this interval goes to 0. Alternatively, we could pick some fixed interval around 1/2 and show that the probability of the proportion of heads being outside that interval goes to 0.
Experience with the normal distribution makes people think all distributions have (useful) sufficient statistics . If you have data from a normal distribution, then the sufficient statistics are the sample mean and sample variance. These statistics are “sufficient” in that the entire data set isn’t any more informative than those two statistics. They effectively condense the data for you. (This is conditional on knowing the data come from a normal. More on that shortly.)
With data from other distributions, the mean and variance may not be sufficient statistics, and in fact there may be no (useful) sufficient statistics. The full data set is more informative than any summary of the data. But out of habit people may think that the mean and variance are enough.
Probability distributions are an idealization, of course, and so data never exactly “come from” a distribution. But if you’re satisfied with a distributional idealization of your data, there may be useful sufficient statistics.
Suppose you have data with such large outliers that you seriously doubt that it could be coming from anything appropriately modeled as a normal distribution. You might say the definition of sufficient statistics is wrong, that the full data set tells you something you couldn’t know from the summary statistics. But the sample mean and variance are still sufficient statistics in this case. They really are sufficient, conditional on the normality assumption, which you don’t believe! The cognitive dissonance doesn’t come from the definition of sufficient statistics but from acting on an assumption you believe to be false.
 Technically every distribution has sufficient statistics, though the sufficient statistic might be the same size as the original data set, in which case the sufficient statistic hasn’t contributed anything useful. Roughly speaking, distributions have useful sufficient statistics if they come from an “exponential family,” a set of distributions whose densities factor a certain way.
The Mittag-Leffler function is a generalization of the exponential function. Since k!= Γ(k + 1), we can write the exponential function’s power series as
and we can generalize this to the Mittag=Leffler function
which reduces to the exponential function when α = β = 1. There are a few other values of α and β for which the Mittag-Leffler function reduces to more familiar functions. For example,
where erfc(x) is the complementary error function.
Mittag-Leffler was one person, not two. When I first saw the Mittag-Leffler theorem in complex analysis, I assumed it was named after two people, Mittag and Leffler. But the theorem and the function discussed here are named after one man, the Swedish mathematician Magnus Gustaf (Gösta) Mittag-Leffler (1846–1927).
The function that Mr. Mittag-Leffler originally introduced did not have a β parameter; that generalization came later. The function Eα is Eα, 1.
Mittag-Leffler probability distributions
Just as you can make a couple probability distributions out of the exponential function, you can make a couple probability distributions out of the Mittag-Leffler function.
Continuous Mittag-Leffler distribution
The exponential function exp(-x) is positive over [0, ∞) and integrates to 1, so we can define a probability distribution whose density (PDF) function is f(x) = exp(-x) and whose distribution function (CDF) is F(x) = 1 – exp(-x). The Mittag-Leffler distribution has CDF is 1 – Eα(-xα) and so reduces to the exponential distribution when α = 1. For 0 < α < 1, the Mittag-Leffler distribution is a heavy-tailed generalization of the exponential. 
Discrete Mittag-Leffler distribution
The Poisson distribution comes from taking the power series for exp(λ), normalizing it to 1, and using the kth term as the probability mass for k. That is,
The analogous discrete Mittag-Leffler distribution  has probability mass function
Fractional differential equations
In addition to probability and statistics, the the Mittag-Leffler function comes up in fractional calculus. It plays a role analogous to that of the exponential distribution in classical calculus. Just as the solution to the simply differential equation
is exp(ax), for 0 < μ < 1, the soltuion to the fractional differential equation
is axμ-1Eμ, μ(axμ). Note that this reduces to exp(ax) when μ = 1. 
 Gwo Dong Lin. Journal of Statistical Planning and Inference 74 (1998) 1–9, On the Mittag–Leffler distributions
 Subrata Chakraborty, S. H. Ong. Mittag-Leffler function distribution: A new generalization of hyper-Poisson distribution. arXiv:1411.0980v1
 Keith Oldham, Jan Myland, Jerome Spanier. An Atlas of Functions. Springer.
Telephone numbers typically have around three distinct prime factors.
The length of a phone number varies by country, but US a phone number is a 10 digit number, and 10-digit numbers are often divisible by three different prime numbers, give or take a couple. Assuming phone numbers are scattered among possible 10-digit numbers in a way that doesn’t bias their number of prime factors, these numbers will often have between 1 and 5 prime factors. If a country has 9- or 11-digit phone numbers, the result is essentially the same.
Let ω(n) be the number of distinct prime factors of n. Then the Erdős–Kac theorem says roughly that ω(n) is distributed like a normal random variable with mean and variance log log n. More precisely, fix two numbers a and b. For a given value of x, count the proportion of positive integers less than x where
(ω(n) – log log n) / sqrt( log log n)
is between a and b. Then in the limit as x goes to infinity, this proportion approaches the probability that a standard normal random variable is between a and b.
So by that heuristic, the number of distinct prime factors of a 10-digit number is approximately normally distributed with mean and variance log log 10^11 = 3.232, and such a distribution is between 1 and 5 around 73% of the time.
My business phone number, for example, is 8324228646. Obviously this is divisible by 2. In fact it equals 2 × 32 × 462457147, and so it has exactly three distinct prime factors: 2, 3, and 462457147.
Here’s how you could play with this using Python.
from sympy.ntheory import factorint
I looked in SymPy and didn’t see an implementation of ω(n) directly, but it does have a function factorint that returns the prime factors of a number, along with their multiplicities, in a dictionary. So ω(n) is just the size of that dictionary.
I took the first 20 phone numbers in my contacts and ran them through omega and got results consistent with what you’d expect from the theory above. One was prime, and none had more than five factors.
A while back I wrote about how the leading digits of factorials follow Benford’s law. That is, if you look at just the first digit of a sequence of factorials, they are not evenly distributed. Instead, 1’s are most popular, then 2’s, etc. Specifically, the proportion of factorials starting with n is roughly log10(1 + 1/n).
Someone has proved that the limiting distribution of leading digits of factorials exactly satisfies Benford’s law. But if we didn’t know this, we might use a chi-square statistic to measure how well the empirical results match expectations. As I argued in the previous post, statistical tests don’t apply here, but they can be useful anyway in giving us a way to measure the size of the deviations from theory.
Benford’s law makes a better illustration of the chi-square test than the example of prime remainders because the bins are unevenly sized, which they’re allowed to be in general. In the prime remainder post, they were all the same size.
The original post on leading digits of factorial explains why we compute the leading digits the way we do. Only one detail has changed: the original post used Python 2 and this one uses Python 3. Integer division was the default in Python 2, but now in Python 3 we have to use // to explicitly ask for integer division, floating point division being the new default.
Here’s a plot of the distribution of the leading digits for the first 500 factorials.
And here’s code to compute the chi-square statistic:
from math import factorial, log10
while n > 9:
n = n//10
def chisq_stat(O, E):
return sum( [(o - e)**2/e for (o, e) in zip(O, E)] )
# Waste the 0th slot to make the code simpler.
digits = *10
N = 500
for i in range(N):
digits[ leading_digit_int( factorial(i) ) ] += 1
expected = [ N*log10(1 + 1/n) for n in range(1, 10) ]
print( chisq_stat(digits[1:], expected) )
This gives a chi-square statistic of 7.693, very near the mean value of 8 for a chi-square distribution with eight degrees of freedom. (There are eight degrees of freedom, not nine, because if we know how many factorials start with the digits 1 through 8, we know how many start with 9.)
So the chi-square statistic suggests that the deviation from Benford’s law is just what we’d expect from random data following Benford’s law. And as we said before, this suggestion turns out to be correct.
This post uses a hypothesis test for proportions to look at a couple conjectures in number theory. It is similar to my earlier post on the chi-square test and prime remainders. You could read this as a post on statistics or a post on number theory, depending on which you’re less familiar with.
Using statistical tests on number theory problems is kind of odd. There’s nothing random going on, so in that since the whole enterprise is unjustified. Nevertheless, statistical tests can be suggestive. They certainly don’t prove theorems, but they can give reason to suspect a theorem is true or false. In that sense, applying statistical tests to number theory isn’t all that different from applying them to more traditional settings.
First we’ll look at the remainders of primes modulo 4. Except for 2, all primes are odd, and so they either have remainder 1 or 3 when divided by 4. Brian Hayes wrote recently that Chebyshev noticed in the 1850’s that there seems to be more primes with remainder 3. Is the imbalance larger than one would expect to see from fair coin tosses?
Here’s some Python code to find the proportion of the first million primes (after 2) that have remainder 3 when divided by 4.
from sympy import prime
from scipy import sqrt
n = 1000000
rem3 = 0
for i in range (2, n+2):
if prime(i) % 4 == 3:
rem3 += 1
p_hat = rem3/n
This shows that of the first million odd primes, 500,202 are congruent to 3 mod 4. Would it be unusual for a coin to come up heads this many times in a million flips? To find out we’d compute the z-statistic:
Here p is the proportion under the null hypothesis, q = 1 – p, and n is the sample size. In our case, the null hypothesis is p = 0.5 and n = 1,000,000. 
p = 0.5
q = 1 - p
z = (p_hat - p)/sqrt(p*q/n)
shows that z = 0.404, hardly a suspiciously large value. If we were looking at random values we’d see a z-score this large or larger 34% of the time. So in this case doesn’t suggest much.
 The derivation of the z statistic is fairly quick. If the proportion of successes is p, then the number of successes out of n trials is binomial(n, p). For large n, this is has approximately the same distribution as a normal distribution with the same mean and variance, mean np and variance npq. The proportion of successes then has approximately mean p and standard deviation √(pq/n). Subtracting the mean and dividing by the standard deviation normalizes the distribution to have mean 0 and variance 1. So under the null hypothesis, the z statistic has a standard normal distribution.