Someone wrote to me the other day asking if I could explain a probability example from the Wall Street Journal. (“Proving Investment Success Takes Time,” Spencer Jakab, November 25, 2017.)

Victor Haghani … and two colleagues told several hundred acquaintances who worked in finance that they would flip two coins, one that was normal and the other that was weighted so it came up heads 60% of the time. They asked the people how many flips it would take them to figure out, with a 95% confidence level, which one was the 60% coin. Told to give a “quick guess,” nearly a third said fewer than 10 flips, while the median response was 40. The correct answer is 143.

The anecdote is correct in spirit: it takes longer to discover the better of two options than most people suppose. But it’s jarring to read the answer is precisely 143 when the question hasn’t been stated clearly.

How many flips would it take to figure out which coin is better with a 95% confidence level? For starters, the answer would have to be a distribution, not a single number. You might quickly come to the right conclusion. You *might* quickly come to the *wrong* conclusion. You might flip coins for a long time and never come to a conclusion. Maybe there is a way of formulating the problem so that the *expected value* of the distribution is 143.

How are you to go about flipping the coins? Do you flip both of them, or just flip one coin? For example, you might flip both coins until you are confident that one is better, and conclude that the better one is the one that was designed to come up heads 60% of the time. Or you could just flip one of them and test the hypothesis Prob(heads) = 0.5 versus the alternative Prob(heads) = 0.6. Or maybe you flip one coin two times for every one time you flip the other. Etc.

What do you mean by “95% confidence level”? Is this a frequentist confidence interval? And do you compute the (Bayesian) predictive probability of arriving at such a confidence level? Are you computing the (Bayesian) posterior model probabilities of two models, one in which the first coin has probability of heads 0.5 and the second has probability 0.6 versus the opposite model?

Do you assume that you know a priori that one coin has probability of heads 0.5 and the other 0.6, or do you not assume this and just want to find the coin with higher probability of heads, and evaluate such a model when in fact the probabilities of heads are as stated?

Are you conducting an experiment with a predetermined sample size of 143? Or are you continuously monitoring the data, stopping when you reach your conclusion?

I leave it as an exercise to the reader to implement the various alternatives suggested above and see whether one of them produces 143 as a result. (I did a back-of-the-envelope calculation that suggests there is one.) So the first question is to reverse engineer which problem statement the article was based on. The second question is to decide which problem formulation you believe would be most appropriate in the context of the article.
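As a starting point for that exercise, here is a minimal sketch of just one of the formulations above: flip both coins *n* times each and pick the coin that shows more heads. The function name `prob_correct` and the rule of guessing at random on ties are assumptions made for illustration. Nothing here claims to be the formulation behind the article's 143.

```python
import numpy as np
from scipy.stats import binom

def prob_correct(n, p_fair=0.5, p_biased=0.6):
    """Probability of picking the biased coin after n flips of each coin.

    Rule (an assumption): choose the coin with more heads; guess on ties.
    """
    k = np.arange(n + 1)
    pmf_fair = binom.pmf(k, n, p_fair)
    pmf_biased = binom.pmf(k, n, p_biased)
    # P(fair coin shows fewer than k heads), for each possible count k
    cdf_fair_below = binom.cdf(k - 1, n, p_fair)
    p_win = np.sum(pmf_biased * cdf_fair_below)
    p_tie = np.sum(pmf_biased * pmf_fair)
    return p_win + 0.5 * p_tie

for n in (10, 40, 143):
    print(n, round(prob_correct(n), 4))
```

Other formulations (flipping only one coin, sequential stopping, Bayesian model comparison) would need different code and can give quite different sample sizes.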

I will discuss, among other things, when common sense applies and when correct analysis can be counter-intuitive. There will be ample time at the end of the presentation for Q & A.

If you’re interested in attending, you can register here.

When I saw a tweet from Tim Hopper a little while ago, my first thought was “How many bits of PII is that?” [1]

π Things Only Left Handed Introverts Over 6′ 5″ with O+ Blood Type Will Appreciate

— Tim Hopper (@tdhopper) November 16, 2014

Let’s see. There’s some small correlation between these characteristics, but let’s say they’re independent. (For example, someone over 6′ 5″ is most likely male, and a larger percentage of males than females are left handed. But we’ll let that pass. This is just back-of-the-envelope reckoning.)

About 10% of the population is left-handed (11% for men, 9% for women) and so left-handedness carries -log_{2}(0.1) = 3.3 bits of information.

I don’t know how many people identify as introverts. I believe I’m a mezzovert, somewhere between introvert and extrovert, but I imagine when asked most people would pick “introvert” or “extrovert,” maybe half each. So we’ve got about one bit of information from knowing someone is an introvert.

The most common blood type in the US is O+ at 37% and so that carries 1.4 bits of information. (AB-, the most rare, corresponds to 7.4 bits of information. On average, blood type carries 2.2 bits of information in the US.)

What about height? Adult heights are approximately normally distributed, but not exactly. The normal approximation breaks down in the extremes, and we’re headed that way, but as noted above, this is just a quick and dirty calculation.

Heights in general don’t follow a normal distribution, but heights for men and women separately are approximately normal. So for the general (adult) population, height follows a mixture distribution. Assume the average height for women is 64 inches, the average for men is 70 inches, and both have a standard deviation of 3 inches. Then 6′ 5″ (77 inches) is 2.33 standard deviations above the mean for men, so the probability of a man being taller than 6′ 5″ would be about 0.01, and the probability of a woman being that tall would be essentially zero [2]. So the probability that a person is over 6′ 5″ would be about 0.005, corresponding to about 7.6 bits of information.

All told, there are about 13.4 bits of information in the tweet above, as much information as you’d get after 13 or 14 questions of the game Twenty Questions, assuming all your questions are independent and have probability 1/2 of being answered affirmatively.
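The tally can be reproduced with a short script. This is only a sketch under the assumptions stated in the text (independent attributes, an even split of men and women, normal heights within each group); the variable names are made up for illustration.

```python
from math import log2
from scipy.stats import norm

def bits(p):
    """Information, in bits, of learning a fact that has probability p."""
    return -log2(p)

# Assumed figures, taken from the text above
p_left = 0.10
p_introvert = 0.5
p_o_positive = 0.37

# Height: equal mix of men and women, normal within each group
threshold = 77  # 6'5" in inches
p_man = norm.sf(threshold, loc=70, scale=3)
p_woman = norm.sf(threshold, loc=64, scale=3)
p_tall = 0.5 * p_man + 0.5 * p_woman

parts = {"left-handed": p_left, "introvert": p_introvert,
         "O+": p_o_positive, "over 6'5\"": p_tall}
for name, p in parts.items():
    print(f"{name:12s} p = {p:.4g}  bits = {bits(p):.2f}")

total = sum(bits(p) for p in parts.values())
print("total bits:", round(total, 1))
```

Changing the assumed means or standard deviation moves the height term a lot, which is worth keeping in mind when reading the total.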

***

[1] PII = Personally Identifiable Information

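[2] There are certainly women at least 6′ 5″. I can think of at least one woman I know who may be that tall. So the probability shouldn’t be less than 1 in 7 billion. With the means and standard deviation assumed above, the normal approximation gives a probability of about 7 × 10^{-6} for women, small enough to ignore next to the figure for men. Tail probabilities this far out are very sensitive to the assumed parameters, so this is where the normal distribution assumption is least trustworthy.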

for *x* ≥ 1 where *a* > 0 is a shape parameter. The Pareto distribution and the Pareto principle (i.e. “80-20” rule) are named after the same person, the Italian economist Vilfredo Pareto.

Samples from a Pareto distribution obey Benford’s law in the limit as the parameter *a* goes to zero. That is, the smaller the parameter *a*, the more closely the distribution of the first digits of the samples follows the distribution known as Benford’s law.

Here’s an illustration of this comparing the distribution of 1,000 random samples from a Pareto distribution with shape *a* = 1 and shape *a* = 0.2 with the counts expected under Benford’s law.

Note that this has nothing to do with base 10 per se. If we look at the leading digits as expressed in any other base, such as base 16 below, we see the same pattern.
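Here is a sketch of how such a comparison could be generated. The helper names `leading_digit_freq` and `benford` and the seed are assumptions for illustration, and the code prints frequencies rather than drawing a plot. NumPy's `pareto` samples the Lomax distribution, so 1 is added to get samples with *x* ≥ 1.

```python
import numpy as np

def leading_digit_freq(samples, base=10):
    """Relative frequency of each leading digit 1..base-1 in the given base."""
    # The fractional part of log_base(x) determines the leading digit.
    frac = np.mod(np.log(samples) / np.log(base), 1.0)
    digits = np.floor(base ** frac).astype(int)
    digits = np.clip(digits, 1, base - 1)  # guard against rounding at the edges
    counts = np.bincount(digits, minlength=base)[1:base]
    return counts / len(samples)

def benford(base=10):
    """Leading-digit probabilities under Benford's law."""
    d = np.arange(1, base)
    return np.log(1 + 1 / d) / np.log(base)

rng = np.random.default_rng(20171125)  # arbitrary seed
n = 1000
for a in (1.0, 0.2):
    # Shift the Lomax samples so that they start at 1.
    x = rng.pareto(a, n) + 1
    print("a =", a, np.round(leading_digit_freq(x), 3))
print("Benford", np.round(benford(), 3))
```

Passing `base=16` to both functions gives the corresponding comparison for hexadecimal leading digits.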

**More posts on Benford’s law**

- Weibull distribution and Benford’s law
- Benford’s law, chi-square, and factorials
- Benford’s law and SciPy constants


Here are some posts on testing a uniform RNG.

Here’s a book chapter I wrote on testing the transformation of a uniform RNG into some other distribution.

A few posts on manipulating a random number generator.

- Manipulating a random number generator
- Reverse engineering the seed of an LCG
- Predicting when an RNG will output a given value

And finally, a post on a cryptographically secure random number generator.

For more on the beta-binomial model itself, see A Bayesian view of Amazon Resellers and Functional Folds and Conjugate Models.

I mentioned in a recent post that the Kullback-Leibler divergence from the prior distribution to the posterior distribution is a measure of how much information was gained.

Here’s a little Python code for computing this. Enter the *a* and *b* parameters of the prior and the posterior to compute how much information was gained.

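```python
from numpy import log2  # recent versions of SciPy no longer export log2
from scipy.integrate import quad
from scipy.stats import beta

def infogain(post_a, post_b, prior_a, prior_b):
    # Densities of the posterior (p) and the prior (q)
    p = beta(post_a, post_b).pdf
    q = beta(prior_a, prior_b).pdf

    # KL divergence in bits: integral of p log2(p/q) over [0, 1]
    (info, error) = quad(lambda x: p(x) * log2(p(x) / q(x)), 0, 1)
    return info
```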

This code works well for medium-sized inputs. It has problems with large inputs because the generic integration routine `quad` needs some help when the beta distributions become more concentrated.
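One way to sidestep numerical integration for large inputs, offered as a sketch and not as what the code above does: the Kullback-Leibler divergence between two beta distributions has a closed form in terms of the log beta function and the digamma function. The function name `infogain_closed_form` is hypothetical.

```python
from math import log
from scipy.special import betaln, digamma

def infogain_closed_form(post_a, post_b, prior_a, prior_b):
    """KL divergence, in bits, from a beta prior to a beta posterior."""
    nats = (betaln(prior_a, prior_b) - betaln(post_a, post_b)
            + (post_a - prior_a) * digamma(post_a)
            + (post_b - prior_b) * digamma(post_b)
            + (prior_a - post_a + prior_b - post_b) * digamma(post_a + post_b))
    return nats / log(2)  # convert nats to bits

# Parameters large enough to make the densities very concentrated
print(infogain_closed_form(4000, 7000, 3000, 7000))
```

Because it works with logarithms of the beta function throughout, this version avoids evaluating sharply peaked densities.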

You can see that surprising input carries more information. For example, suppose your prior is beta(3, 7). This distribution has a mean of 0.3 and so you’re expecting more failures than successes. With such a prior, a success changes your mind more than a failure does. You can quantify this by running these two calculations.

```python
print( infogain(4, 7, 3, 7) )
print( infogain(3, 8, 3, 7) )
```

The first line shows that a success would change your information by 0.1563 bits, while the second shows that a failure would change it by 0.0297 bits.


Companies often use an old version of their production database for testing. But what if the production database has **sensitive information** that software developers and testers should not have access to?

You can’t completely remove customer phone numbers from the database, for example, if your software handles customer phone numbers. You have to replace sensitive data with modified data. The question becomes how to modify it. Three approaches would be

- Use the original data.
- Generate completely new artificial data.
- Use the real data as a guide to generating new data.

We’ll assume the first option is off the table and consider the pros and cons of the other two options.

For example, suppose you collect customer ages. You could replace customer age with a random two-digit number. That’s fine as far as making sure that forms can display two-digit numbers. But maybe the age values matter. Maybe you want your fictional customers in the test database to have the same age distribution as your real customers. Or maybe you want your fictional customer ages to be correlated with other attributes so that you don’t have 11 year-old retirees or 98 year-old clients who can’t legally purchase alcohol.

There are pros and cons to having a realistic test database. A database filled with randomly generated data is likely to find **more bugs**, but a realistic database is likely to find **more important bugs**.

Randomly generated data may contain combinations that have yet to occur in the production data, combinations that will cause an error when they do come up in production. Maybe you’ve never sold your product to someone in Louisiana, and there’s a latent bug that will show up the first time someone from Louisiana does order. (For example, Louisiana retains vestiges of French law that make it different from all other states.)

On the other hand, randomly generated data may not find the bugs that affect the most customers. You might want the values in your test database to be distributed similarly to the values in real data so that bugs come up in testing with roughly the same frequency as in production. In that case, you probably want the *joint* distributions to match and not just the *unconditional* distributions. If you just match the latter, you could run into oddities such as a large number of teenage retirees as mentioned above.
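To make the distinction concrete, here is a small sketch with entirely made-up data. It shuffles each column independently, which keeps every column's own distribution intact but breaks the relationship between columns. The column names, the retirement rule, and the helper `young_retirees` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
n = 10_000

# Made-up "real" data: ages 18-90, and retirement only possible from age 60.
age = rng.integers(18, 91, size=n)
retired = (age >= 60) & (rng.random(n) < 0.8)

# Shuffle each column independently.
# Each column keeps its own distribution, but the link between them is lost.
fake_age = rng.permutation(age)
fake_retired = rng.permutation(retired)

def young_retirees(age, retired, cutoff=30):
    """Count records that are retired and younger than the cutoff age."""
    return int(np.sum(retired & (age < cutoff)))

print("real data:        ", young_retirees(age, retired))
print("shuffled columns: ", young_retirees(fake_age, fake_retired))
```

The real data has no young retirees by construction, while the shuffled data has many, even though the age and retirement columns each look exactly as before.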

So do you want a random test database or a realistic test database? Maybe both. It depends on your purposes and priorities. You might want to start by testing against a realistic database so that you first find the bugs that are likely to affect the greatest number of customers. Then maybe you switch to a randomized database that is more effective at flushing out problems with edge cases.

So how would you go about creating a realistic test database that protects customer privacy? The answer depends on several factors. First of all, it depends on what aspects of the real data you want to preserve. Maybe verisimilitude is more important for some fields than others. Once you decide what aspects you want your test database to approximate, how well do you need to approximate them? If you want to do valid statistical analysis on the test database, you may need something sophisticated like **differential privacy**. But if you just want moderately realistic test cases, you can do something much simpler.

Finally, you have to address your **privacy-utility trade-off**. What kinds of privacy protection are you ethically and legally obligated to provide? For example, is your data considered PHI under HIPAA regulation? Once your privacy obligations are clear, you look for ways to maximize your utility subject to these privacy constraints.

If you’d like help with this process, let’s talk. I can help you determine what your obligations are and how best to meet them while meeting your business objectives.

In the previous post we looked at a simple randomization procedure to obscure individual responses to yes/no questions in a way that retains the statistical usefulness of the data. In this post we’ll generalize that procedure, quantify the privacy loss, and discuss the utility/privacy trade-off.

Suppose we have a binary response to some question as a field in our database. With probability *t* we leave the value alone. Otherwise we replace the answer with the result of a fair coin toss. In the previous post, what we now call *t* was implicitly equal to 1/2. The value recorded in the database could have come from a coin toss and so the value is not definitive. And yet it does contain some information. The posterior probability that the original answer was 1 (“yes”) is higher if a 1 is recorded. We did this calculation for *t* = 1/2 last time, and here we’ll look at the result for general *t*.

If *t* = 0, the recorded result is always random. The field contains no private information, but it is also statistically useless. At the opposite extreme, *t* = 1, the recorded result is pure private information and statistically useful. The closer *t* is to 0, the more privacy we have, and the closer *t* is to 1, the more useful the data is. We’ll quantify this privacy/utility trade-off below.

You can go through an exercise in applying Bayes’ theorem as in the previous post to show that the probability that the original response is 1, given that the recorded response is 1, is

$$P(\text{original} = 1 \mid \text{recorded} = 1) = \frac{(1 + t)\,p}{2tp + 1 - t}$$

where *p* is the overall probability of a true response of 1.

The **privacy loss** associated with an observation of 1 is the gain in information due to that observation. Before knowing that a particular response was 1, our estimate that the true response was 1 would be *p*; not having any individual data, we use the group mean. But after observing a recorded response of 1, the posterior probability is the expression above. The information gain is the log base 2 of the ratio of these values:

$$\log_2 \frac{(1 + t)\,p \,/\, (2tp + 1 - t)}{p} = \log_2 \frac{1 + t}{2tp + 1 - t}$$

When *t* = 0, the privacy loss is 0. When *t* = 1, the loss is -log_{2}(*p*) bits, i.e. the entire information contained in the response. When *t* = 1/2, the loss is log_{2}(3/(2*p* + 1)) bits.

We’ve looked at the privacy cost of setting *t* to various values. What are the statistical costs? Why not make *t* as small as possible? Well, 0 is a possible value of *t*, corresponding to complete loss of statistical utility. So we’d expect that small positive values of *t* make it harder to estimate *p*.

Each recorded response is a 1 with probability *tp* + (1 – *t*)/2. Suppose there are *N* database records and let *S* be the sum of the recorded values. Then our estimator for *p* is

$$\hat{p} = \frac{S/N - (1 - t)/2}{t}$$

The variance of this estimator is *q*(1 – *q*)/(*N* *t*²), where *q* = *tp* + (1 – *t*)/2 is the probability above. Since *q*(1 – *q*) is at most 1/4, the width of our confidence intervals for *p* is roughly proportional to 1/(*t*√*N*). Note that the larger *N* is, the smaller we can afford to make *t*.
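The effect of *t* on the estimate can be seen in a small simulation. This is a sketch with made-up values for *p* and *N*; the function names `randomize` and `estimate_p` are hypothetical. The estimator simply inverts the statement that each recorded response is a 1 with probability *tp* + (1 – *t*)/2.

```python
import numpy as np

def randomize(responses, t, rng):
    """Keep each response with probability t; otherwise record a fair coin toss."""
    keep = rng.random(len(responses)) < t
    coin = rng.integers(0, 2, size=len(responses))
    return np.where(keep, responses, coin)

def estimate_p(recorded, t):
    """Invert E[S/N] = t*p + (1 - t)/2 to estimate p."""
    return (recorded.mean() - (1 - t) / 2) / t

rng = np.random.default_rng(7)  # arbitrary seed
p_true, N = 0.3, 10_000         # made-up values
truth = (rng.random(N) < p_true).astype(int)

for t in (0.9, 0.5, 0.1):
    # Re-randomize the same true responses many times to see the spread
    estimates = [estimate_p(randomize(truth, t, rng), t) for _ in range(200)]
    print(f"t = {t}: mean estimate {np.mean(estimates):.3f}, "
          f"spread {np.std(estimates):.4f}")
```

The mean estimate stays near the true proportion for each *t*, while the spread grows as *t* shrinks, which is the cost of the added privacy.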

Previous related posts:

- Randomized response, privacy, and Bayes theorem
- Quantifying the information content of personal data
- HIPAA de-identification
- Information theory
- Database anonymization

**Next up**: Adding Laplace or Gaussian noise and differential privacy

Suppose you want to gather data on an incriminating question. For example, maybe a statistics professor would like to know how many students cheated on a test. Being a statistician, the professor has a clever way to find out what he wants to know while giving each student deniability.

Each student is asked to flip two coins. If the first coin comes up heads, the student answers the question truthfully, yes or no. Otherwise the student reports “yes” if the second coin came up heads and “no” if it came up tails. Every student has deniability because each “yes” answer may have come from an innocent student who flipped tails on the first coin and heads on the second.

How can the professor estimate *p*, the proportion of students who cheated? Around half the students will get a head on the first coin and answer truthfully; the rest will look at the second coin and answer yes or no with equal probability. So the expected proportion of yes answers is *Y* = 0.5*p* + 0.25, and we can estimate *p* as 2*Y* – 0.5.
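The protocol is easy to simulate. In this sketch the class size and the true proportion of cheaters are made-up numbers, chosen large enough that the estimate is stable.

```python
import numpy as np

rng = np.random.default_rng(2017)  # arbitrary seed
n_students = 100_000               # made-up class size, large for a stable estimate
p_true = 0.2                       # made-up proportion of cheaters

cheated = rng.random(n_students) < p_true
first_heads = rng.random(n_students) < 0.5
second_heads = rng.random(n_students) < 0.5

# Heads on the first coin: answer truthfully. Tails: report the second coin.
answer = np.where(first_heads, cheated, second_heads)

Y = answer.mean()
p_hat = 2 * Y - 0.5
print("proportion of yes answers:", round(Y, 4))
print("estimate of p:", round(p_hat, 4))
```

With a realistically small class the estimate would be much noisier, and it can even fall below 0 or above 1.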

The calculations above assume that everyone complied with the protocol, which may not be reasonable. If everyone were honest, there’d be no reason for this exercise in the first place. But we could imagine another scenario. Someone holds a database with identifiers and answers to a yes/no question. The owner of the database could follow the procedure above to introduce randomness in the data before giving the data over to someone else.

What can we infer from someone’s randomized response to the cheating question? There’s nothing you can infer with *certainty*; that’s the point of introducing randomness. But that doesn’t mean that the answers contain no information. If we completely randomized the responses, dispensing with the first coin flip, *then* the responses would contain no information. The responses *do* contain information, but not enough to be incriminating.

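Let *C* be a random variable representing whether someone cheated, and let *R* be their response, following the randomization procedure above. Given a response *R* = 1, what is the probability that *C* = 1, i.e. that someone cheated? This is a classic application of Bayes’ theorem.

$$P(C = 1 \mid R = 1) = \frac{P(R = 1 \mid C = 1)\,P(C = 1)}{P(R = 1)} = \frac{\tfrac{3}{4}\,p}{\tfrac{1}{2}\,p + \tfrac{1}{4}} = \frac{3p}{2p + 1}$$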

If we didn’t know someone’s response, we would estimate their probability of having cheated as *p*, the group average. But knowing that their response was “yes” we update our estimate to 3*p* / (2*p* + 1). At the extremes of *p* = 0 and *p* = 1 these coincide. But for any value of *p* strictly between 0 and 1, our estimate goes up. That is, the probability that someone cheated, conditional on knowing they responded “yes”, is higher than the unconditional probability. In symbols, we have

$$P(C = 1 \mid R = 1) > P(C = 1)$$

when 0 < *p* < 1. The difference between the left and right sides above is maximized when *p* = (√3 – 1)/2 = 0.366. That is, a “yes” response tells us the most when about 1/3 of the students cheated. When *p* = 0.366, *P*(*C* = 1 | *R* = 1) = 0.634, i.e. the posterior probability is almost twice the prior probability.

You could go through a similar exercise with Bayes’ theorem to show that *P*(*C* = 1 | *R* = 0) = *p*/(3 – 2*p*), which is less than *p* provided 0 < *p* < 1. So if someone answers “yes” to cheating, that does make it more likely that they actually cheated, but not so much more that you can justly accuse them of cheating. (Unless *p* = 1, in which case you’re in the realm of logic rather than probability: if everyone cheated, then you can conclude that any individual cheated.)
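The two posterior probabilities, and the value of *p* where a “yes” is most informative, can be checked numerically. This is a sketch; the function names `posterior_yes` and `posterior_no` are hypothetical, and the maximum is found by a simple grid search.

```python
import numpy as np

def posterior_yes(p):
    """P(C = 1 | R = 1) under the two-coin protocol."""
    return 3 * p / (2 * p + 1)

def posterior_no(p):
    """P(C = 1 | R = 0) under the two-coin protocol."""
    return p / (3 - 2 * p)

p = np.linspace(0, 1, 100_001)
gain = posterior_yes(p) - p
p_star = p[np.argmax(gain)]

print("p maximizing the update:", round(p_star, 3))
print("closed form (sqrt(3) - 1)/2:", round((np.sqrt(3) - 1) / 2, 3))
print("posterior at that p:", round(posterior_yes(p_star), 3))
```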

**Update**: See the next post for a more general randomization scheme and more about the trade-off between privacy and utility. The post after that gives an overview of randomization for more general kinds of data.

If you would like help with database de-identification, please let me know.

If the answer to a question has probability *p*, then it contains -log_{2} *p* **bits of information**. Knowing someone’s sex gives you 1 bit of information because -log_{2}(1/2) = 1.

Knowing whether someone can roll their tongue could give you more or less information than knowing their sex. Estimates vary, but say 75% can roll their tongue. Then knowing that someone *can* roll their tongue gives you 0.415 bits of information, but knowing that they *cannot* roll their tongue gives you 2 bits of information.

On *average*, knowing someone’s tongue rolling ability gives you less information than knowing their sex. The average amount of information, or **entropy**, is

0.75(-log_{2} 0.75) + 0.25(-log_{2} 0.25) = 0.81.

Entropy is maximized when all outcomes are equally likely. But for identifiability, we’re concerned with maximum information as well as average information.

Knowing someone’s zip code gives you a variable amount of information, less for densely populated zip codes and more for sparsely populated zip codes. An average zip code contains about 7,500 people. If we assume a US population of 326,000,000, this means a typical zip code would give us about 15.4 bits of information.

The Safe Harbor provisions of US HIPAA regulations let you use the first three digits of someone’s zip code except when this would represent less than 20,000 people, as it would in several sparsely populated areas. Knowing that an American lives in a region of 20,000 people would give you 14 bits of information about that person.

Birth dates are complicated because age distribution is uneven. Knowing that someone’s birth date was over a century ago is highly informative, much more so than knowing it was a couple decades ago. That’s why the Safe Harbor provisions do not allow including age, much less birth date, for people over 90.

Birthdays are simpler than birth dates. Birthdays are not perfectly evenly distributed throughout the year, but they’re close enough for our purposes. If we ignore leap years, a birthday contains -log_{2}(1/365) or about 8.5 bits of information. If we consider leap years, knowing someone was born on a leap day gives us two extra bits of information.

Independent information is additive. I don’t expect there’s much correlation between sex, geographical region, and birthday, so you could add up the bits from each of these information sources. So if you know someone’s sex, their zip code (assuming 7,500 people), and their birthday (not a leap day), then you have 25 bits of information, which may be enough to identify them.
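The arithmetic behind this total is short enough to script. This sketch uses the population and zip code figures assumed in the text; the names are chosen for illustration.

```python
from math import log2

US_POPULATION = 326_000_000  # figure assumed in the text

def bits(p):
    """Information, in bits, of an answer that has probability p."""
    return -log2(p)

sex = bits(1 / 2)
zip_code = bits(7_500 / US_POPULATION)
birthday = bits(1 / 365)
total = sex + zip_code + birthday

print(f"sex {sex:.1f}, zip code {zip_code:.1f}, birthday {birthday:.1f}")
print(f"total {total:.1f} bits")
# For comparison: bits needed to single out one person in the population
print(f"needed {log2(US_POPULATION):.1f} bits")
```

The last line gives a rough yardstick: the closer the total gets to the number of bits needed to single out one person, the fewer people share the same combination of attributes.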

This post didn’t consider correlated information. For example, suppose you know someone’s zip code and primary language. Those two pieces of information together don’t provide as much information as the sum of the information they provide separately because language and location are correlated. I may discuss the information content of correlated information in a future post. (**Update**: Here is a post on correlated pairs of data.)

**Related**: HIPAA de-identification

This article gives the following example. Suppose beauty and acting ability were uncorrelated. Knowing how attractive someone is would give you no advantage in guessing their acting ability, and vice versa. Suppose further that successful actors have a combination of beauty and acting ability. Then among successful actors, the beautiful would tend to be poor actors, and the unattractive would tend to be good actors.

Here’s a little Python code to illustrate this. We take two independent attributes, distributed like IQs, i.e. normal with mean 100 and standard deviation 15. As the sum of the two attributes increases, the correlation between the two attributes becomes more negative.

```python
from numpy import arange
from scipy.stats import norm, pearsonr
import matplotlib.pyplot as plt

# Correlation.
# The function pearsonr returns correlation and a p-value.
def corr(x, y):
    return pearsonr(x, y)[0]

x = norm.rvs(100, 15, 10000)
y = norm.rvs(100, 15, 10000)
z = x + y

span = arange(80, 260, 10)
c = [ corr( x[z > low], y[z > low] ) for low in span ]

plt.plot( span, c )
plt.xlabel( "minimum sum" )
plt.ylabel( "correlation coefficient" )
plt.show()
```

GCHQ in the ’70s, we thought of ourselves as completely Bayesian statisticians. All our data analysis was completely Bayesian, and that was a direct inheritance from Alan Turing. I’m not sure this has ever really been published, but Turing, almost as a sideline during his cryptoanalytic work, reinvented Bayesian statistics for himself. The work against Enigma and other German ciphers was fully Bayesian. …

Bayesian statistics was an *extreme* minority discipline in the ’70s. In academia, I only really know of two people who were working majorly in the field, Jimmy Savage … in the States and Dennis Lindley in Britain. And they were regarded as fringe figures in the statistics community. It’s extremely different now. The reason is that Bayesian statistics *works*. So eventually truth will out. There are many, many problems where Bayesian methods are obviously the right thing to do. But in the ’70s we understood that already in Britain in the classified environment.

The journal article announcing PCG gives the results of testing it with the TestU01 test suite. I wanted to try it out by testing it with the DIEHARDER test suite (Robert G. Brown’s extension of George Marsaglia’s DIEHARD test suite) and the NIST Statistical Test Suite. I used what the generator’s website calls the “minimal C implementation.”

The preprint of the journal article is dated 2015 but apparently hasn’t been published yet.

**Update**: See the very informative note by the author of PCG in the comments below.

For the NIST test suite, I generated 10,000,000 bits and divided them into 10 streams.

For the DIEHARDER test suite, I generated 800,000,000 unsigned 32-bit integers. (DIEHARDER requires a **lot** of random numbers as input.)

For both test suites I used the seed (`state`) 20170707105851 and sequence constant (`inc`) 42.

The PCG generator did well on all the NIST tests. For every test, at least 9 out of 10 streams passed. The test authors say you should expect at least 8 out of 10 streams to pass.

Here’s an excerpt from the results. You can find the full results here.

```
--------------------------------------------------------
 C1  C2  C3 ... C10   P-VALUE  PROPORTION  STATISTICAL TEST
--------------------------------------------------------
  2   0   2       0  0.213309    10/10     Frequency
  0   0   1       3  0.534146    10/10     BlockFrequency
  3   0   0       0  0.350485    10/10     CumulativeSums
  1   1   0       2  0.350485    10/10     CumulativeSums
  0   2   2       1  0.911413    10/10     Runs
  0   0   1       1  0.534146    10/10     LongestRun
  0   1   2       0  0.739918    10/10     Rank
  0   4   0       0  0.122325    10/10     FFT
  1   0   0       1  0.000439    10/10     NonOverlappingTemplate
...
  2   1   0       0  0.350485     9/10     NonOverlappingTemplate
  0   2   1       0  0.739918    10/10     OverlappingTemplate
  1   1   0       2  0.911413    10/10     Universal
  1   1   0       0  0.017912    10/10     ApproximateEntropy
  1   0   1       1      ----     3/4      RandomExcursions
...
  0   0   0       1      ----     4/4      RandomExcursions
  2   0   0       0      ----     4/4      RandomExcursionsVariant
...
  0   0   3       0      ----     4/4      RandomExcursionsVariant
  1   2   3       0  0.350485     9/10     Serial
  1   1   1       0  0.739918    10/10     Serial
  1   2   0       0  0.911413    10/10     LinearComplexity
...
```

The DIEHARDER suite has 31 kinds of tests, some of which are run many times, making a total of 114 tests. Out of the 114 tests, two returned a weak pass for the PCG input and all the rest passed. A few weak passes are to be expected from running so many tests and so this isn’t a strike against the generator. In fact, it might be suspicious if no tests returned a weak pass.

Here’s an edited version of the results. The full results are here.

```
#=============================================================================#
        test_name   |ntup| tsamples |psamples|  p-value |Assessment
#=============================================================================#
   diehard_birthdays|   0|       100|     100|0.46682782|  PASSED
      diehard_operm5|   0|   1000000|     100|0.83602120|  PASSED
  diehard_rank_32x32|   0|     40000|     100|0.11092547|  PASSED
    diehard_rank_6x8|   0|    100000|     100|0.78938803|  PASSED
   diehard_bitstream|   0|   2097152|     100|0.81624396|  PASSED
        diehard_opso|   0|   2097152|     100|0.95589325|  PASSED
        diehard_oqso|   0|   2097152|     100|0.86171368|  PASSED
         diehard_dna|   0|   2097152|     100|0.24812341|  PASSED
diehard_count_1s_str|   0|    256000|     100|0.75417270|  PASSED
diehard_count_1s_byt|   0|    256000|     100|0.25725000|  PASSED
 diehard_parking_lot|   0|     12000|     100|0.59288414|  PASSED
    diehard_2dsphere|   2|      8000|     100|0.79652706|  PASSED
    diehard_3dsphere|   3|      4000|     100|0.14978100|  PASSED
     diehard_squeeze|   0|    100000|     100|0.35356584|  PASSED
        diehard_sums|   0|       100|     100|0.04522121|  PASSED
        diehard_runs|   0|    100000|     100|0.39739835|  PASSED
        diehard_runs|   0|    100000|     100|0.99128296|  PASSED
       diehard_craps|   0|    200000|     100|0.64934221|  PASSED
       diehard_craps|   0|    200000|     100|0.27352733|  PASSED
 marsaglia_tsang_gcd|   0|  10000000|     100|0.10570816|  PASSED
 marsaglia_tsang_gcd|   0|  10000000|     100|0.00267789|   WEAK
         sts_monobit|   1|    100000|     100|0.98166534|  PASSED
            sts_runs|   2|    100000|     100|0.05017630|  PASSED
          sts_serial|   1|    100000|     100|0.95153782|  PASSED
...
          sts_serial|  16|    100000|     100|0.59342390|  PASSED
         rgb_bitdist|   1|    100000|     100|0.50763759|  PASSED
...
         rgb_bitdist|  12|    100000|     100|0.98576422|  PASSED
rgb_minimum_distance|   2|     10000|    1000|0.23378443|  PASSED
...
rgb_minimum_distance|   5|     10000|    1000|0.13215367|  PASSED
    rgb_permutations|   2|    100000|     100|0.54142546|  PASSED
...
    rgb_permutations|   5|    100000|     100|0.96040216|  PASSED
      rgb_lagged_sum|   0|   1000000|     100|0.66587166|  PASSED
...
      rgb_lagged_sum|  31|   1000000|     100|0.00183752|   WEAK
      rgb_lagged_sum|  32|   1000000|     100|0.13582393|  PASSED
     rgb_kstest_test|   0|     10000|    1000|0.74708548|  PASSED
     dab_bytedistrib|   0|  51200000|       1|0.30789191|  PASSED
             dab_dct| 256|     50000|       1|0.89665788|  PASSED
        dab_filltree|  32|  15000000|       1|0.67278231|  PASSED
        dab_filltree|  32|  15000000|       1|0.35348003|  PASSED
       dab_filltree2|   0|   5000000|       1|0.18749029|  PASSED
       dab_filltree2|   1|   5000000|       1|0.92600020|  PASSED
```

MCMC (Markov Chain Monte Carlo) gives us a way around this impasse. It lets us draw samples from practically any probability distribution. But there’s a catch: the samples are not independent. This lack of independence means that all the familiar theory on convergence of sums of random variables goes out the window.

There’s not much theory to guide assessing the convergence of sums of MCMC samples, but there are heuristics. One of these is **effective sample size** (ESS). The idea is to have a sort of “exchange rate” between dependent and independent samples. You might want to say, for example, that 1,000 samples from a certain Markov chain are worth about as much as 80 independent samples because the MCMC samples are highly correlated. Or you might want to say that 1,000 samples from a different Markov chain are worth about as much as 300 independent samples because although the MCMC samples are dependent, they’re weakly correlated.

Here’s the definition of ESS:

$$\text{ESS} = \frac{n}{1 + 2\sum_{k=1}^{\infty} \rho(k)}$$

where *n* is the number of samples and ρ(*k*) is the correlation at lag *k*.

This behaves well in the extremes. If your samples are independent, your effective sample size equals the actual sample size. If the correlation at lag *k* decreases extremely slowly, so slowly that the sum in the denominator diverges, your effective sample size is zero.

Any reasonable Markov chain is between the extremes. Zero lag correlation is too much to hope for, but ideally the correlations die off fast enough that the sum in the denominator not only converges but also isn’t a terribly large value.
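As a rough sketch of how ESS might be estimated in practice, the code below computes sample autocorrelations and sums them until the first negative estimate. That truncation rule is a common heuristic assumed here, not something taken from the references. The test chain is an AR(1) process, chosen because its lag-*k* correlation is known to be φ^k, so the estimate can be compared with the exact value.

```python
import numpy as np

def autocorr(x, k):
    """Sample autocorrelation of the sequence x at lag k >= 1."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return np.dot(x[:-k], x[k:]) / np.dot(x, x)

def effective_sample_size(x, max_lag=1000):
    """n / (1 + 2 * sum of autocorrelations), truncating the sum at the
    first negative estimate (a heuristic assumed here, not from the text)."""
    total = 0.0
    for k in range(1, min(max_lag, len(x) - 1) + 1):
        rho = autocorr(x, k)
        if rho < 0:
            break
        total += rho
    return len(x) / (1 + 2 * total)

# Toy dependent chain: AR(1) with coefficient phi, so rho(k) = phi**k
rng = np.random.default_rng(1998)  # arbitrary seed
n, phi = 20_000, 0.5
x = np.empty(n)
x[0] = rng.normal()
for i in range(1, n):
    x[i] = phi * x[i - 1] + rng.normal()

print("estimated ESS:", round(effective_sample_size(x)))
print("theoretical ESS for AR(1):", round(n * (1 - phi) / (1 + phi)))
```

For this chain the geometric series in the denominator sums to φ/(1 – φ), which gives the theoretical value *n*(1 – φ)/(1 + φ).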

I’m not sure who first proposed this definition of ESS. There’s a reference to it in Handbook of Markov Chain Monte Carlo where the authors cite a paper [1] in which Radford Neal mentions it. Neal cites B. D. Ripley [2].

[1] Markov Chain Monte Carlo in Practice: A Roundtable Discussion. Robert E. Kass, Bradley P. Carlin, Andrew Gelman and Radford M. Neal. The American Statistician. Vol. 52, No. 2 (May, 1998), pp. 93-100

[2] Stochastic Simulation, B. D. Ripley, 1987.

The confidence region for the object flares out over time, something like the bell of a trumpet.

**Why does the region get larger**? Because there’s uncertainty in the velocity, and the velocity gets multiplied by elapsed time.

**Why isn’t the confidence region a cone**? Because that would ignore the uncertainty in the initial position. The result would be too small.

**Why isn’t the confidence region a truncated cone**? That’s not a bad approximation, though it’s a bit too large. If we ignore probability for a moment and treat confidence intervals as deterministic limits, then we get a truncated cone. For example, suppose position and velocity are each within two standard deviations of their estimates. Then we’d estimate position to be between *x*_{0} – 2σ_{x} + (*v*_{0} – 2σ_{v}) *t* and *x*_{0} + 2σ_{x} + (*v*_{0} + 2σ_{v}) *t*.

**So what is the confidence region**? It’s somewhere between the cone and the truncated cone.

The position *x* + *t* *v* is the sum of two random variables. The first has variance σ_{x}² and the second has variance *t*² σ_{v}². Variances of independent random variables add, so the standard deviation for the sum is

√(σ_{x}² + *t*² σ_{v}²) = *t* √(σ_{x}² / *t*² + σ_{v}²)

Note that as *t* increases, the latter approaches *t* σ_{v} from above. Ignoring the uncertainty in initial position underestimates standard deviation, but the relative error decreases as *t* increases.

For large *t*, a confidence interval for position at time *t* is approximately proportional to *t*, so the width of the confidence intervals over time looks like a cone. But for small *t*, the dependence on *t* is less linear and more curved.
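A small numerical sketch may help show where the trumpet shape sits relative to the two cones. The values of σ_x and σ_v are made up for illustration, and position and velocity errors are assumed independent, as in the variance calculation above.

```python
from math import sqrt

def position_sd(t, sigma_x, sigma_v):
    """Standard deviation of x + t*v for independent errors in x and v."""
    return sqrt(sigma_x**2 + (t * sigma_v)**2)

sigma_x, sigma_v = 1.0, 0.5   # hypothetical uncertainties

for t in (0, 1, 2, 5, 10, 100):
    cone = t * sigma_v                 # ignores uncertainty in initial position
    exact = position_sd(t, sigma_x, sigma_v)
    truncated = sigma_x + t * sigma_v  # adds the two worst cases together
    print(f"t={t:>3}  cone={cone:8.3f}  exact={exact:8.3f}  truncated cone={truncated:8.3f}")
```

At every time the exact standard deviation lies between the cone and the truncated cone, and the gap between the exact value and the cone shrinks in relative terms as *t* grows.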

If *a* and *b* are close, they don’t have to be very large for the beta PDF to be approximately normal. (In all the plots below, the solid blue line is the beta distribution and the dashed orange line is the normal distribution with the same mean and variance.)

On the other hand, when *a* and *b* are very different, the beta distribution can be skewed and far from normal. Note that *a* + *b* is the same in the example above and below.

Why the sharp corner above? The beta distribution is only defined on the interval [0, 1] and so the PDF is zero for negative values.

An application came up today that raised an interesting question: What if *a* + *b* is very large, but *a* and *b* are very different? The former works in favor of the normal approximation but the latter works against it.

The application had a low probability of success but a very large number of trials. Specifically, *a* + *b* would be on the order of a million, but *a* would be less than 500. Does the normal approximation hold for these kinds of numbers? Here are some plots to see.

When *a* = 500 the normal approximation is very good. It’s still pretty good when *a* = 50, but not good at all when *a* = 5.

**Update**: Mike Anderson suggested using the skewness to quantify how far a beta is from normal. Good idea.

The skewness of a beta(*a*, *b*) distribution is

2(*b* – *a*) √(*a* + *b* + 1) / ((*a* + *b* + 2) √(*ab*))

Let *N* = *a* + *b* and assume *N* is large and *a* is small, so that *N*, *N* + 2, *b* – *a*, and *N* – *a* are all approximately equal in ratios. Then the skewness is approximately 2/√*a*. In other words, once the number of trials is sufficiently large, skewness hardly depends on the number of trials and only depends on the number of successes.
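As a quick check of the approximation, the sketch below evaluates the exact skewness formula next to 2/√*a* for the values of *a* discussed above, taking *a* + *b* to be one million. The function name is just for illustration.

```python
from math import sqrt

def beta_skewness(a, b):
    """Skewness of a beta(a, b) distribution."""
    return 2 * (b - a) * sqrt(a + b + 1) / ((a + b + 2) * sqrt(a * b))

n = 1_000_000   # a + b, on the order of a million as in the application
for a in (500, 50, 5):
    exact = beta_skewness(a, n - a)
    approx = 2 / sqrt(a)
    print(f"a={a:>3}  exact={exact:.5f}  2/sqrt(a)={approx:.5f}")
```

The two columns agree closely, and the skewness grows as *a* shrinks, which matches the visual impression that the normal approximation degrades from *a* = 500 to *a* = 5.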

The previous post said that for almost all *x* > 1, the fractional parts of the powers of *x* are uniformly distributed. Although this is true for almost all *x*, it can be hard to establish for any particular *x*. The previous post ended with the question of whether the fractional parts of the powers of 3/2 are uniformly distributed.

First, let’s just plot the sequence (3/2)^{n} mod 1.

Looks kinda random. But is it uniformly distributed? One way to tell would be to look at the empirical cumulative distribution function (ECDF) and see how it compares to a uniform cumulative distribution function. This is what a quantile-quantile plot does. In our case we’re looking to see whether something has a uniform distribution, but you could use a q-q plot for any distribution. It may be most often used to test normality by looking at whether the ECDF looks like a normal CDF.

If a sequence is uniformly distributed, we would expect 10% of the values to be less than 0.1. We would expect 20% of the values to be less than 0.2. Etc. In other words, we’d expect the *quantiles* to line up with their theoretical values, hence the name “quantile-quantile” plot. On the horizontal axis we plot uniform values between 0 and 1. On the vertical axis we plot the sorted values of (3/2)^{n} mod 1.

A q-q plot indicates a good fit when values line up near the diagonal, as they do here.

For contrast, let’s look at a q-q plot for the powers of the plastic constant mod 1.

Here we get something very far from the diagonal line. The plot is flat on the left because many of the values are near 0, and it’s flat on the right because many values are near 1.

Incidentally, the Kolmogorov-Smirnov goodness of fit test is basically an attempt to quantify the impression you get from looking at a q-q plot. It’s based on a statistic that measures how far apart the empirical CDF and theoretical CDF are.
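Here is a minimal sketch of the computation behind the q-q plot for (3/2)^{n} mod 1, along with the Kolmogorov-Smirnov distance from the uniform distribution. It uses exact integer arithmetic, since the fractional part of (3/2)^{n} is (3^{n} mod 2^{n}) / 2^{n}, which avoids floating point trouble for large *n*. The function names and the choice of 1,000 terms are assumptions for illustration.

```python
def fractional_parts(count):
    """Fractional parts of (3/2)**n for n = 1..count, via exact integers."""
    return [(3**n % 2**n) / 2**n for n in range(1, count + 1)]

def ks_distance_from_uniform(values):
    """Largest gap between the empirical CDF and the uniform CDF on [0, 1]."""
    xs = sorted(values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

values = fractional_parts(1000)
print(values[:3])   # [0.5, 0.25, 0.375]

# Points of the q-q plot: theoretical uniform quantiles vs. sorted data.
qq_points = [((i + 0.5) / len(values), x) for i, x in enumerate(sorted(values))]

print(ks_distance_from_uniform(values))
```

A small distance means the sorted values hug the diagonal. This is evidence about the first 1,000 terms only, not a proof of uniform distribution.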

In his paper Mindless statistics, Gerd Gigerenzer uses a Freudian analogy to describe the mental conflict researchers experience over statistical hypothesis testing. He says that the “statistical ritual” of NHST (null hypothesis significance testing) “is a form of conflict resolution, like compulsive hand washing.”

In Gigerenzer’s analogy, the **id** represents Bayesian analysis. Deep down, a researcher wants to know the probabilities of hypotheses being true. This is something that Bayesian statistics makes possible, but more conventional frequentist statistics does not.

The **ego** represents R. A. Fisher’s significance testing: specify a null hypothesis only, not an alternative, and report a *p*-value. Significance is calculated after collecting the data. This makes it easy to publish papers. The researcher never clearly states his hypothesis, and yet takes credit for having established it after rejecting the null. This leads to feelings of guilt and shame.

The **superego** represents the Neyman-Pearson version of hypothesis testing: pre-specified alternative hypotheses, power and sample size calculations, etc. Neyman and Pearson insist that hypothesis testing is about what to *do*, not what to *believe*. [1]

I assume Gigerenzer doesn’t take this analogy too seriously. In context, it’s a humorous interlude in his polemic against rote statistical ritual.

But there really is a conflict in hypothesis testing. Researchers naturally think in Bayesian terms, and interpret frequentist results as if they were Bayesian. They really do want probabilities associated with hypotheses, and will imagine they have them even though frequentist theory explicitly forbids this. The rest of the analogy, comparing the ego and superego to Fisher and Neyman-Pearson respectively, seems weaker to me. But I suppose you could imagine Neyman and Pearson playing the role of your conscience, making you feel guilty about the pragmatic but unprincipled use of *p*-values.

* * *

[1] “No test based upon a theory of probability can by itself provide any valuable evidence of the truth or falsehood of a hypothesis. But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern behaviour in regard to them, in following which we insure that, in the long run of experience, we shall not often be wrong.”

Neyman J, Pearson E. On the problem of the most efficient tests of statistical hypotheses. *Philos Trans Roy Soc A*, 1933;231:289–337.

Excerpt from the new book Big Data of Complex Networks:

Big Data and data protection law provide for a number of mutual conflicts: from the perspective of Big Data analytics, **a strict application of data protection law as we know it today would set an immediate end to most Big Data applications**. From the perspective of the law, Big Data is either a big threat … or a major challenge for international and national lawmakers to adopt today’s data protection laws to the latest technological and economic developments.

Emphasis added.

The author of the chapter on legal matters is Swiss and writes primarily in a European context, though all countries face similar problems.

I’m not a lawyer, though I sometimes work with lawyers as a technical expert, and sometimes help companies with the statistical aspects of HIPAA law. But as a layman the observation above sounds reasonable to me, that strict application of the law could bring many applications to a halt, for better and for worse.

In my opinion the regulations around HIPAA and de-identification are mostly reasonable. The things it prohibits mostly should be prohibited. And it has a common sense provision in the form of expert determination. If your data uses fall outside the regulation’s specific recommendations but don’t endanger privacy, you can have an expert certify that this is the case.


Bayesian methods are often characterized as “subjective” because the user must choose a prior distribution, that is, a mathematical expression of prior information. The prior distribution requires information and user input, that’s for sure, but I don’t see this as being any more “subjective” than other aspects of a statistical procedure, such as the choice of model for the data (for example, logistic regression) or the choice of which variables to include in a prediction, the choice of which coefficients should vary over time or across situations, the choice of statistical test, and so forth. Indeed, Bayesian methods can in many ways be more “objective” than conventional approaches in that Bayesian inference, with its smoothing and partial pooling, is well adapted to including diverse sources of information and thus can reduce the number of data coding or data exclusion choice points in an analysis.

People worry about prior distributions, not because they’re subjective, but because they’re *explicitly* subjective. There are many other subjective factors, common to Bayesian and Frequentist statistics, but these are *implicitly* subjective.

In practice, prior distributions often don’t make much difference. For example, you might show that an optimistic prior and a pessimistic prior lead to the same conclusion.

If you have so little data that the choice of prior does make a substantial difference, being able to specify the prior is a benefit. Suppose you only have a little data but have to make a decision anyway. A frequentist might say there’s too little data for a statistical analysis to be meaningful. So what do you do? Make a decision entirely subjectively! But with a Bayesian approach, you capture what is known outside of the data at hand in the form of a prior distribution, and update the prior with the little data you have. In this scenario, a Bayesian analysis is less subjective and more informed by the data than a frequentist approach.

It’s possible that treatment *X* is doing so poorly that you want to end the trial without going any further. It’s also possible that *X* is doing so well that you want to end the trial early. Both of these are rare. Most of the time an interim analysis is more concerned with **futility**. You might want to stop the trial early not because the results are really good, or really bad, but because the results are really mediocre! That is, treatments *X* and *Y* are performing so similarly that you’re afraid that you won’t be able to declare one or the other better.

Maybe treatment *X* is doing a little better than *Y*, but not so much better that you can declare with confidence that *X* is better. You might want to stop for futility if you project that not only do you not have enough evidence now, you don’t believe you will have enough evidence by the end of the trial.

Futility analysis is more about resources than ethics. If *X* is doing poorly, ethics might dictate that you stop giving *X* to patients so you stop early. If *X* is doing spectacularly well, ethics might dictate that you stop giving the control treatment, if there is an active control. But if *X* is doing so-so, there’s usually not an ethical reason to stop, unless *X* is worse than *Y* on some secondary criteria, such as having worse side effects. You want to end futile studies so you can save resources and get on with the next study, and you could argue that’s an ethical consideration, though less direct.

Futility analysis isn’t about your current estimate of effectiveness. It’s about what you think your estimate of effectiveness will be in the future. That is, it’s a second order prediction. You’re trying to understand the effectiveness of the **trial**, not of the **treatment** per se. You’re not trying to estimate a parameter, for example, but trying to estimate what range of estimates you’re likely to make.

This is why **predictive probability** is natural for interim analysis. You’re trying to predict **outcomes**, not **parameters**. (This is subtle: you’re trying to estimate the probability of **outcomes** that lead to certain estimates of **parameters**, namely those that allow you to reach a conclusion with pre-specified significance.)

Predictive probability is a Bayesian concept, but it is useful in analyzing frequentist trial designs. You may have frequentist conclusion criteria, such as a *p*-value threshold or some requirements on a confidence interval, but you want to know how likely it is that if the trial continues, you’ll see data that lead to meeting your criteria. In that case you want to compute the (**Bayesian**) predictive probability of meeting your **frequentist** criteria!
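As a rough illustration of the idea, here is a sketch for a hypothetical single-arm trial with a binary outcome. Everything specific is an assumption made for the example: the beta(1, 1) prior, the null response rate of 0.5, the one-sided exact binomial test at the 0.05 level as the frequentist criterion, and the interim numbers. The calculation averages over future outcomes using a beta-binomial distribution and adds up the probability of those outcomes that would meet the criterion at the end of the trial.

```python
from math import comb, exp, lgamma

def log_beta(x, y):
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def beta_binomial_pmf(j, m, a, b):
    """Probability of j successes among m future subjects when p ~ beta(a, b)."""
    return comb(m, j) * exp(log_beta(a + j, b + m - j) - log_beta(a, b))

def binomial_p_value(successes, n, p0):
    """One-sided exact p-value: P(X >= successes) for X ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(successes, n + 1))

def predictive_prob_of_success(s, n, n_max, p0=0.5, alpha=0.05, a=1.0, b=1.0):
    """Bayesian predictive probability of meeting the frequentist criterion.

    s successes have been seen in n subjects; the trial enrolls n_max in all.
    """
    m = n_max - n                          # subjects still to come
    post_a, post_b = a + s, b + (n - s)    # posterior after the interim data
    return sum(beta_binomial_pmf(j, m, post_a, post_b)
               for j in range(m + 1)
               if binomial_p_value(s + j, n_max, p0) < alpha)

# Hypothetical interim look: 12 responses in 20 subjects, 40 planned.
print(predictive_prob_of_success(s=12, n=20, n_max=40))
```

A futility rule would stop the trial when this number falls below some pre-specified cutoff. The cutoff, like the rest of the design, would need to be justified for the trial at hand.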

Most people learn R as they learn statistics: Here’s a statistical concept, and here’s how you can compute it in R. Statisticians aren’t that interested in the R language itself but see it as connective tissue between commands that are their primary interest.

This works for statisticians, but it makes the language hard for non-statisticians to approach. Years ago I managed a group of programmers who supported statisticians. At the time, there were no books for learning R without concurrently learning statistics. This created quite a barrier to entry for programmers whose immediate concern was not the statistical content of an R program.

Now there are more books on R, and some are more approachable to non-statisticians. The most accessible one I’ve seen so far is Learning Base R by Lawrence Leemis. It gets into statistical applications of R—that is ultimately why anyone is interested in R—but it doesn’t *start* there. The first 40% or so of the book is devoted to basic language features, things you’re supposed to pick up by osmosis from a book focused more on statistics than on R *per se*. This is the book I wish I could have handed my programmers who had to pick up R.

With no more information than this, what would you estimate the probability to be that the treatment is effective in the next subject? Easy: 0.7.

Now what would you estimate the probability to be that the treatment is effective in the next **two** subjects? You might say 0.49, and that would be correct if we **knew** that the probability of response is 0.7. But there’s uncertainty in our estimate. We don’t know that the response rate is 70%, only that we saw a 70% response rate in our small sample.

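If the probability of success is *p*, then the probability of *s* successes and *f* failures in the next *s* + *f* subjects is given by

C(*s* + *f*, *s*) *p*^{s} (1 – *p*)^{f}

But if our probability of success has some uncertainty and we assume it has a beta(*a*, *b*) distribution, then the **predictive probability** of *s* successes and *f* failures is given by

C(*s* + *f*, *s*) B(*s* + *a*, *f* + *b*) / B(*a*, *b*)

where C(*n*, *k*) is the binomial coefficient and

B(*x*, *y*) = Γ(*x*) Γ(*y*) / Γ(*x* + *y*)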

In our example, after seeing 7 successes out of 10 subjects, we estimate the probability of success by a beta(7, 3) distribution. Then this says the predictive probability of two successes is approximately 0.51, a little higher than the naive estimate of 0.49. Why is this?

We’re not assuming the probability of success is 0.7, only that the mean of our estimate of the probability is 0.7. The actual probability might be higher or lower. The predictive probability calculates the probability of outcomes under all possible values of the probability, then creates a weighted average, weighing each probability of success by the probability of that value. The differences corresponding to probability above and below 0.7 approximately balance out, but the former carry a little more weight and so we get roughly what we did before.

If this doesn’t seem right, note that mean and median aren’t the same thing for asymmetric distributions. A beta(7,3) distribution has mean 0.7, but it has a probability of 0.537 of being larger than 0.7.

If our initial experiment had shown 70 successes out of 100 instead of 7 out of 10, the predictive probability of two successes would have been 0.492, closer to the value based on the point estimate, but still different.
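The two numbers above can be reproduced with a few lines. This sketch assumes the standard beta-binomial form of the predictive probability, a binomial coefficient times a ratio of beta functions, and works with log-gamma to avoid overflow. The function names are illustrative.

```python
from math import comb, exp, lgamma

def log_beta(x, y):
    """Log of the beta function B(x, y) = Gamma(x) Gamma(y) / Gamma(x + y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def predictive_prob(s, f, a, b):
    """Probability of s successes and f failures when p ~ beta(a, b)."""
    return comb(s + f, s) * exp(log_beta(s + a, f + b) - log_beta(a, b))

print(round(predictive_prob(2, 0, 7, 3), 3))     # 0.509
print(round(predictive_prob(2, 0, 70, 30), 3))   # 0.492
print(round(0.7**2, 3))                          # 0.49, the point-estimate answer
```

For two successes in a row the formula reduces to *a*(*a* + 1) / ((*a* + *b*)(*a* + *b* + 1)), which is 56/110 for a beta(7, 3) distribution.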

The further we look ahead, the more difference there is between using a point estimate and using a distribution that incorporates our uncertainty. Here are the probabilities for the number of successes out of the next 100 outcomes, using the point estimate 0.7 and using predictive probability with a beta(7,3) distribution.

So if we’re sure that the probability of success is 0.7, we’re pretty confident that out of 100 trials we’ll see between 60 and 80 successes. But if we model our uncertainty in the probability of response, we get quite a bit of uncertainty when we look ahead to the next 100 subjects. Now we can say that the number of responses is likely to be between 30 and 100.

With data from other distributions, the mean and variance may not be sufficient statistics, and in fact there may be no (useful) sufficient statistics. The full data set is more informative than any summary of the data. But out of habit people may think that the mean and variance are enough.

Probability distributions are an idealization, of course, and so data never exactly “come from” a distribution. But if you’re satisfied with a distributional idealization of your data, there may be useful sufficient statistics.

Suppose you have data with such large outliers that you seriously doubt that it could be coming from anything appropriately modeled as a normal distribution. You might say the definition of sufficient statistics is wrong, that the full data set tells you something you couldn’t know from the summary statistics. But the sample mean and variance are still sufficient statistics in this case. They really are sufficient, *conditional on the normality assumption*, which you don’t believe! The cognitive dissonance doesn’t come from the definition of sufficient statistics but from acting on an assumption you believe to be false.

***

[1] Technically every distribution has sufficient statistics, though the sufficient statistic might be the same size as the original data set, in which case the sufficient statistic hasn’t contributed anything useful. Roughly speaking, distributions have useful sufficient statistics if they come from an “exponential family,” a set of distributions whose densities factor a certain way.

For example, take English letter frequencies. These frequencies are fairly well known. E is the most common letter, followed by T, then A, etc. The string of letters “ETAOIN SHRDLU” comes from the days of Linotype when letters were arranged in that order, in decreasing order of frequency. Sometimes you’d see ETAOIN SHRDLU in print, just as you might see “QWERTY” today.

Morse code is also based on English letter frequencies. The length of a letter in Morse code varies approximately inversely with its frequency, a sort of precursor to Huffman encoding. The most common letter, E, is a single dot, while the rarer letters like J and Q have a dot and three dashes. (So does Y, even though it occurs more often than some letters with shorter codes.)

So how frequently does the letter E, for example, appear in English? That depends on what you mean by English. You can count how many times it appears, for example, in a particular edition of A Tale of Two Cities, but that isn’t the same as its frequency in English. And if you’d picked the novel Gadsby instead of A Tale of Two Cities you’d get very different results since that book was written without using a single letter E.

Peter Norvig reports that E accounted for 12.49% of English letters in his analysis of the Google corpus. That’s a better answer than just looking at Gadsby, or even A Tale of Two Cities, but it’s still not *English*.
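Counting letters in a particular text is the easy part. Here is a minimal sketch using the standard library; the function name and the tiny sample sentence are made up for illustration, and only the letters A through Z are counted.

```python
from collections import Counter

def letter_frequencies(text):
    """Relative frequency of each letter A-Z in text, most common first."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    if not letters:
        return {}
    counts = Counter(letters)
    return {letter: count / len(letters)
            for letter, count in counts.most_common()}

sample = "It was the best of times, it was the worst of times"
for letter, freq in list(letter_frequencies(sample).items())[:3]:
    print(letter, round(freq, 3))
```

In this short sample T comes out ahead of E, a small reminder that a frequency computed from one text is a statement about that text, not about English.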

What might we mean by “English” when discussing letter frequency? Written or spoken English? Since when? American, British, or worldwide? If you mean blog articles, I’ve altered the statistics from what they were a moment ago by publishing this. Introductory statistics books avoid this kind of subtlety by distinguishing between samples and populations, but in this case the population isn’t a fixed thing. When we say “English” as a whole we have in mind some idealization that strictly speaking doesn’t exist.

If we want to say, for example, what the frequency of the letter E is in English as a whole, not some particular English corpus, we can’t answer that to too many decimal places. Nor can we say, for example, which letter is the 18th most frequent. Context could easily change the second decimal place in a letter’s frequency or, among the less common letters, its frequency rank.

And yet, for practical purposes we can say E is the most common letter, then T, etc. We can design better Linotype machines and telegraphy codes using our understanding of letter frequency. At the same time, we can’t expect too much of this information. Anyone who has worked a cryptogram puzzle knows that you can’t say with certainty that the most common letter in a particular sample *must* correspond to E, the next to T, etc.

By the way, Peter Norvig’s analysis suggests that ETAOIN SHRDLU should be updated to ETAOIN SRHLDCU.


and we can generalize this to the Mittag-Leffler function

*E*_{α, β}(*x*) = Σ_{k=0}^{∞} *x*^{k} / Γ(α*k* + β)

which reduces to the exponential function when α = β = 1. There are a few other values of α and β for which the Mittag-Leffler function reduces to more familiar functions. For example,

*E*_{2, 1}(*x*) = cosh(√*x*)

and

*E*_{1/2, 1}(*x*) = exp(*x*²) erfc(–*x*)

where erfc(*x*) is the complementary error function.

Mittag-Leffler was one person, not two. When I first saw the Mittag-Leffler theorem in complex analysis, I assumed it was named after two people, Mittag and Leffler. But the theorem and the function discussed here are named after one man, the Swedish mathematician Magnus Gustaf (Gösta) Mittag-Leffler (1846–1927).

The function that Mr. Mittag-Leffler originally introduced did not have a β parameter; that generalization came later. The function *E*_{α} is *E*_{α, 1}.
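For readers who want to experiment, here is a naive sketch that sums the power series Σ *x*^{k} / Γ(α*k* + β) directly. It assumes α > 0, β > 0 and moderate |*x*|, and the function name and the number of terms are arbitrary choices. Direct summation loses accuracy for large negative arguments, so this is a toy for exploring the definition, not a robust implementation.

```python
from math import exp, lgamma, log

def mittag_leffler(x, alpha, beta=1.0, terms=200):
    """Partial sum of x**k / Gamma(alpha*k + beta) for k = 0 .. terms - 1.

    Terms are computed through log-gamma so that Gamma itself never overflows.
    """
    total = exp(-lgamma(beta))             # k = 0 term, 1 / Gamma(beta)
    if x == 0:
        return total
    log_abs_x = log(abs(x))
    for k in range(1, terms):
        magnitude = exp(k * log_abs_x - lgamma(alpha * k + beta))
        # For negative x the terms alternate in sign.
        total += magnitude if (x > 0 or k % 2 == 0) else -magnitude
    return total

# With alpha = beta = 1 the series is the exponential series.
print(mittag_leffler(1.0, 1.0), exp(1.0))
print(mittag_leffler(-2.0, 1.0), exp(-2.0))
```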

Just as you can make a couple probability distributions out of the exponential function, you can make a couple probability distributions out of the Mittag-Leffler function.

The exponential function exp(-*x*) is positive over [0, ∞) and integrates to 1, so we can define a probability distribution whose density (PDF) function is *f*(*x*) = exp(-*x*) and whose distribution function (CDF) is *F*(*x*) = 1 – exp(-*x*). The Mittag-Leffler distribution has CDF 1 – *E*_{α}(-*x*^{α}) and so reduces to the exponential distribution when α = 1. For 0 < α < 1, the Mittag-Leffler distribution is a heavy-tailed generalization of the exponential. [1]

The Poisson distribution comes from taking the power series for exp(λ), normalizing it to 1, and using the *k*th term as the probability mass for *k*. That is,

P(*X* = *k*) = exp(–λ) λ^{k} / *k*!

The analogous discrete Mittag-Leffler distribution [2] has probability mass function

P(*X* = *k*) = λ^{k} / (Γ(α*k* + β) *E*_{α, β}(λ))

In addition to probability and statistics, the Mittag-Leffler function comes up in fractional calculus. It plays a role analogous to that of the exponential function in classical calculus. Just as the solution to the simple differential equation

*y*′ = *a* *y*

is exp(*ax*), for 0 < μ < 1, the solution to the **fractional differential equation**

d^{μ}*y* / d*x*^{μ} = *a* *y*

is *a* *x*^{μ-1} *E*_{μ, μ}(*a* *x*^{μ}). Note that this reduces to exp(*ax*) when μ = 1. [3]

[1] Gwo Dong Lin. On the Mittag–Leffler distributions. Journal of Statistical Planning and Inference 74 (1998) 1–9.

[2] Subrata Chakraborty, S. H. Ong. Mittag-Leffler function distribution: A new generalization of hyper-Poisson distribution. arXiv:1411.0980v1

[3] Keith Oldham, Jan Myland, Jerome Spanier. An Atlas of Functions. Springer.

Some zip codes are so sparsely populated that people living in these areas are relatively easy to identify if you have other data. The so-called Safe Harbor provision of HIPAA (Health Insurance Portability and Accountability Act) says that it’s usually OK to include the first three digits of someone’s zip code in de-identified data. But there are 17 areas so thinly populated that even listing the first three digits of their zip code is considered too much of an identification risk. These are areas such that the first three digits of the zip code are:

- 036
- 059
- 063
- 102
- 203
- 556
- 692
- 790
- 821
- 823
- 830
- 831
- 878
- 879
- 884
- 890
- 893

This list could change over time. These are the regions that currently contain fewer than 20,000 people, the criterion given in the HIPAA regulations.
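In a de-identification script, the list above might be applied along these lines. This is a sketch under stated assumptions: the function name is hypothetical, the set holds the 17 prefixes listed above and would need updating as populations change, and replacing a restricted prefix with "000" reflects a common reading of the Safe Harbor rule that should be confirmed against the regulation itself.

```python
# The 17 three-digit prefixes listed above; this set can go out of date.
RESTRICTED_ZIP3 = {
    "036", "059", "063", "102", "203", "556", "692", "790", "821",
    "823", "830", "831", "878", "879", "884", "890", "893",
}

def safe_harbor_zip3(zip_code):
    """Return the three-digit prefix to keep, or "000" for restricted areas."""
    prefix = str(zip_code).strip()[:3]
    if len(prefix) != 3 or not prefix.isdigit():
        raise ValueError(f"not a usable zip code: {zip_code!r}")
    # Assumption: sparsely populated prefixes are masked as "000".
    return "000" if prefix in RESTRICTED_ZIP3 else prefix

print(safe_harbor_zip3("77002"))   # 770
print(safe_harbor_zip3("82190"))   # 000
```

Zip codes are handled as strings throughout, since storing them as integers would drop the leading zero in prefixes such as 036.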

Knowing that someone is part of an area containing 20,000 people hardly identifies them. The concern is that in combination with other information, zip code data is more informative in these areas.

**Related post**: Bayesian clinical trials in one zip code

John Tukey coined many terms that have passed into common use, such as **bit** (a shortening of **binary digit**) and software. Other terms he coined are well known within their niche: **boxplot**, **ANOVA**, **rootogram**, etc. Some of his terms, such as **jackknife** and **vacuum cleaner**, were not new words *per se* but common words he gave a technical meaning to.

**Cepstrum** is an anagram of spectrum. It involves an unusual use of power spectra, and is roughly analogous to making anagrams of a word. A related term, one we will get to shortly, is **quefrency**, an anagram of **frequency**. Some people pronounce the ‘c’ in cepstrum hard (like ‘k’) and some pronounce it soft (like ‘s’).

Let’s go back to an example from my post on guitar distortion. Here’s a note played with a fairly large amount of distortion:

And here is its power spectrum:

There’s a lot going on in the spectrum, but the peaks are very regularly spaced. As I mentioned in the post on the sound of a leaf blower, this is the fingerprint of a sound with a definite pitch. Spikes in the spectrum alone don’t indicate a definite pitch if they are irregularly spaced.

The peaks are fairly periodic. How do you find periodic patterns in a signal? Fourier transform! But if you simply take the Fourier transform of a Fourier transform, you essentially get the original signal back. The key to the cepstrum is to do something else between the two Fourier transforms.

The cepstrum starts by taking the Fourier transform, then the magnitude, then the logarithm, and then the inverse Fourier transform.

When we take the magnitude, we throw away phase information, which we don’t need in this context. Taking the log of the magnitude is essentially what you do when you compute sound pressure level. Some define the cepstrum using the magnitude of the Fourier transform and some the magnitude squared. Squaring only introduces a multiple of 2 once we take logs, so it doesn’t affect the location of peaks, only their amplitude.

Taking the logarithm compresses the peaks, bringing them all into roughly the same range, making the sequence of peaks roughly periodic.

When we take the inverse Fourier transform, we now have something like a frequency, but inverted. This is what Tukey called **quefrency**.

Looking at the guitar power spectrum above, we see a sequence of peaks spaced 440 Hz apart. When we take the inverse Fourier transform of this, we’re looking at a sort of frequency of a frequency, what Tukey calls quefrency. The quefrency scale is inverted: sounds with a high frequency fundamental have overtones that are far apart on the frequency domain, so the sequence of the overtone peaks has *low* frequency.

Here’s the plot of the cepstrum for the guitar sample.

There’s a big peak at 109 on the quefrency scale. The audio clip was recorded at 48000 samples per second, so the 109 on the quefrency scale corresponds to a frequency of 48000/109 = 440 Hz. The second peak is at quefrency 215, which corresponds to 48000/215 = 223 Hz. The second peak corresponds to the perceived pitch of the note, A3, and the first peak corresponds to its first harmonic, A4. (Remember the quefrency scale is inverted relative to the frequency scale.)
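The four steps (Fourier transform, magnitude, logarithm, inverse Fourier transform) are short enough to sketch in code. The example below is self-contained and uses a naive O(n²) DFT from the standard library so that it runs anywhere; for real audio one would normally reach for an FFT routine such as NumPy’s `numpy.fft`. The synthetic tone, the sample rate, and the small `floor` added before taking logs (to avoid log(0) on a perfectly clean signal) are all assumptions for illustration, not the guitar recording discussed above.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive discrete Fourier transform; fine for a few hundred samples."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(v * cmath.exp(sign * 2j * math.pi * ((k * t) % n) / n)
                    for t, v in enumerate(values))
        out.append(total / n if inverse else total)
    return out

def cepstrum(signal, floor=1e-6):
    """Fourier transform -> magnitude -> log -> inverse Fourier transform."""
    log_magnitude = [math.log(abs(c) + floor) for c in dft(signal)]
    return [c.real for c in dft(log_magnitude, inverse=True)]

# Synthetic "note": a 200 Hz fundamental plus four weaker overtones,
# sampled at 8000 samples per second for 400 samples.
sample_rate = 8000
fundamental = 200
signal = [sum(math.cos(2 * math.pi * h * fundamental * t / sample_rate) / h
              for h in range(1, 6))
          for t in range(400)]

ceps = cepstrum(signal)

# Skip quefrency 0 and look for the largest peak at low quefrency.
peak = max(range(1, 60), key=lambda q: ceps[q])
print(peak, sample_rate / peak)   # 40 200.0
```

The peak lands at quefrency 40, and dividing the sample rate by the quefrency recovers the 200 Hz fundamental, the same conversion as 48000/109 above.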

I cheated a little bit in the plot above. The very highest peaks are at 0. They are so large that they make it hard to see the peaks we’re most interested in. These low quefrency peaks correspond to very high frequency noise, near the edge of the audible spectrum or beyond.


I believe rib eye steaks are better for you than rat poison. My basis for that belief is anecdotal evidence. People who have eaten rib eye steaks have fared better than people who have eaten rat poison. I don’t have exact numbers on that, but I’m pretty sure it’s true. I have more confidence in that than in any clinical trial conclusion.

Hearsay evidence about food isn’t very valuable, per observation, but since millions of people have eaten steak for thousands of years, the cumulative weight of evidence is pretty good that steak is harmless if not good for you. The number of people who have eaten rat poison is much smaller, but given the large effect size, there’s ample reason to suspect that eating rat poison is a bad idea.

Now suppose you want to get more specific and determine whether rib eye steaks are good for *you* in particular. (I wouldn’t suggest trying rat poison.) Suppose you’ve noticed that you feel better after eating a steak. Is that an anecdote or data? What if you look back through your diary and notice that every mention of eating steak lately has been followed by some remark about feeling better than usual? Is that data? What if you decide to flip a coin each day for the next month and eat steak if the coin comes up heads and tofu otherwise? Each of these steps is an improvement, but there’s no magical line you cross between anecdote and data.

Suppose you’re destructively testing the strength of concrete samples. There are better and worse ways to conduct such experiments, but each sample gives you valuable data. If you test 10 samples and they all withstand two tons of force per square inch, you have good reason to believe the concrete the samples were taken from can withstand such force. But if you test a drug on 10 patients, you can’t have the same confidence that the drug is effective. Human subjects are more complicated than concrete samples. Concrete samples aren’t subject to placebo effects. Also, cause and effect are more clear for concrete. If you apply a load and the sample breaks, you can assume the load caused the failure. If you treat a human for a disease and they recover, you can’t be as sure that the treatment caused the recovery. That doesn’t mean medical observations aren’t data.

Carefully collected observations in one area may be less statistically valuable than anecdotal observations in another. Observations are never ideal. There’s always some degree of bias, effects that can’t be controlled, etc. There’s no quantum leap between useless anecdotes and perfectly informative data. Some data are easy to draw inference from, but data that’s harder to understand doesn’t fail to be data.
