Toxic pairs, re-identification, and information theory

Database fields can combine in subtle ways. For example, nationality is not usually enough to identify anyone. Neither is religion. But the combination of nationality and religion can be surprisingly informative.

Information content of nationality

How much information is contained in nationality? That depends on exactly how you define nations versus territories etc., but for this blog post I’ll take this Wikipedia table for my raw data. You can calculate that nationality has entropy of 5.26 bits. That is, on average, nationality is slightly more informative than asking five independent yes/no questions. (See this post for how to calculate information content.)

Entropy measures expected information content. Knowing that someone is from India (population 1.3 billion) carries only 2.50 bits of information. Knowing that someone is from Vatican City (population 800) carries 23.16 bits of information.
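
Here’s a minimal Python sketch of both calculations, using round population figures rather than the exact Wikipedia table:

      from math import log2

      def entropy(counts):
          # Shannon entropy in bits: the expected value of the surprisal below
          total = sum(counts)
          return -sum(c/total * log2(c/total) for c in counts if c > 0)

      def surprisal(count, total):
          # information content in bits of one category: -log2 of its probability
          return -log2(count/total)

      # out of a world population of roughly 7.5 billion:
      print(surprisal(1.3e9, 7.5e9))   # India: about 2.5 bits
      print(surprisal(800, 7.5e9))     # Vatican City: about 23.2 bits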

One way to reduce the re-identification risk of PII (personally identifiable information) such as nationality is to combine small categories. Suppose we lump all countries with a population under one million into “other.” Then we go from 240 categories down to 160. This hardly makes any difference to the entropy: it drops from 5.26 bits to 5.25 bits. But the information content for the smallest country on the list is now 8.80 bits rather than 23.16.
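
In code, the lumping step might look like this, reusing the entropy function above; the one-million cutoff is the only tunable part:

      def lump_small(counts, threshold=1_000_000):
          # merge every category below the threshold into a single "other" bin
          big = [c for c in counts if c >= threshold]
          other = sum(c for c in counts if c < threshold)
          return big + [other] if other > 0 else big

      # entropy(lump_small(populations)) is nearly equal to entropy(populations),
      # but the rarest remaining category is far less identifying.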

Information content of religion

What about religion? This is also subtle to define, but again I’ll use Wikipedia for my data. Using these numbers, we get an entropy of 2.65 bits. The largest religion, Christianity, has an information content of 1.67 bits. The smallest religion on the list, Rastafari, has an information content of 13.29 bits.

Joint information content

So if nationality carries 5.25 bits of information and religion 2.65 bits, how much information does the combination of nationality and religion carry? At least 5.25 bits, but no more than 5.25 + 2.65 = 7.9 bits on average. For two random variables X and Y, the joint entropy H(X, Y) satisfies

max( H(X), H(Y) ) ≤ H(X, Y) ≤ H(X) + H(Y)

where H(X) and H(Y) are the entropy of X and Y respectively.

Computing the joint entropy exactly would require getting into the joint distribution of nationality and religion. I’d rather not get into this calculation in detail, except to discuss possible toxic pairs. On average, the information content of the combination of nationality and religion is no more than the sum of the information content of each separately. But particular combinations can be highly informative.
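
A toy example illustrates both points. The joint distribution below is made up, not real data: two nationalities, two religions, and one rare combination.

      from math import log2

      def H(probs):
          # entropy in bits of a probability distribution
          return -sum(p * log2(p) for p in probs if p > 0)

      # made-up joint distribution over (nationality, religion)
      joint = {("A", "r1"): 0.49, ("A", "r2"): 0.01,
               ("B", "r1"): 0.10, ("B", "r2"): 0.40}

      print(H([0.50, 0.50]))     # H(X), nationality marginal: 1.00 bits
      print(H([0.59, 0.41]))     # H(Y), religion marginal: about 0.98 bits
      print(H(joint.values()))   # H(X, Y): about 1.43 bits, between the bounds

      # but one rare combination by itself is highly informative:
      print(-log2(0.01))         # about 6.6 bits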

For example, there are not a lot of Jews living in predominantly Muslim countries. According to one source, there are at least five Jews living in Iraq. Other sources put the estimate at “less than 10.” (There are zero Jews living in Libya.)

Knowing that someone is a Christian living in Mexico, for example, would not be highly informative. But knowing someone is a Jew living in Iraq would be extremely informative.

Why don’t you simply use XeTeX?

From an FAQ post I wrote a few years ago:

This may seem like an odd question, but it’s actually one I get very often. On my TeXtip Twitter account, I include tips on how to create non-English characters, such as using \AA to produce Å. Every time, someone asks “Why not use XeTeX and just enter these characters?”

If you can “just enter” non-English characters, then you don’t need a tip. But a lot of people either don’t know how to do this or don’t have a convenient way to do so. Most English speakers only need to type foreign characters occasionally, and will find it easier, for example, to type \AA or \ss than to learn how to produce Å or ß from a keyboard. If you frequently need to enter Unicode characters, and know how to do so, then XeTeX is great.

One does not simply type Unicode characters.

Team dynamics and encouragement

When you add people to a project, the total productivity of the team as a whole may go up, but the productivity per person usually goes down. Someone suggested that as a rule of thumb, a company needs to triple its number of employees to double its productivity. Fred Brooks summarized this, saying:

“Many hands make light work” — Often
But many hands make more work — Always

I’ve seen this over and over. But I think I’ve found an exception. When work is overwhelming, a lot of time is absorbed by discouragement and indecision. In that case, new people can make a big improvement. They not only get work done, but they can make others feel more like working.

Flood cleanup is like that, and that’s what motivated this note. Someone new coming by to help energizes everyone else. And with more people, you see progress sooner and make more progress, in a sort of positive feedback loop.

This is all in the context of fairly small teams. There must be a point where adding more people decreases productivity per person or even total productivity. I’ve heard reports of a highly bureaucratic relief organization that makes things worse when it shows up to “help.” The ideal team size is somewhere between a couple of discouraged individuals and a bloated bureaucracy.

Related post: Optimal team size

Relearning from a new perspective

I had a conversation with someone today who said he’s relearning logic from a categorical perspective. What struck me about this was not the specifics but the pattern:

Relearning _______ from a _______ perspective.

Not relearning something forgotten, but going back over something you already know well from a different starting point, a different approach, etc.

Have any experiences along these lines you’d like to share in the comments? Anything you have relearned, attempted to relearn, or would like to relearn from a new angle?

Hurricane Harvey update

As you may know, I live in the darkest region of the rainfall map below.

Hurricane Harvey rainfall map

My family and I are doing fine. Our house has not flooded, and at this point it looks like it will not flood. We’ve only lost electricity for a second or two.

Of course not everyone in Houston is doing so well. Harvey has done tremendous damage. Downtown was hit especially hard, and apparently they are in for more heavy rain. But it looks like the worst may be over for my area.

Update (5:30 AM, August 28): More flooding overnight, some of it nearby. We’re still OK. It looks like the heaviest rain is over, but there’s still rain in the forecast and there’s no place for more rain to go.

Houston has two enormous reservoirs west of town that together hold about half a billion cubic meters of water. This morning they started releasing water from the reservoirs to prevent dams from breaking.

Space City Weather has been the best source of information. The site offers “hype-free forecasts for greater Houston.” It’s a shame that a news source should have to describe itself as “hype-free,” but they are indeed hype-free and other sources are not.

Update (August 29): Looks like the heavy rain is over. We’re expecting rain for a few more days, but the water is receding faster than it’s collecting, at least on the northwest side.

Solving problems we wish we had

There’s a great line from Heather McGaw toward the end of the latest episode of 99 Percent Invisible:

Sometimes … we can start to solve problems that we wish were problems because they’re easy to solve.

Reminds me of an excerpt from Richard Weaver’s book Ideas Have Consequences:

Obsession, according to the canons of psychology, occurs when an innocuous idea is substituted for a painful one. The victim simply avoids recognizing the thing which will hurt. We have seen that the most painful confession for the modern egoist to make is that there is a center or responsibility. He has escaped it by taking his direction with reference to the smallest points. … The obsession, however, is a source of great comfort to the obsessed.

Subscribing by email

You can subscribe to my blog by email or RSS. I also have a brief newsletter you could sign up for. There are links to these in the sidebar of the blog:

subscription options

If you subscribe by email, you’ll get an email each morning containing the post(s) from the previous day.

I just noticed a problem with email subscription: it doesn’t show SVG images, at least when reading via Gmail; maybe other email clients display SVG correctly. Here’s what a portion of yesterday’s email looks like in Gmail:

screen shot of missing image

I’ve started using SVG for graphs, equations, and a few other images. The main advantage to SVG is that the images look sharper. Also, you can display the same image file at any resolution; no need to have different versions of the image for display at different sizes. And sometimes SVG files are smaller than their raster counterparts.

There may be a way to have web site visitors see SVG and email subscribers see PNG. If not, email subscribers can click on the link at the top of each post to open it in a browser and see all the images.

By the way, RSS readers handle SVG just fine. At least Digg Reader, the RSS reader I use, works well with SVG. The only problem I see is that centered content is always moved to the left.

* * *

The email newsletter is different from the email blog subscription. I only send out a newsletter once a month. It highlights the most popular posts and says a little about what I’ve been up to. I just sent out a newsletter this morning, so it’ll be another month before the next one comes out.

Color theory questions

Here’s a script I wanted to write: given a color c specified in RGB and an angle θ, rotate c on the color wheel by θ and return the RGB value of the result.

You can’t rotate RGB values per se, but you can rotate hues. So my initial idea was to convert RGB to HSV or HSL, rotate the H component, then convert back to RGB. There are some subtleties with converting between RGB and either HSV or HSL, but I’m willing to ignore those for now.
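
Here’s a minimal sketch of that approach using Python’s standard colorsys module; rotate_hue is my own name for the helper:

      import colorsys

      def rotate_hue(rgb, degrees):
          # RGB components in [0, 1]; colorsys stores hue as a fraction of a turn
          h, s, v = colorsys.rgb_to_hsv(*rgb)
          h = (h + degrees/360.0) % 1.0
          return colorsys.hsv_to_rgb(h, s, v)

      print(rotate_hue((1.0, 0.0, 0.0), 180))   # red -> (0.0, 1.0, 1.0), i.e. cyan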

The problem I ran into was that my idea of a color wheel doesn’t match the HSV or HSL color wheels. For example, I’m thinking of green as the complementary color to red, the color 180° away. On the HSV and HSL color wheels, the complementary color to red is cyan. The color wheel I have in mind is the “artist’s color wheel” based on the RYB color space, not RGB. Subtractive color, not additive.

This brings up several questions.

  • How do you convert back and forth between RYB and RGB?
  • How do you describe the artist’s color wheel mathematically, in RYB or any other system?
  • What is a good reference on color theory? I’d like to understand in detail how the various color systems relate, something that spans the gamut (pun intended) from an artist’s perspective down to physics and physiology.

Student’s future, teacher’s past

“Teachers should prepare the student for the student’s future, not for the teacher’s past.” — Richard Hamming

I ran across the above quote from Hamming this morning. It made me wonder whether I tried to prepare students for my past back when I taught college.

How do you prepare a student for the future? Mostly by focusing on skills that will always be useful, even as times change: logic, clear communication, diligence, etc.

Negative forecasting is more reliable here than positive forecasting. It’s hard to predict what’s going to be in demand in the future (besides timeless skills), but it’s easier to predict what’s probably not going to be in demand. The latter aligns with Hamming’s exhortation not to prepare students for your past.

Changing your mind

From Dorothy Sayers’ essay Why Work?

It is always strange and painful to have to change a habit of mind; though, when we have made the effort, we may find a great relief, even a sense of adventure and delight, in getting rid of the false and returning to the true.

Cauchy, Benford, and a problem with NHST

Introduction

Samples from a Cauchy distribution nearly follow Benford’s law. I’ll demonstrate this below. The more data you see, the more confident you should be of this. But with a typical statistical approach, crudely applied NHST (null hypothesis significance testing), the more data you see, the less convinced you are.

This post assumes you’ve read the previous post that explains what Benford’s law is and looks at how well samples from a Weibull distribution follow that law.

This post has two purposes. First, we show that samples from a Cauchy distribution approximately follow Benford’s law. Second, we look at problems with testing goodness of fit with NHST.

Cauchy data

We can reuse the code from the previous post to test Cauchy samples, with one modification. Cauchy samples can be negative, so we have to modify our leading_digit function to take an absolute value.

      from math import floor, log10

      def leading_digit(x):
          # take the absolute value since Cauchy samples can be negative
          y = log10(abs(x)) % 1
          return int(floor(10**y))

We’ll also need to import cauchy from scipy.stats and change where we draw samples to use this distribution.

      samples = cauchy.rvs(loc=0, scale=1, size=N)

Here’s how a sample of 1000 Cauchy values compared to the prediction of Benford’s law:

|---------------+----------+-----------|
| Leading digit | Observed | Predicted |
|---------------+----------+-----------|
|             1 |      313 |       301 |
|             2 |      163 |       176 |
|             3 |      119 |       125 |
|             4 |       90 |        97 |
|             5 |       69 |        79 |
|             6 |       74 |        67 |
|             7 |       63 |        58 |
|             8 |       52 |        51 |
|             9 |       57 |        46 |
|---------------+----------+-----------|
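
Here’s a sketch of the comparison end to end, reusing the leading_digit function above. Exact counts will differ from run to run since the samples are random.

      from math import log10
      from scipy.stats import cauchy

      N = 1000
      samples = cauchy.rvs(loc=0, scale=1, size=N)
      observed = [sum(leading_digit(x) == d for x in samples) for d in range(1, 10)]
      predicted = [N * log10(1 + 1/d) for d in range(1, 10)]   # Benford's law
      for d in range(1, 10):
          print(d, observed[d-1], round(predicted[d-1]))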

Here’s a bar graph of the same data.

Bar graph of Cauchy leading digits compared to Benford’s law

Problems with NHST

A common way to measure goodness of fit is to use a chi-square test. The null hypothesis would be that the data follow a Benford distribution. We compare the chi-square statistic for the observed data to a chi-square distribution with 8 degrees of freedom (one less than the number of categories, the nine leading digits). We compute the p-value, the probability of seeing a chi-square statistic this large or larger, and reject our null hypothesis if this p-value is too small.
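
scipy can run this test directly. Here’s a sketch applied to the N = 1000 counts tabulated above; chisquare defaults to one less than the number of categories for its degrees of freedom, the 8 we want.

      from math import log10
      from scipy.stats import chisquare

      observed = [313, 163, 119, 90, 69, 74, 63, 52, 57]
      N = sum(observed)
      expected = [N * log10(1 + 1/d) for d in range(1, 10)]
      stat, p = chisquare(observed, f_exp=expected)
      print(stat, p)   # a large p-value here: no evidence against the null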

Here’s how our chi-square values and p-values vary with sample size.

|-------------+------------+---------|
| Sample size | chi-square | p-value |
|-------------+------------+---------|
|          64 |     13.542 |  0.0945 |
|         128 |     10.438 |  0.2356 |
|         256 |     13.002 |  0.1118 |
|         512 |      8.213 |  0.4129 |
|        1024 |     10.434 |  0.2358 |
|        2048 |      6.652 |  0.5745 |
|        4096 |     15.966 |  0.0429 |
|        8192 |     20.181 |  0.0097 |
|       16384 |     31.855 | 9.9e-05 |
|       32768 |     45.336 | 3.2e-07 |
|-------------+------------+---------|

The p-values eventually get very small, but they don’t decrease monotonically with sample size. This is to be expected. If the data came from a Benford distribution, i.e. if the null hypothesis were true, we’d expect the p-values to be uniformly distributed, i.e. they’d be equally likely to take on any value between 0 and 1. And not until the two largest samples do we see values that don’t look consistent with uniform samples from [0, 1].
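
A quick simulation backs this up: draw counts from the Benford distribution itself, so that the null hypothesis is true by construction, and the p-values come out roughly uniform.

      import numpy as np
      from math import log10
      from scipy.stats import chisquare

      rng = np.random.default_rng()
      probs = [log10(1 + 1/d) for d in range(1, 10)]
      pvals = [chisquare(rng.multinomial(1000, probs),
                         f_exp=[1000*q for q in probs]).pvalue
               for _ in range(1000)]
      # the p-values are roughly uniform on [0, 1]:
      print(np.mean(np.array(pvals) < 0.05))   # about 0.05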

In one sense NHST has done its job. Cauchy samples do not exactly follow Benford’s law, and with enough data we can show this. But we’re rejecting a null hypothesis that isn’t that interesting. We’re showing that the data don’t exactly follow Benford’s law rather than showing that they do approximately follow Benford’s law.

What personality classifications have in common

There are many ways to divide people into four personality types, from the classical—sanguine, choleric, melancholic, and phlegmatic—to contemporary systems such as the DISC profile. The Myers-Briggs system divides people into sixteen personality types. I just recently ran across the “enneagram,” an ancient system for dividing people into nine categories.

There’s one thing advocates of all the aforementioned systems agree on: the number of basic personality types is a perfect square.