Uncovering names masked with stars

Sometimes I’ll see things like my name partially concealed as J*** C*** and think “a lot of good that does.”

Masking letters reveals more than people realize. For example, when you see that someone’s first name is four letters and begins with J, there’s about a 70% chance they’re male and a 44% chance they’re named John. If you know this person is male, there’s a 63% chance their name is John.

If you know a man’s name has the form J***, his name isn’t necessarily John, though that’s the most likely possibility. There’s an 8% chance his name is Jack and a 6% chance his name is Joel.

All these numbers depend on the data set you’re looking at, but they’re roughly accurate for any representative sample of American names.
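
These probabilities are straightforward to estimate if you have a table of name frequencies. Here’s a minimal sketch in Python, assuming a hypothetical CSV file of name, sex, and count records; the file name and format are placeholders, not a real data source.

import re
from collections import Counter

# Hypothetical input: one "name,sex,count" record per line,
# e.g. derived from Census or Social Security name data.
counts = Counter()
with open("name_frequencies.csv") as f:
    for line in f:
        name, sex, count = line.strip().split(",")
        counts[(name.upper(), sex)] += int(count)

def name_probabilities(pattern, sex=None):
    """P(name | masked pattern like 'J***'), optionally conditioned on sex."""
    regex = re.compile("^" + pattern.replace("*", ".") + "$")
    matched = Counter()
    for (name, s), count in counts.items():
        if regex.match(name) and (sex is None or s == sex):
            matched[name] += count
    total = sum(matched.values())
    return [(name, count / total) for name, count in matched.most_common()]

print(name_probabilities("J***", sex="M")[:5])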

Some names stand out more than others. If I tell you someone’s name is E********, there’s a 90% chance the name is Elizabeth.

If I tell you someone’s name is B*****, there’s a 77% chance this person is female, but it’s harder to guess which name is hers. The most likely possibility is Brenda, but there are several other possibilities that are fairly likely: Bonnie, Brooke, Brandy, etc.

We could go through a similar exercise with last names. You can probably guess who S**** is, though C***** is not so clear.

In short, replacing letters with stars doesn’t do much to conceal someone’s name. It usually doesn’t let you infer someone’s name with certainty, but it definitely improves your chances of guessing correctly. If you have a few good guesses as to someone’s name, and some good guesses on a handful of other attributes, together you have a good chance of identifying someone.

Almost ASCII

I was working recently with a gigabyte file that had a dozen non-ASCII characters. This is very common. The ASCII character set is not quite big enough for a lot of tasks. Of course it’s completely inadequate if you’re writing Japanese, but it’s almost enough for documents written in English and a few other languages.

Efficient encoding

The world has standardized on Unicode as the way to represent characters across languages. Unicode currently has around 150,000 characters, far more than ASCII’s 128 characters.

But there’s a problem. Since 150,000 > 2¹⁷, it takes more than two bytes (eight bits to a byte) to represent each of 150,000 things. If you use three bytes to represent each character, every file that is almost all ASCII will get three times bigger. If you limit yourself to the most frequently used Unicode characters, those that can be represented with two bytes (the “basic multilingual plane”), then you still double the size of files.

Enter UTF-8, a brilliant solution to this problem. The UTF-8 encoding of an ASCII file is an ASCII file. Pure ASCII files don’t get any larger when interpreted as UTF-8 encoded Unicode. Because 128 = 2⁷, a byte representing an ASCII character has one unused bit. UTF-8 uses this unused bit to signal that what follows is not ASCII. I wrote about the full details here.

Unicode characters outside the ASCII range take 2, 3, or 4 bytes to represent. Inserting a small number of non-ASCII characters into a UTF-8 encoded Unicode file hardly changes the file’s size.
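
A couple of lines of Python illustrate both points: pure ASCII text encodes to the same bytes under UTF-8, and characters outside the ASCII range take two to four bytes each.

s = "hello"
assert s.encode("utf-8") == s.encode("ascii")  # pure ASCII is unchanged

for ch in "é€𝒳":
    print(ch, len(ch.encode("utf-8")), "bytes")  # 2, 3, and 4 bytes respectively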

Troubleshooting

I mentioned at the top that I had a gigabyte file with a dozen non-ASCII characters. The command file -I reported the file encoding to be ASCII, because the vast majority of the file was ASCII. But the non-ASCII characters were not valid Unicode characters either.

These invalid Unicode characters would display as �, which is not actually in the file. The � is a valid Unicode character for representing an invalid Unicode character.

Some of the non-ASCII characters were extended ASCII (Windows 1252) characters, but if I remember correctly even that didn’t account for everything. Some of the odd characters were simply file corruption.

It’s kinda interesting how some tools are robust to these kinds of glitches and some are not. My first clue that something funny was going on was when sort refused to sort. I ran a Python script that helps me fix wonky text files and it threw an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 222662: invalid start byte

This may seem like gibberish, but it actually says exactly what’s going on. There was an error interpreting the file as Unicode, because 0x92 is not a valid way to start a non-ASCII character in UTF-8.

The first bit of an ASCII character is 0. The first two bits of a byte that begins a non-ASCII character in UTF-8 are 11. But the hex digit 9 is 1001 in binary, so the byte 0x92 begins with the bits 10. That makes 0x92 neither an ASCII character nor the beginning of a UTF-8 non-ASCII sequence of bytes. More details here.
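
Here’s a tiny Python sketch of that byte-level classification:

def classify(byte):
    if byte >> 7 == 0b0:
        return "ASCII"
    elif byte >> 6 == 0b10:
        return "UTF-8 continuation byte (invalid as a start byte)"
    else:  # the byte starts with the bits 11
        return "start of a multi-byte UTF-8 sequence"

# 0x92 = 10010010 in binary, so it's a continuation byte,
# which is why the decoder complained about an invalid start byte.
print(hex(0x92), classify(0x92))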

Removing non-ASCII characters

For my application I could just remove the invalid characters using iconv with the -c option.

iconv -c -f CP1252 -t UTF-8 inputfile > outputfile

If you need to salvage troublesome characters then things are a little more complicated. The iconv utility will work if you know what the intended encoding was. If you don’t know the intended encoding, you may need to do some detective work.
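
If the stray bytes really are Windows 1252, one way to salvage them in Python is to try UTF-8 first and fall back to CP1252. This is only a sketch: the file names are placeholders and the fallback encoding is an assumption you would have to verify.

# Decode as UTF-8 if possible; otherwise assume CP1252 (an assumption to verify).
raw = open("inputfile", "rb").read()
try:
    text = raw.decode("utf-8")
except UnicodeDecodeError:
    text = raw.decode("cp1252", errors="replace")

with open("outputfile", "w", encoding="utf-8") as out:
    out.write(text)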

A knight’s tour of an infinite chessboard

Let ℤ² be the lattice of points in the plane with integer coordinates. You could think of these points as being the centers of the squares in a chessboard extending to infinity in every direction.

Cantor tells us that the points in ℤ² are countable. What’s more surprising is that you could count the points using a knight’s move. If you place a knight at the origin, there is a way for him to visit every point exactly once.

It’s possible to tour every point in a 5 × 5 lattice as shown below.

The knight’s next move would take it out of the 5 × 5 block and position it to begin touring a new 5 × 5 block bordering the first block.

You can repeat this pattern, covering ℤ² in a spiral of 5 × 5 squares.
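
If you want to verify that a knight’s tour of a 5 × 5 block exists, here’s a minimal backtracking search in Python. It finds some tour starting from a corner, not necessarily the one pictured.

# Find an open knight's tour of a 5 x 5 board by backtracking,
# starting at the corner (0, 0).
N = 5
MOVES = [(1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def tour(path, visited):
    if len(path) == N * N:
        return path
    x, y = path[-1]
    for dx, dy in MOVES:
        nx, ny = x + dx, y + dy
        if 0 <= nx < N and 0 <= ny < N and (nx, ny) not in visited:
            visited.add((nx, ny))
            result = tour(path + [(nx, ny)], visited)
            if result:
                return result
            visited.remove((nx, ny))
    return None

print(tour([(0, 0)], {(0, 0)}))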

I produced the first two images using Quiver, an application intended for drawing commutative diagrams, though I’ve found it useful for other drawing tasks as well.

The third image was taken from a note by Robert E. Gilman on an article by Norman Anning: A Recreation. The American Mathematical Monthly, Vol. 37, No. 10 (Dec., 1930), pp. 535–538.

Natural one-liners

I learned to use Unix in college—this was before Linux—but it felt a little mysterious. Clearly it was developed by really smart people, but what were the problems that motivated their design choices?

Some of the explanations are widely repeated. For example, commands have terse names because you may have had to transmit them over a glacial 300 baud network connection.

OK, but why are there so many tools for munging text files, for example? That’s great if your job requires munging text files, but what about everything else? What I didn’t realize at the time was that nearly everything involves munging text files, or can be turned into a problem involving munging text files.

Working with data at the command line

There’s an old joke that Unix is user friendly, it’s just picky about who its friends are. I’d rephrase to say Unix makes more sense when you’re doing the kind of work the Unix developers were doing.

I was writing programs when I learned Unix, so some things about Unix made sense at the time. But I didn’t see the motivation for many of the standard command line tools until I started analyzing datasets years later. I thought awk was cool—it was the first scripting language I encountered—but it wasn’t until years later that I realized awk is essentially a command line spreadsheet. It was written for manipulating tabular data stored in text files.

Mythological scripts

Unix one-liners are impressive, but they can seem like a rabbit out of a hat. How would anyone think to do that?

When you develop your own one-liners, one piece at a time, they seem much more natural. You get a feel for how the impressive one-liners you see on display were developed incrementally. They almost certainly did not pop into the world fully formed like Athena springing from the head of Zeus.

Example: Voter registration data

Here’s an example. I was looking at Washington state voter registration data. There’s a file 20240201_VRDB_Extract.txt. What’s in there?

The first line of a data file often contains column headers. Looking at just the first few lines of a file is a perennial task, so there’s a tool for that: head. By default it shows the first 10 lines of a file. We just want to see the first line, and there’s an option for that: -n 1.

> head -n 1 20240201_VRDB_Extract.txt

StateVoterID|FName|MName|LName|NameSuffix|Birthyear|Gender|RegStNum|RegStFrac|RegStName|RegStType|RegUnitType|RegStPreDirection|RegStPostDirection|RegStUnitNum|RegCity|RegState|RegZipCode|CountyCode|PrecinctCode|PrecinctPart|LegislativeDistrict|CongressionalDistrict|Mail1|Mail2|Mail3|MailCity|MailZip|MailState|MailCountry|Registrationdate|LastVoted|StatusCode

Inserting line breaks

OK, those look like column headers, but they’re hard to read. It would be nice if we could replace all the pipe characters used as field separators with line breaks. There’s a command for that too. The sed tool lets you, among other things, replace one string with another. The tiny sed program

s/|/\n/g

does just what we want. It may look cryptic, but it’s very straightforward. The “s” stands for substitute. The program s/foo/bar/ substitutes the first instance of foo with bar. If you want to replace all instances, you tack a “g” onto the end for “global.”

Eliminating temporary files

We could save our list of column headings to a file, and then run sed on the output, but that creates an unnecessary temporary file. If you do this very much, you get a lot of temporary files cluttering your working area, say with names like temp1 and temp2. Then after a while you start to forget what you named each intermediary file.

It would be nice if you could connect your processing steps together without having to create intermediary files. And that’s just what pipes do. So instead of saving our list of column headers to a file, we pipe it through to sed.

> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g'

StateVoterID
FName
MName
LName
NameSuffix
Birthyear
...

Scrolling and searching

This is much better. But it produces more output than you may be able to see in your terminal. You could view the list one screenful at a time by piping the output to less.

> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g' | less

This file only has 33 columns, but it’s not uncommon for a data file to have hundreds of columns. Suppose there were more columns than you wanted to scan through, and you wanted to know whether one of the columns contained a zip code. You could do that by piping the output through grep to look for “zip.”

> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g' | grep -i zip

There are no column headings containing “zip”, but there are a couple containing “Zip.” The -i option (for case insensitive) finds the zip code columns.

RegZipCode
MailZip

Our modest little one-liner now has three segments separated by pipes. It might look impressive to someone new to working this way, but it’s really just stringing common commands together in a common way.

A famous one-liner

When you see a more complicated one-liner like

tr -cs A-Za-z '
' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

you can imagine how it grew incrementally. Incidentally, the one-liner above is somewhat famous. You can find the story behind it here.

When is less data less private?

If I give you a database, I give you every row in the database. So if you delete some rows from the database, you have less information, not more, right?

This seems very simple, and it mostly is, but there are a couple subtleties.

A common measure in data privacy is k-anonymity. The idea is that if at least k individuals in a data set share some set of data values, and k is large enough, then the privacy of those individuals is protected.
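
As a concrete illustration, here’s a minimal Python sketch of computing the k for a table: group the records by their quasi-identifiers and take the size of the smallest group. The records and column names below are made up.

from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same
    quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

records = [
    {"zip": "77005", "birth_year": 1972, "sex": "F"},
    {"zip": "77005", "birth_year": 1972, "sex": "F"},
    {"zip": "77030", "birth_year": 1985, "sex": "M"},
]
print(k_anonymity(records, ["zip", "birth_year"]))  # 1: the last record is unique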

Now suppose you randomly select a single record from a database that was deemed deidentified because it satisfied k-anonymity with k = 10. Your new dataset, consisting of only one record, is k-anonymous with k = 1: every record is unique because there’s only one record. But how is this person’s data any less private than it was before?

Note that I said above that you selected a record at random. If you selected the row using information that you know but which isn’t in the database, you might have implicitly added information. But if you select a subset of data, using only information explicit in that data, you haven’t added information.

Here’s where k-anonymity breaks down. The important measure is k-anonymity in the general population, not k-anonymity in a data set, unless you know that someone is in the data set.

If you find someone named John Cook in a data set, you probably haven’t found my information, even if there is only one person by that name in the data set. My name may or may not be common in that particular data set, but my name is common in general.

The number of times a combination of data fields appears in a data set gives a lower bound on how often the combination appears in general, so k-anonymity in a data set is a good sign for privacy, but the lack of k-anonymity is not necessarily a bad sign. The latter could just be an artifact of having a small data set.

Additive functions

A function f from positive integers to real numbers is defined to be additive if for relatively prime numbers m and n,

f(mn) = f(m) + f(n).

The function f is called completely additive if the above holds for all positive integers m and n, i.e. we drop the requirement that m and n are relatively prime.

Example: total prime factors

One example of an additive function is the function Ω(n) defined to be the number of prime factors of n, counted with multiplicity. For example, Ω(12) = 3 because 12 = 2 × 2 × 3. The numbers 10 and 63 are relatively prime, and

Ω(630) = 5 = Ω(10) + Ω(63).

Example: distinct prime factors

Another example of an additive function is ω(n) defined to be the number of distinct prime factors of n, i.e. not counting with multiplicity. So, for example, ω(12) = 2.

This function is additive but not completely additive because, for example,

ω(20) = 2 ≠ ω(2) + ω(10) = 3.
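
If you want to check examples like these, both functions are easy to compute with SymPy’s factorint:

from sympy import factorint

def big_omega(n):
    """Number of prime factors of n, counted with multiplicity."""
    return sum(factorint(n).values())

def little_omega(n):
    """Number of distinct prime factors of n."""
    return len(factorint(n))

print(big_omega(630), big_omega(10) + big_omega(63))         # 5 5
print(little_omega(20), little_omega(2) + little_omega(10))  # 2 3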

A theorem of Erdős

Here is a remarkable theorem due to Paul Erdős [1]. Suppose f is an additive function such that

f(n + 1) − f(n)

converges to zero as n goes to infinity. Then

f(n) = c log(n)

for some constant c. And since a multiple of a logarithm is a logarithm to a different base, we can restate the conclusion by simply saying f is a logarithm.

Logarithms are completely additive functions, so even though we only assumed f was additive, this combined with the limit condition proves that in fact f is completely additive.

[1] Paul Erdős, “On the distribution function of additive functions,” Ann. of Math., Vol. 47 (1946), pp. 1–20.

Frequency analysis

Suppose you have a list of encrypted surnames of US citizens. If the list is long enough, the encrypted name that occurs most often probably corresponds to Smith. The second most common encrypted name probably corresponds to Johnson, and so forth. This kind of inference is analogous to solving a cryptogram puzzle by counting letter frequencies.

The probability of correctly guessing the most common names based on frequency analysis depends critically on the sample size. In a small sample, there may be no Smiths. In a larger sample, the name Smith may be common, but not the most common.

I did some simulations to estimate how well frequency analysis would work at identifying the 10 most common names as a function of the sample size N. For each N, I simulated 100 data sets using probabilities derived from surname frequencies in US Census Bureau data.
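
Here is a minimal sketch of that kind of simulation. The names and probabilities below are placeholders; in the actual simulation they came from the Census surname frequencies, sorted from most to least common.

import numpy as np

rng = np.random.default_rng()

def rank_match_rates(names, probs, N, k, reps=100):
    """Fraction of simulated samples of size N in which the i-th most common
    name in the population (names[i], with names sorted by decreasing
    probability) is also the i-th most common name in the sample."""
    hits = np.zeros(k)
    for _ in range(reps):
        sample = rng.choice(names, size=N, p=probs)
        values, counts = np.unique(sample, return_counts=True)
        ranked = values[np.argsort(-counts)]
        for i in range(min(k, len(ranked))):
            if ranked[i] == names[i]:
                hits[i] += 1
    return hits / reps

# Placeholder population: three names with made-up probabilities.
print(rank_match_rates(["SMITH", "JOHNSON", "WILLIAMS"], [0.5, 0.3, 0.2], N=1000, k=3))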

When N = 1,000, there was a 53% chance that the most common name in the population, Smith, would be the most common name in the sample. The second most common name in the population, Johnson, was the second most common name in the sample only 14% of the time.

When N = 10,000, there was a 94% chance of identifying Smith, and at least a 30% chance of correctly identifying each of the five most common names.

When N = 1,000,000, the three most common names were identified every time in the simulation, and each of the 10 most common names was correctly identified most of the time. In fact, the 18 most common names were correctly identified most of the time.

A consequence of this analysis is that hashing names does not protect privacy if the sample size is large. Hashing names along with other information, so that the combined data has a more uniform distribution, may protect privacy.

Security by obscurity

Security-by-obscurity is a bad idea in general. It’s better, for example, to have a login page than to give your site an obscure URL. It’s better to encrypt a file than to hide it in some odd directory. It’s better to use a well-vetted encryption algorithm than to roll your own.

Then there are people whose knee-jerk reaction to any form of obscurity is to shout “That’s security-by-obscurity!” But obscurity can be subtle.

All else being equal, adding a layer of obscurity doesn’t hurt. For example, you can literally make a public encryption key public, as I’ve done here. But for extra security, why distribute your encryption key more widely than necessary? And if your message is adequately encrypted, you could in principle publish it for the world to see. But why not just give it to the intended recipient?

The public key on my site is there for strangers to contact me, but if I were really concerned about secure communication between colleagues, I’d just circulate the key among those colleagues. That may not be much more secure, but surely it’s no less secure. And I’d share messages privately, even though they are encrypted.

It’s good to look closely at any argument that begins “all else being equal” to see if all else is indeed equal. A more nuanced objection to security-by-obscurity is that it can create a false sense of security.

One could argue, for example, that making your public key available to the world forces you to be more careful about your encryption. Maybe you’ve been using an RSA key for years, and you really should use a longer key, but you don’t because you can argue that not many people have your public key anyway. But if your key’s too short, obscuring your public key doesn’t help.

And while it’s better to deliver encrypted messages privately, it helps to not count on this, to assume that the encrypted message might be made public. That’s the basic premise behind encryption.

The principle behind no-security-by-obscurity is that you want to concentrate your security where it can be quantified. You can, for example, quantify how much more effort it would take to break a 64-bit key (like Blowfish) than a 56-bit key (like DES): the 64-bit key space is 2⁸ = 256 times larger. Or even better, a 128-bit key (like AES). But you can’t quantify the level of protection that comes from obscurity.

Is it more secure to give someone a 56-bit DES key on a flash drive in a dark alley than to send them a 64-bit Blowfish key over SMS? You can’t calculate an answer to that question.

In some sense all security is by obscurity. Cryptography literally means hidden writing. But all else being equal—there’s that phrase again—you want to minimize the surface area of what you have to obscure, e.g. limiting your secret to your key and not your algorithm, and it’s better to have quantified risks than unquantified risks. But all else is often not equal, and there are difficult trade-offs.

Advanced questions about a basic diagram

Unit circle trig diagram

I saw a hand-drawn version of the diagram above yesterday and noticed that the points were too evenly distributed. That got me to thinking: is there any objective way to say that this famous diagram is in some sense complete? If you were to make a diagram with more points, what would they be?

Simple numbers

The numbers on the diagram are all simple. Once we’re more precise about what it means for these numbers to be “simple,” we can answer the questions above.

The angles in the diagram are all rational parts of a circle, that is, rational multiples of 2π. For the rest of the post, I’ll say “rational angle” to mean a rational multiple of 2π.

The sines and cosines all involve only one square root, i.e. no nested roots. A more useful way to express this is that all the values are the roots of a quadratic polynomial with integer coefficients.

Completeness

Could we add more rational angles whose sines and cosines are roots of quadratics? Maybe the chart would be too cluttered to put in a textbook, but would it be possible in principle? Could there be some chart analogous to the one above that has, for example, (1 + √7)/5 as one of the labels?

The angles in the common unit circle diagram, integer multiples of π/4 and π/6, are the only rational angles with sines and cosines that are roots of a quadratic polynomial with integer coefficients. That is, these are the only rational angles that have sines and cosines that are algebraic of degree 2. In that sense the diagram is complete.

The number (1 + √7)/5 is algebraic of degree 2 [1] but isn’t on our exhaustive list of possible algebraic values of degree 2. So even if you were to try numbers of the form pπ/q for very large integers p and q, you’ll never get a sine or cosine equal to (1 + √7)/5.

In 1933 Lehmer [2] showed how to classify all rational angles whose sines or cosines are algebraic of given degree. His theorem proves that the only rational angles whose sine is algebraic of degree 2 are integer multiples of π/4 and π/6.

Interestingly, there is another rational angle whose cosine is algebraic of degree 2:

cos(π/5) = (1 + √5)/4

So we could extend the unit circle diagram to include multiples of π/5, but only the cosine would be algebraic of degree 2. The sines are more complicated. For example,

sin(π/5) = √(5/8 − √(5)/8)

which is algebraic of degree 4.
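
If you want to check the degree of numbers like these, SymPy can compute minimal polynomials. A quick sketch:

from sympy import pi, sin, cos, minimal_polynomial, Symbol

x = Symbol("x")
print(minimal_polynomial(cos(pi/5), x))  # 4*x**2 - 2*x - 1: degree 2
print(minimal_polynomial(sin(pi/5), x))  # 16*x**4 - 20*x**2 + 5: degree 4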

Higher degrees

There are no rational angles whose sine is algebraic of degree 3, so going up to degree 3 wouldn’t help.

If we go up to degree 4 then we could add multiples of π/5, π/8, and π/12. These all have sines and cosines that are algebraic of degree 4.

[1] (1 + √7)/5 is a root of 25x² − 10x − 6 = 0.

[2] D. H. Lehmer. A Note on Trigonometric Algebraic Numbers. The American Mathematical Monthly, Vol. 40, No. 3 (March 1933), pp. 165–166.

How much metadata is in a photo?

A few days ago I wrote about the privacy implications of metadata in a PDF. This post will do the same for photos.

Dalek on a Seattle train

You can see the metadata in a photo using exiftool. By default cameras include time and location data. I ran this tool on a photo I took in Seattle a few years ago when I was doing some work for Amazon. The tool reported 114 fields, some of which are redundant. Here is some of the information contained in the metadata.

GPS Altitude  : 72.5 m Above Sea Level
GPS Date/Time : 2017:05:05 17:47:33.31Z
GPS Position  : 47 deg 36' 39.71" N, 122 deg 19' 59.40" W
Lens ID       : iPhone SE back camera 4.15mm f/2.2

How finely does this specify the location? The coordinates are given to 1/100 of a second, so 1/360000 of a degree. A degree of latitude is 111 km, so the implied accuracy is on the order of 30 cm or one foot, whether that’s correct or not.

You can look up the ground level at that location: it is 46 meters above sea level, which would imply the photo was taken on the 8th floor of a building. (It clearly wasn’t. Either the elevation of ground level or the elevation recorded in the phone isn’t correct.)

When I cropped the image, the edited image contained the software and operating system that was used to edit it.

Platform    : Linux
Software    : GIMP 2.10.30
Modify Date : 2024:02:13 08:39:49

This shows that I edited the image this morning using GIMP installed on a Linux box.
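
If you’d rather inspect this from a script than from the command line, the Pillow library in Python can dump EXIF tags too. A minimal sketch, where the file name is a placeholder and exiftool will generally report more fields than this:

from PIL import Image, ExifTags

img = Image.open("photo.jpg")  # placeholder file name
for tag_id, value in img.getexif().items():
    # Translate numeric tag IDs into readable names where possible.
    print(ExifTags.TAGS.get(tag_id, tag_id), ":", value)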

You can change your phone’s settings to not include location data in photos. If you do, the photos may still include the time zone, which is a weak form of location data. You can remove some or all the metadata later using image editing software, but by default a photo reveals more than you may intend.
