Uncovering names masked with stars

Sometimes I’ll see things like my name partially concealed as J*** C*** and think “a lot of good that does.”

Masking letters reveals more than people realize. For example, when you see that someone’s first name is four letters and begins with J, there’s about a 70% chance they’re male and a 44% chance they’re named John. If you know this person is male, there’s a 63% chance his name is John.

If you know a man’s name has the form J***, his name isn’t necessarily John, though that’s the most likely possibility. There’s an 8% chance his name is Jack and a 6% chance his name is Joel.

All these numbers depend on the data set you’re looking at, but they’re roughly accurate for any representative sample of American names.
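If you have a table of name frequencies, these conditional probabilities are easy to compute. Here’s a minimal sketch in Python. The counts below are made up for illustration, chosen to roughly match the percentages above.

    # Hypothetical counts for four-letter names beginning with J
    name_freq = [
        ("John", "M", 4400), ("Jack", "M", 560), ("Joel", "M", 420),
        ("Jose", "M", 1620), ("Jane", "F", 1500), ("June", "F", 900),
        ("Jill", "F", 600),
    ]

    def p_name(length, first_letter, sex=None):
        """P(name | name length, first letter, and optionally sex)."""
        matches = [(n, c) for (n, s, c) in name_freq
                   if len(n) == length and n.startswith(first_letter)
                   and (sex is None or s == sex)]
        total = sum(c for _, c in matches)
        return {n: round(c / total, 2) for (n, c) in matches}

    print(p_name(4, "J"))           # P(name | J***)
    print(p_name(4, "J", sex="M"))  # P(name | J***, male)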

Some names stand out more than others. If I tell you someone’s name is E********, there’s a 90% chance the name is Elizabeth.

If I tell you someone’s name is B*****, there’s a 77% chance this person is female, but it’s harder to guess which name is hers. The most likely possibility is Brenda, but there are several other possibilities that are fairly likely: Bonnie, Brooke, Brandy, etc.

We could go through a similar exercise with last names. You can probably guess who S**** is, though C***** is not so clear.

In short, replacing letters with stars doesn’t do much to conceal someone’s name. It usually doesn’t let you infer someone’s name with certainty, but it definitely improves your chances of guessing correctly. If you have a few good guesses as to someone’s name, and some good guesses on a handful of other attributes, together you have a good chance of identifying someone.

When is less data less private?

If I give you a database, I give you every row in the database. So if you delete some rows from the database, you have less information, not more, right?

This seems very simple, and it mostly is, but there are a couple of subtleties.

A common measure in data privacy is k-anonymity. The idea is that if at least k individuals in a data set share some set of data values, and k is large enough, then the privacy of those individuals is protected.
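As a concrete illustration, here’s a minimal sketch of how you might compute the k in k-anonymity for a data set, given a choice of quasi-identifier fields. The records below are hypothetical.

    from collections import Counter

    def k_anonymity(records, quasi_identifiers):
        """Smallest group size over all combinations of quasi-identifier values."""
        counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return min(counts.values())

    # Hypothetical records: 2-anonymous with respect to (zip, sex)
    records = [
        {"zip": "77005", "sex": "M", "diagnosis": "flu"},
        {"zip": "77005", "sex": "M", "diagnosis": "asthma"},
        {"zip": "77005", "sex": "F", "diagnosis": "flu"},
        {"zip": "77005", "sex": "F", "diagnosis": "diabetes"},
    ]
    print(k_anonymity(records, ["zip", "sex"]))  # 2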

Now suppose you randomly select a single record from a database that was deemed deidentified because it satisfied k-anonymity with k = 10. Your new data set, consisting of only one record, is k-anonymous with k = 1: every record is unique because there’s only one record. But how is this person’s data any less private than it was before?

Note that I said above that you selected a record at random. If you selected the row using information that you know but which isn’t in the database, you might have implicitly added information. But if you select a subset of data, using only information explicit in that data, you haven’t added information.

Here’s where k-anonymity breaks down. The important measure is k-anonymity in the general population, not k-anonymity in a data set, unless you know that someone is in the data set.

If you find someone named John Cook in a data set, you probably haven’t found my information, even if there is only one person by that name in the data set. My name may or may not be common in that particular data set, but my name is common in general.

The number of times a combination of data fields appears in a data set gives a lower bound on how often that combination appears in the general population, so k-anonymity in a data set is a good sign for privacy, but the lack of k-anonymity is not necessarily a bad sign. The latter could just be an artifact of having a small data set.

Frequency analysis

Suppose you have a list of encrypted surnames of US citizens. If the list is long enough, the encrypted name that occurs most often probably corresponds to Smith. The second most common encrypted name probably corresponds to Johnson, and so forth. This kind of inference is analogous to solving a cryptogram puzzle by counting letter frequencies.

The probability of correctly guessing the most common names based on frequency analysis depends critically on the sample size. In a small sample, there may be no Smiths. In a larger sample, the name Smith may be common, but not the most common.

I did some simulations to estimate how well frequency analysis would work at identifying the 10 most common names as a function of the sample size N. For each N, I simulated 100 data sets using probabilities derived from surname frequencies in US Census Bureau data.
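Here’s a sketch of the kind of simulation I mean. The frequencies below are hypothetical stand-ins for the full Census surname table, with “Other” lumping together everyone outside the top five names.

    import numpy as np

    rng = np.random.default_rng()

    # Hypothetical stand-ins for Census surname frequencies, in descending order
    names = np.array(["Smith", "Johnson", "Williams", "Brown", "Jones", "Other"])
    probs = [0.0088, 0.0069, 0.0061, 0.0054, 0.0053, 0.9675]

    def hit_rate(N, rank=0, trials=100):
        """Fraction of trials in which the sample's name at the given rank
        matches the population's name at that rank."""
        hits = 0
        for _ in range(trials):
            sample = rng.choice(names, size=N, p=probs)
            values, counts = np.unique(sample, return_counts=True)
            keep = values != "Other"  # rank only the named surnames
            order = values[keep][np.argsort(-counts[keep])]
            hits += order.size > rank and order[rank] == names[rank]
        return hits / trials

    print(hit_rate(1_000))          # chance Smith tops a sample of 1,000
    print(hit_rate(1_000, rank=1))  # chance Johnson comes in second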

When N = 1,000, there was a 53% chance that the most common name in the population, Smith, would be the most common name in the sample. The second most common name in the population, Johnson, was the second most common name in the sample only 14% of the time.

When N = 10,000, there was a 94% chance of identifying Smith, and at least a 30% chance of identifying the five most common names.

When N = 1,000,000, the three most common names were identified every time in the simulation. And each of the 10 most common names was correctly identified most of the time. In fact, the 18 most common names were correctly identified most of the time.

A consequence of this analysis is that hashing names does not protect privacy if the sample size is large. Hashing names along with other information, so that the combined data has a more uniform distribution, may protect privacy.

How much metadata is in a photo?

A few days ago I wrote about the privacy implications of metadata in a PDF. This post will do the same for photos.

Dalek on a Seattle train

You can see the metadata in a photo using exiftool. By default cameras include time and location data. I ran this tool on a photo I took in Seattle a few years ago when I was doing some work for Amazon. The tool reported 114 fields, some of which are redundant. Here is some of the information contained in the metadata.

    GPS Altitude  : 72.5 m Above Sea Level
    GPS Date/Time : 2017:05:05 17:47:33.31Z
    GPS Position  : 47 deg 36' 39.71" N, 122 deg 19' 59.40" W
    Lens ID       : iPhone SE back camera 4.15mm f/2.2

How finely does this specify the location? The coordinates are given to 1/100 of a second, so 1/360000 of a degree. A degree of latitude is 111 km, so the implied accuracy is on the order of 30 cm or one foot, whether that’s correct or not.

You can look up the ground elevation at that location: 46 meters above sea level. The difference would imply the photo was taken on the 8th floor of a building. (It clearly wasn’t. Either the ground elevation figure or the elevation recorded by the phone isn’t correct.)
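If you’d rather read the metadata in code, the Pillow library can extract EXIF data in a few lines. A sketch, assuming Pillow is installed and the photo is named photo.jpg:

    from PIL import Image
    from PIL.ExifTags import TAGS, GPSTAGS

    img = Image.open("photo.jpg")  # hypothetical file name
    exif = img.getexif()
    for tag_id, value in exif.items():
        print(TAGS.get(tag_id, tag_id), ":", value)

    # GPS coordinates live in their own sub-directory, IFD tag 0x8825
    gps = exif.get_ifd(0x8825)
    for tag_id, value in gps.items():
        print(GPSTAGS.get(tag_id, tag_id), ":", value)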

When I cropped the image, the edited image contained the software and operating system that was used to edit it.

    Platform    : Linux
    Software    : GIMP 2.10.30
    Modify Date : 2024:02:13 08:39:49

This shows that I edited the image this morning using GIMP installed on a Linux box.

You can change your phone’s settings to not include location data in photos. If you do, the photos may still include the time zone, which is a weak form of location data. You can remove some or all the metadata later using image editing software, but by default a photo reveals more than you may intend.


Your PDF may reveal more than you intend

When you create a PDF file, what you see is not all you get. There is metadata embedded in the file that might be useful. It also might reveal information you’d rather not reveal.

The previous post looked at just the time stamp on a file. This post will look at more metadata, focusing on privacy implications.

Inspecting metadata

Here’s a little Python script we’ll use to inspect some of the metadata in a PDF. I say some because this does not pick out everything in every PDF.

    from pypdf import PdfReader

    def print_metadata(filename):
        print("File: ", filename, "\n")
        reader = PdfReader(filename)
        meta = reader.metadata  # the document information dictionary; may be None
        for m in meta:
            print(m, meta[m])

Let’s run this on the “Hello world” example from the previous post.

    File:  humpty.pdf

    /Creator Writer
    /Producer LibreOffice 7.5
    /CreationDate D:20240208064322-06'00'

OK, so this shows that the file was created with LibreOffice Writer, version 7.5.

Time and location

It also shows when the file was written. As I discussed in the previous post, the file was written today at 6:43:22. But what I didn’t comment on before was the -6'00' at the end. This is my time zone, six hours behind GMT, i.e. US Central Standard Time.

Note that the time zone isn’t just time information, it’s also location information. It’s no secret that I live in Houston, but if I didn’t want to reveal my location, this time stamp would partially give away where I live. (Probably. Strictly speaking it reveals the time zone setting on my computer.)

Microsoft Word files

I repeated my “Hello world” file experiment with Microsoft Word on an old laptop. When I exported to PDF I got the following.

    /Author John Cook
    /Creator Microsoft® Word 2016
    /CreationDate D:20240208101055-06'00'
    /ModDate D:20240208101055-06'00'
    /Producer Microsoft® Word 2016

So this includes my name. The installation program for Microsoft Office asks for your name, and I must have provided it. Either LibreOffice doesn’t ask or I didn’t enter it.

When I print to PDF rather than export to PDF I get slightly different output.

    /Author John
    /CreationDate D:20240208101220-06'00'
    /ModDate D:20240208101220-06'00'
    /Producer Microsoft: Print To PDF
    /Title Microsoft Word - Document1

LaTeX files

Now let’s look at a PDF created from a LaTeX file. I created a file foo.tex with the following content

    \documentclass{article}
    \begin{document}
    Hello world.
    \end{document}

then compiled it with pdflatex foo.tex. Let’s see what metadata our Python code can find.

    /Producer pdfTeX-1.40.25
    /Creator TeX
    /CreationDate D:20240208075059-06'00'
    /ModDate D:20240208075059-06'00'
    /Trapped /False
    /PTEX.Fullbanner This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023/MacPorts 2023.66589_1) kpathsea version 6.3.5

Obviously the file was created with TeX [1]. You can usually identify TeX files by their appearance. You can make a TeX file look less distinctive by changing the default font and a few other things. But if you did so without changing the metadata, someone could still determine that the file was made using TeX.

I’m not trying to conceal that I use LaTeX. But if you create a PDF with an obscure program, maybe that reveals more than you’d like to reveal.

Operating system

You can see that the file was produced on a Mac. When I compiled the same file on my Linux desktop, it showed the operating system as Debian but was not any more specific.

When you see that a file was created using Microsoft Word, it was probably created on Windows. I don’t have Word on my Mac, but I wouldn’t be surprised if the application was reported to be something like Office for MacOS rather than just Word.

I created a document with Microsoft 365 online and it reported the following.

    /Author John Cook
    /Creator Microsoft Word
    /CreationDate D:20240208084209-08'00'
    /ModDate D:20240208084209-08'00'

The lack of an operating system in the Creator field may indicate that the document was created online. Note that the time zone is −8, i.e. Pacific Standard Time. This isn’t my time zone but the time zone of the server, perhaps in Seattle.

[1] LaTeX is written on top of TeX. The metadata says the file was created with TeX, because ultimately it really was.

The Five Safes data privacy framework

The Five Safes decision framework was created a couple decades ago by Felix Ritchie at the UK Office for National Statistics. It is a framework for evaluating the safe use of confidential data, particularly by government agencies. You can find a description of the Five Safes, for example, in NIST SP 800-188.

The Five Safes are

  1. Safe projects
  2. Safe people
  3. Safe settings
  4. Safe data
  5. Safe outputs

Safe projects asks whether the use of the data is appropriate. It doesn’t matter how safe the access controls and so forth are if the project itself is inappropriate.

Safe people asks whether the users can be trusted to use the data in an appropriate manner. For health care data, for example, one could ask whether users have had HIPAA training.

Safe settings applies to physical access. Does the facility hosting the data limit unauthorized access?

Safe data asks about statistical disclosure control, whether the data itself poses a disclosure risk. For example, have the data been adequately deidentified?

Safe outputs asks whether the output of the project poses a privacy risk.

Various approaches to data privacy have different trade-offs between the Five Safes. Differential privacy focuses on safe outputs. There are mathematical guarantees that the outputs satisfy a certain definition of privacy. The data itself is regarded as unsafe, and so it is important that the people and settings are safe.

HIPAA expert determination focuses on safe data. Often there is a sort of firewall with data considered safe on one side for one set of reasons (patient consent, a BAA contract, etc.) and considered safe on the other side of the wall because the data itself is safe, i.e. properly deidentified.

Safe Harbor is unrelated to the Five Safes. Safe Harbor is a provision under the HIPAA Privacy Rule for deeming certain data safe. Whether the Safe Harbor rules actually result in safe data depends on context. Data may comply with the letter of the law appealing to Safe Harbor, and yet not protect individuals in the data from being identified.

If you would like help evaluating the privacy aspects of a data analysis project, let’s talk.

Weak encryption and surveillance

Two of the first things you learn in cryptography are that simple substitution ciphers are very easy to break, and that security by obscurity is a bad idea. This post will revisit both of these ideas.

Security depends on your threat model. If the threat you want to protect against is a human reading your encrypted messages, then simple substitution ciphers are indeed weak. Anyone capable of working a cryptogram in a puzzle book is capable of breaking your encryption.

But if your threat model is impersonal surveillance, things are different. It might not take much to thwart a corporation scanning your email for ideas of what to advertise to you. Even something as simple as rot13 might be enough.

The point of this article is not to recommend rot13, or any other simple substitution cipher, but to elaborate on the idea that evading commercial surveillance is different from evading intelligence agencies. It’s easier to evade bots than spooks.

If you’re the target of a federal investigation, the most sophisticated encryption at your disposal may be inadequate. But if you’re the target of an advertising company, things are much easier.

Rot13

Rot13 works by moving letters ahead 13 positions in the alphabet. The nth letter goes to the (n + 13)th letter mod 26. So A’s become N’s, and N’s become A’s, etc. Rot13 offers no security at all, but it is used to prevent text from being immediately recognizable. A common application is to publish the punchline of jokes.

Rot13 converts the word “pyrex” to “clerk.” If you use the word “pyrex” in an email, but rot13 encrypt your message, you’re unlikely to see advertisements for glassware unless you also talk about a clerk.
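Python ships with a rot13 codec, so you can check this for yourself:

    import codecs

    print(codecs.encode("pyrex", "rot_13"))  # clerk
    print(codecs.encode("clerk", "rot_13"))  # pyrex: rot13 is its own inverse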

It is conceivable, but doubtful, that impersonal surveillance software would try to detect rot13 encoding, even though that would be easy to do. But detecting and decrypting simple substitution ciphers in general, while certainly possible, would not be worth the effort. Security by obscurity could protect you from surveillance because it’s not profitable for mass surveillance to pursue anything obscure. For example, the Playfair cipher, broken over a century ago, would presumably throw off bots.

Modern but deprecated encryption

Simple substitution ciphers are ancient. Modern encryption methods, even deprecated methods like DES, are far more secure. For example, DES (Data Encryption Standard) is considered obsolete because it can be broken by a multiprocessor machine in under 24 hours. However, commercial surveillance is unwilling to spend processor-days to decrypt one person’s communication.

This is not to recommend DES. If it’s just as easy to use much better algorithms like AES (Advanced Encryption Standard) then why not do that? My purpose in bringing up DES is to say that squabbles over the merits of various modern encryption methods are beside the point if your goal is to evade impersonal surveillance.

Everyone agrees DES should be put out to pasture, but it doesn’t matter if you’re just trying to avoid surveillance. Things that not everyone agrees on matter even less.

What matters more than the algorithms is who holds the keys. A system in which you alone hold a DES key may give you more privacy than a system that holds an AES key for you. Companies may want to protect your private data from competitors but have no qualms about using it themselves.

Why surveillance matters

Why go to any effort to evade corporate surveillance?

Companies share data, and companies get hacked. One innocuous piece of data may be the link that connects two databases together. Things that you don’t mind revealing may unlock things you don’t want to reveal.

Related: A statistical problem with nothing to hide

Randomize, then humanize

Yesterday I wrote about a way to memorize a random 256-bit encryption key. This isn’t trivial, but it’s doable using memory techniques.

There’s a much easier way to create a memorable encryption key: start with something memorable, then apply a hash function. Why not just do that?
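For example, here’s a sketch of deriving a 256-bit key from a memorable phrase using SHA-256. The phrase is a placeholder, and the rest of this post explains why this isn’t the best approach.

    import hashlib

    phrase = "Humpty Dumpty sat on a wall"  # memorable, human-chosen input
    key = hashlib.sha256(phrase.encode()).digest()  # 256-bit key
    print(key.hex())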

There are two conflicting criteria to satisfy: cryptographic strength and ease of memorization. Any password is a compromise between these two goals.

You get better security if you generate a random password, then try to make it memorable through some technique for making it more palatable to a human mind. You get something easier to remember if you start with something human-friendly and apply some process to make it appear more random.

In a nutshell, you can either randomize then humanize, or humanize then randomize.

Humanize then randomize, or randomize then humanize

You get better ease of use if you humanize then randomize. You get better security if you randomize then humanize.

This morning I ran across a paper by Arnold Reinhold suggesting that people generate 10-letter passwords by first generating 10 random letters, then creating a mnemonic sentence with each word starting with one of the letters. Reinhold says that this leads to a greater variety of passwords than if you were to start with a mnemonic sentence and somehow reduce it to 10 letters. This is an example of the randomize-then-humanize pattern.
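Here’s a sketch of the first half of Reinhold’s method, drawing the letters from a cryptographically secure random number generator. The humanizing step, inventing the mnemonic sentence, is up to you.

    import secrets
    import string

    # Step 1: randomize. Ten random lowercase letters: 26^10 possibilities.
    letters = "".join(secrets.choice(string.ascii_lowercase) for _ in range(10))
    print(letters)
    # Step 2: humanize. Make up a sentence whose words start with these letters.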

Why?

There are more possibilities for an attacker to have to explore if you start with random input.

For example, suppose a site requires an 8-character password and you choose an 8-letter word. There are only about 30,000 English words with eight letters [1], and people are far more likely to choose some of these words than others. If you randomly choose a 3-character password using digits and letters (upper case and lower case) there are 62³ = 238,328 possibilities. A three-character random password is far better than an 8-character word.

In Reinhold’s example, there are 26¹⁰ possible passwords made of 10 lowercase letters. That’s over 100 trillion possibilities. There are certainly fewer than 100 trillion pass phrases that humans are likely to come up with. Say you want to use your favorite sentence from a famous book. Suppose there are 1,000 famous books and each has 10,000 sentences. That’s only 10 million possibilities.

Human-generated randomness

People are not that good at generating randomness. Here’s a passage from David Kahn’s book The Codebreakers about the results of asking typists to create pages of random numbers for use in one-time pads.

Interestingly, some pads seem to be produced by typists and not by machines. They show strike-overs and erasures — neither likely to be made by machines. More significant are statistical analyses of the digits. One such pad, for example, has seven times as many groups in which digits in the 1-to-5 group alternate with digits in the 6-to-0 group, like 18293, as a purely random arrangement would have. This suggests that the typist is striking alternately with her left hand (which would type the 1-to-5 group on a Continental machine) and her right (which would type the 6-to-0 group). Again, instead of just half the groups beginning with a low number, which would be expected in a random selection, three quarters of them do, possibly because the typist is spacing with her right hand, then starting a new group with her left. Fewer doubles and triples appear than chance expects.

How hacks work

Web sites implicitly use the humanize-then-randomize approach. When you create a password, the site hashes what you type and stores the hashed value. (A naive site might store the actual password.) Then the next time you log in, the site hashes your password input and compares it to the stored value.

If the site is hacked, and the site’s hashing algorithm is known, then many of these passwords can be recovered. This happens routinely. If you apply a hash function to a list of 10,000 common passwords, there are only 10,000 hash values, and you simply search the hacked list for these values. And since people often reuse passwords, someone who knows your password on one site can try that password on another site.
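Here’s a sketch of that attack, assuming the breached site used a bare, unsalted SHA-256 hash. (Salting and deliberately slow hash functions are what make this harder.) The password lists are stand-ins.

    import hashlib

    common_passwords = ["password123", "qwerty", "letmein"]  # stand-in for 10,000

    # Precompute a reverse lookup from hash back to password
    lookup = {hashlib.sha256(p.encode()).hexdigest(): p for p in common_passwords}

    leaked_hashes = [hashlib.sha256(b"letmein").hexdigest()]  # stand-in for a breach
    for h in leaked_hashes:
        if h in lookup:
            print("recovered password:", lookup[h])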

If you use a randomly generated password for each site, it’s less likely any individual password will be exposed. And if a password is exposed, a hacker cannot use it on another site.

[1] On my Macbook, grep '^........$' words | wc -l returned 30,001. You’d get different results from searching different word lists, but your results wouldn’t vary too much.

What is the point of a public key fingerprint?

Public key cryptography uses two keys: a private key and a public key. The nature of these keys depends on the encryption scheme, such as whether one is using RSA, ECC, or some other method, but you can think of a key as a long number. A key may be a short list of numbers, but let’s just say a key is a number.

When you generate a key pair, you get two numbers: one that you keep secret and one that you publicize. Anyone can use your public key to encrypt a message that only you can decrypt, assuming only you have your private key.

If I give you my public key, say I post it on my website, how can you be sure that it’s really my key? Maybe my site has been hacked and I don’t even know it. Or maybe someone tampers with the site content between the time it leaves my server and arrives at your browser (though TLS is supposed to address that).

We could get on a phone call and you could read the key back to me. The problem with this approach is that keys are long. A 4096-bit RSA key, for example, encoded in hexadecimal, is 1024 characters long.

You could read off the last so many characters of the key, maybe telling me that what you received ends in 9F0D 38FA. That would probably be sufficient to tell one key from another—it’s very unlikely you’d have two keys in a keychain that share the same last 32 bits—but that doesn’t mean the previous part of the key hasn’t been tampered with.

What you’d really like is a cryptographic hash of the public key, short enough to conveniently compare, but long enough that it would be infeasible for someone to alter the public key in such a way as to produce the same hash. And that’s exactly what a fingerprint is.

In GnuPG, the GNU implementation of PGP, a fingerprint is a 160-bit hash, which means 40 hexadecimal characters. Manually verifying 40 characters is feasible; manually verifying 1024 characters is not.
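GnuPG computes the fingerprint by hashing a particular serialization of the public key packet, but the idea is simply a short hash of a long key. A sketch, with random bytes standing in for a real key:

    import hashlib
    import secrets

    public_key = secrets.token_bytes(512)  # stand-in for a 4096-bit public key

    fingerprint = hashlib.sha1(public_key).hexdigest().upper()  # 160 bits
    # Group into the familiar 4-character blocks
    print(" ".join(fingerprint[i:i+4] for i in range(0, 40, 4)))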

Why not reuse passwords?

Perhaps you’ve heard that you should not reuse passwords but don’t understand why. After all, you have a really good password, one that nobody would ever guess, so why not use it everywhere?

Your password is not as good as you think

First of all, people who think they have a really good password are usually wrong. They implicitly assume that an attacker would have to recreate the devious process they went through to create the password. This is not the case. A password needs to be hard for an attacker to crack, not hard for the owner to create, and these are not the same thing. The former requires passwords that are long and randomly generated.

Credential stuffing

Suppose you use the same password on sites X and Y. Then site X gets hacked, revealing your password. Now your user name (most likely your email address) and password are part of a list posted online somewhere. Hackers then use scripts to see whether the same credentials will work somewhere else, and they often do. If they try site Y, then your account on Y has been compromised as well.

Now suppose you use a different username at the two sites. This is better, but the hack on X still means that your password has been added to a list of known passwords.

The worst case scenario is a site that stores passwords in the clear. This practice has long been discouraged, but the more times you reuse a password, the higher the chance that you’ll use your password at a site that does not hash passwords.

Hashing

Most sites these days don’t store your password per se but a cryptographic hash of your password. When you enter your password, the site hashes it and compares the result to the hashed value it has on file. If they match, you’re in.
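Here’s a sketch of how a careful site might do this, using a salted, deliberately slow hash (PBKDF2 here) rather than a bare hash:

    import hashlib
    import secrets

    def hash_password(password):
        """Return (salt, digest) to store; never store the password itself."""
        salt = secrets.token_bytes(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
        return salt, digest

    def verify(password, salt, stored):
        candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
        return secrets.compare_digest(candidate, stored)

    salt, stored = hash_password("correct horse battery staple")
    print(verify("correct horse battery staple", salt, stored))  # True
    print(verify("password123", salt, stored))                   # False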

If your hashed password for X is part of a breach, and the hashing algorithms for X and Y are known, an attacker can try hashing a list of known passwords and maybe one of them will match the hash of your password on Y.

If sites X and Y use different hashing algorithms, then a hash of your password for X is not directly useful to hacking your account on site Y. But it is possible to “unhash” passwords, especially if a site uses a hashing algorithm that has been broken. This takes a lot of computation, but people do it.

Example

This is not hypothetical. It has happened many times. For example, it was part of what led to the recent 23andMe breach. And when a hacker obtains one person’s family tree data, they obtain data on many other people at the same time. If you used a unique, long, randomly generated password on 23andMe but your cousin used password123, then your data may have been stolen.

What to do?

What can you do? You almost have to use a password manager. A strong password is necessarily hard to remember, and memorizing hundreds of strong passwords would be quite a chore. Still, you might want to memorize one or two strong passwords if they protect something really valuable.
