Deleting vs Replacing Names

Posted on 19 June 2025 by John

This post looks at whether you should delete names or replace names when deidentifying personal data. With structured data, generating synthetic names does not increase or decrease privacy. But with unstructured data, replacing real names with randomly generated names increases privacy protection.

Structured data

If you want to deidentify structured data (i.e. data separated into columns in a database) then an obvious first step is to remove a column containing names. This is not sufficient—the reason deidentification is subtle is that it is possible to identify people after the obvious identifiers have been removed—but clearly removing names is necessary.

Instead of removing the names, you could replace them with randomly generated names. This neither hurts nor helps privacy. But this might be useful, say, when creating test data.

Say you’re developing software that handles patient data. You want to use real patient data because you want the test data to be realistic, but that may not be acceptable (or legal) due to privacy risks. You can’t just remove names because you need to test whether software displays names correctly, so you replace real names with randomly generated names. No harm, no foul.

Unstructured data

Now let’s switch contexts and look at unstructured data, such as text transcriptions of doctors’ notes. You still need to remove names, but it’s not as simple as removing the name column. Instead of the database structure telling you what’s a name, what’s a diagnosis, etc. you have to infer these things statistically [1], which means there will always be errors [2].

When working with unstructured text, replacing names with randomly generated names is better for privacy. If you could identify names with 100% accuracy, then it would make no difference whether you removed or replaced names, but you can’t. Software will always miss some names [3].

Suppose your software replaces suspected names with “NAME REDACTED.” When your software fails to identify a name, it’s obvious that it failed. For example,

NAME REDACTED and his wife Margaret came into the clinic today to be tested for …

But if instead your software replaced suspected names with synthetic names, it is not obvious when the software fails. When you see a reference to Margaret, you can’t know whether the patient’s name was Virginia and was replaced with Margaret or whether the software made an error skipped over Margaret’s name.

All else being equal, it’s better to synthesize names than remove names. But that doesn’t mean that just any process for synthesizing names will be adequate. The error rate doesn’t need to be zero, but it can’t be too high either. And the process should not be biased. If the software consistently left Hispanic names slip through, for example, Hispanic patients would not appreciate that.

[1] “But what if I use a large language model?” That’s a particular kind of statistical inference.

[2] Any statistical test, such as testing whether a string of text represents a name, will have false positives (type I error) and false negatives (type II error). Software to remove names will remove some text that isn’t a name, and fail to recognize some text as names.

[3] We’ve tested a lot of deidentification software packages for clients, and the error rate is never zero. But privacy regulations such as HIPAA don’t require the error rate to be zero, only sufficiently small.

Confidential OCR

Posted on 20 November 2024 by John

A client emailed me a screenshot of a table rather than pasting the table as text into an email.

I thought about using an LLM to convert it to text, but the table is confidential client information and so I shouldn’t upload it anywhere.

I searched for a commandline utility to do OCR and found tesseract. I installed it with

    sudo apt install tesseract-ocr libtesseract-dev tesseract-ocr-eng

and ran it with the default settings

    tesseract screenshot.png textfile

It worked remarkably well. I had to change a C to a U, but otherwise I didn’t have to add or change any text, but I did have to delete a few extraneous parentheses generated by the software.

I work locally in part out of habit; it was the only way to work when I started using a computer. It has numerous advantages, such as being able to keep working when a hurricane knocks out my internet connection, but above all it is private.

I pay more attention to privacy than is convenient because I work in data privacy. And aside from my privacy, I have to protect our clients’ privacy.

Update: According to the comments, ChatGPT uses tesseract. Assuming that’s true, using tesseract directly is better than ChatGPT because it does exactly what you want. No ambiguity as far as what expected. No potential for tinkering with your results before you see them.

Why do medical tests always have error rates?

Posted on 22 July 2024 by John

Most people implicitly assume medical tests are infallible. If they test positive for X, they assume they have X. Or if they test negative for X, they’re confident they don’t have X. Neither is necessarily true.

Someone recently asked me why medical tests always have an error rate. It’s a good question.

A test is necessarily a proxy, a substitute for something else. You don’t run a test to determine whether someone has a gunshot wound: you just look.

A test for a disease is based on a pattern someone discovered. People who have a certain condition usually have a certain chemical in their blood, and people who do not have the condition usually do not have that chemical in their blood. But it’s not a direct observation of the disease.

“Usually” is as good as you can do in biology. It’s difficult to make universal statements. When I first started working with genetic data, I said something to a colleague about genetic sequences being made of A, C, G, and T. I was shocked when he replied “Usually. This is biology. Nothing always holds.” It turns out some viruses have a Zs (aminoadenine) rather than As (adenine).

Error rates may be very low and still be misleading. A positive test for rare disease is usually a false positive, even if the test is highly accurate. This is a canonical example of the need to use Bayes theorem. The details are written up here.

The more often you test for something, the more often you’ll find it, correctly or incorrectly. The more conditions you test, the more conditions you find, correctly or incorrectly.

Wouldn’t it be great if your toilet had a built in lab that tested you for hundreds of diseases several times a day? No, it would not! The result would be frequent false positives, leading to unnecessary anxiety and expense.

Up to this point we’ve discussed medical tests, but the same applies to any kind of test. Surveillance is a kind of test, one that also has false positives and false negatives. The more often Big Brother reads you email, the more likely it is that he will find something justifying a knock on your door.

Are guidance documents laws?

Posted on 5 April 2024 by John

Are guidance documents laws? No, but they can have legal significance.

The people who generate regulatory guidance documents are not legislators. Legislators delegate to agencies to make rules, and agencies delegate to other organizations to make guidelines. For example [1],

Even HHS, which has express cybersecurity rulemaking authority under the Health Insurance Portability and Accountability Act (HIPAA), has put a lot of the details of what it considers adequate cybersecurity into non-binding guidelines.

I’m not a lawyer, so nothing I can should be considered legal advice. However, the authors of [1] are lawyers.

The legal status of guidance documents is contested. According to [2], Executive Order 13892 said that agencies

may not treat noncompliance with a standard of conduct announced solely in a guidance document as itself a violation of applicable statutes or regulations.

Makes sense to me, but EO 13992 revoked EO 13892.

Again according to [3],

Under the common law, it used to be that government advisories, guidelines, and other non-binding statements were non-binding hearsay [in private litigation]. However, in 1975, the Fifth Circuit held that advisory materials … are an exception to the hearsay rule … It’s not clear if this is now the majority rule.

In short, it’s fuzzy.

Update: On June 28, 2024 the US Supreme Court overruled the 1984 decision Chevron v. Natural Resources Defense Council. The Chevron doctrine held that courts should defer to regulatory agencies regarding the interpretation of regulations. In a pair of decisions, the Supreme Court held that the Chevron doctrine was a violation of separation of powers, giving executive agencies judicial power.

[1] Jim Dempsey and John P. Carlin. Cybersecurity Law Fundamentals, Second Edition, page 245.

[2] Ibid., page 199.

[3] Ibid., page 200.

Breach Safe Harbor

Posted on 21 March 2024 by John

In the context of medical data, Safe Harbor typically refers to the Safe Harbor provisions of the HIPAA Privacy Rule explained here. Breach Safe Harbor is a little different. It basically means you’re off the hook if you breach encrypted health data. (But not necessarily. More on that below.)

I’m not a lawyer, so this isn’t legal advice. Even the HHS, who coin the term “Breach Safe Harbor” in their guidance portal, weasels out of saying they’re giving legal guidance by saying “The contents of this database lack the force and effect of law, except as authorized by law …”

Quality of encryption

You can’t just say that data were encrypted before they were breached. Weak encryption won’t cut it. You have to use acceptable algorithms and procedures.

How can you know whether you’ve encrypted data well enough to be covered Breach Safe Harbor? HHS cites four NIST publications for further guidance. (Not that I’m giving legal advice. I’m merely citing the HHS, who also is not giving legal advice.)

Here are the four publications.

Maybe encryption isn’t enough

At one point Tennessee law said a breach of encrypted data was still a breach. According to Dempsey and Carlin [1]

In 2016, Tennessee repealed its encryption safe harbor, requiring notice of breach of even encrypted data, but then in 2017, after criticism, the state restored a safe harbor for “information that has been encrypted in accordance with the current version of the Federal Information Processing Standard (FIPS) 140-2 if the encryption key has not been acquired by an unauthorized person.”

This is interesting for a couple reasons. First, there is a precedent for requiring notification of encrypted data. Second, this underscores the point above that encryption in general is not sufficient to avoid having to give notice of a breach: standard-compliant encryption is sufficient.

Consulting help

If you would like technical or statistical advice on how to prevent or prepare for a data breach, or how to respond after a data breach after the fact, we can help.

[1] Jim Dempsey and John P. Carlin. Cybersecurity Law Fundamentals, Second Edition.

Uncovering names masked with stars

Posted on 23 February 2024 by John

Sometimes I’ll see things like my name partially concealed as J*** C*** and think “a lot of good that does.”

Masking letters reveals more than people realize. For example, when you see that someone’s first name is four letters and begins with J, there’s about a 70% chance they’re male and there’s a 44% chance they’re named John. If you know this person is male, there’s a 63% chance they’re name is John.

If you know a man’s name has the form J***, his name isn’t necessarily John, though that’s the most likely possibility. There’s a 8% chance his name is Jack and a 6% chance his name is Joel.

All these numbers depend on the data set you’re looking at, but these are roughly accurate numbers for looking at any representative sample of American names.

Some names stand out more than others. If I tell you someone’s name is E********, there’s a 90% chance the name is Elizabeth.

If I tell you someone’s name is B*****, there’s a 77% chance this person is female, but it’s harder to guess which name is hers. The most likely possibility is Brenda, but there are several other possibilities that are fairly likely: Bonnie, Brooke, Brandy, etc.

We could go through a similar exercise with last names. You can probably guess who S**** is, though C***** is not so clear.

In short, replacing letters with stars doesn’t do much to conceal someone’s name. It usually doesn’t let you infer someone’s name with certainty, but it definitely improves your chances of guessing correctly. If you have a few good guesses as to someone’s name, and some good guesses on a handful of other attributes, together you have a good chance of identifying someone.

When is less data less private?

Posted on 22 February 2024 by John

If I give you a database, I give you every row in the database. So if you delete some rows from the database, you have less information, not more, right?

This seems very simple, and it mostly is, but there are a couple subtleties.

A common measure in data privacy is k-anonymity. The idea is that if at least k individuals in a data set share some set of data values, and k is large enough, then the privacy of those individuals is protected.

Now suppose you randomly select a single record from a database that was deemed deidentified because it satisfied k-anonymity with k = 10. Now your new dataset, consisting of only one record, is k-anonymous with k = 1: every record is unique because there’s only one record. But how is this person’s data any less private that it was before?

Note that I said above that you selected a record at random. If you selected the row using information that you know but which isn’t in the database, you might have implicitly added information. But if you select a subset of data, using only information explicit in that data, you haven’t added information.

Here’s where k-anonymity breaks down. The important measure is k-anonymity in the general population, not k-anonymity in a data set, unless you know that someone is in the data set.

If you find someone named John Cook in a data set, you probably haven’t found my information, even if there is only one person by that name in the data set. My name may or may not be common in that particular data set, but my name is common in general.

The number of times a combination of data fields gives a lower bound on how often the combination appears in general, so k-anonymity in a data set is a good sign for privacy, but the lack of k-anonymity is not necessarily a bad sign. The latter could just be an artifact of having a small data set.

Frequency analysis

Posted on 17 February 2024 by John

Suppose you have a list of encrypted surnames names of US citizens. If the list is long enough, the encrypted name that occurs most often probably corresponds to Smith. The second most common encrypted name probably corresponds to Johnson, and so forth. This kind of inference is analogous to solving a cryptogram puzzle by counting letter frequencies.

The probability of correctly guessing the most common names based on frequency analysis depends critically in the sample size. In a small sample, there may be no Smiths. In a larger sample, the name Smith may be common, but not the most common.

I did some simulations to estimate how well frequency analysis would work at identifying the 10 most common names as a function of the sample size N. For each N, I simulated 100 data sets using probabilities derived from the surname frequencies derived from US Census Bureau data.

When N = 1,000, there was a 53% chance that the most common name in the population, Smith, would be the most common name in the sample. The second most common name in the population, Johnson, was the second most common name in the sample only 14% of the time.

When N = 10,000, there was a 94% chance of identifying Smith, and at least a 30% chance of identifying the five most common names.

When N = 1,000,000, the three most common names were identified every time in the simulation. And each of the 10 most common names were correctly identified most of the time. In fact, the 18 most common names were correctly identified most of the time.

A consequence of this analysis is that hashing names does not protect privacy if the sample size is large. Hashing names along with other information, so that the combined data has a more uniform distribution, may protect privacy.

How much metadata is in a photo?

Posted on 13 February 2024 by John

A few days ago I wrote about the privacy implications of metadata in a PDF. This post will do the same for photos.

Dalek on a Seattle train

You can see the metadata in a photo using exiftool. By default cameras include time and location data. I ran this tool on a photo I took in Seattle a few years ago when I was doing some work for Amazon. The tool reported 114 fields, some of which are redundant. Here is some of the information contained in the metadata.

GPS Altitude  : 72.5 m Above Sea Level
GPS Date/Time : 2017:05:05 17:47:33.31Z
GPS Position  : 47 deg 36' 39.71" N, 122 deg 19' 59.40" W
Lens ID       : iPhone SE back camera 4.15mm f/2.2

How finely does this specify the location? The coordinates are given to 1/100 of a second, so 1/360000 of a degree. A degree of latitude is 111 km, so the implied accuracy is on the order of 30 cm or one foot, whether that’s correct or not.

You can look up that ground level at that location is 46 meters above sea level, which would imply the photo was taken on the 8th floor of a building. (It clearly wasn’t. Either the elevation of ground level or the elevation recorded in the phone isn’t correct.)

When I cropped the image, the edited image contained the software and operating system that was used to edit it.

Platform    : Linux
Software    : GIMP 2.10.30
Modify Date : 2024:02:13 08:39:49

This shows that I edited the image this morning using GIMP installed on a Linux box.

You can change your phone’s settings to not include location data in photos. If you do, the photos may still include the time zone, which is a weak form of location data. You can remove some or all the metadata later using image editing software, but by default a photo reveals more than you may intend.

Your PDF may reveal more than you intend

Posted on 8 February 2024 by John

When you create a PDF file, what you see is not all you get. There is metadata embedded in the file that might be useful. It also might reveal information you’d rather not reveal.

The previous post looked at just the time stamp on a file. This post will look at more metadata, focusing on privacy implications.

Inspecting metadata

Here’s a little Python script we’ll use to inspect some of the metadata in a PDF. I say some because this does not pick out everything in every PDF.

    from pypdf import PdfReader

    def print_metadata(filename):
        print("File: ", filename, "\n")    
        reader = PdfReader(filename)
        meta = reader.metadata
        for m in meta:
            print(m, meta[m])

Let’s run this on the “Hello world” example from the previous post.

    File:  humpty.pdf

    /Creator Writer
    /Producer LibreOffice 7.5
    /CreationDate D:20240208064322-06'00'

OK, so this shows that the file was created with LibreOffice Writer, version 7.5.

Time and location

It also shows when the file was written. As I discussed in the previous post, the file was written today at 6:43:22. But what I didn’t comment on before was the -6'00' at the end. This is my time zone, six hours behind GMT, i.e. US Central Standard Time.

Note that the time zone isn’t just time information, it’s also location information. It’s no secret that I live in Houston, but if I didn’t want to reveal my location, this time stamp would partially give away where I live. (Probably. Strictly speaking it reveals the time zone setting on my computer.)

Microsoft Word files

I repeated my “Hello world” file experiment with Microsoft Word on an old laptop. When I exported to PDF I got the following.

    /Author John Cook
    /Creator Microsoft® Word 2016
    /CreationDate D:20240208101055-06'00'
    /ModDate D:20240208101055-06'00'
    /Producer Microsoft® Word 2016

So this includes my name. The installation program for Microsoft Office asks for your name, and I must have provided it. Either LibreOffice doesn’t ask or I didn’t enter it.

When I print to PDF rather than export to PDF I get slightly different output.

    /Author John
    /CreationDate D:20240208101220-06'00'
    /ModDate D:20240208101220-06'00'
    /Producer Microsoft: Print To PDF
    /Title Microsoft Word - Document1

LaTeX files

Now let’s look at a PDF created from a LaTeX file. I created a file foo.tex with the following content

    \documentclass{article}
    \begin{document}
    Hello world.
    \end{document}

then compiled it with pdflatex foo.tex. Let’s see what metadata our Python code can find.

    /Producer pdfTeX-1.40.25
    /Creator TeX
    /CreationDate D:20240208075059-06'00'
    /ModDate D:20240208075059-06'00'
    /Trapped /False
    /PTEX.Fullbanner This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023/MacPorts 2023.66589_1) kpathsea version 6.3.5

Obviously the file was created with TeX [1]. You can usually identify TeX files by their appearance. You can make a TeX file look less distinctive by changing the default font and a few other things. But if you did so without changing the metadata, someone could still determine that the file was made using TeX.

I’m not trying to conceal that I use LaTeX. But if you create a PDF with an obscure program, maybe that reveals more than you’d like to reveal.

Operating system

You can see that the file was produced on a Mac. When I compiled the same file on my Linux desktop, it showed the operating system as Debian but was not any more specific.

When you see that a file was created using Microsoft Word, it was probably created on Windows. I don’t have Word on my Mac, but I wouldn’t be surprised if the application was reported to be something like Office for MacOS rather than just Word.

I created a document with Microsoft 365 online and it reported the following.

    /Author John Cook
    /Creator Microsoft Word
    /CreationDate D:20240208084209-08'00'
    /ModDate D:20240208084209-08'00'

The lack of an operating system in the Creator field may indicate that the document was created online. Note that the time zone is −8, i.e. Pacific Standard Time. This isn’t my time zone but the time zone of the server, perhaps in Seattle.

[1] LaTeX is written on top of TeX. The metadata says the file was created with TeX, because ultimately it really was.

Privacy

Deleting vs Replacing Names

Structured data

Unstructured data

Related posts

Confidential OCR

Related posts

Why do medical tests always have error rates?

Related posts

Are guidance documents laws?

Breach Safe Harbor

Quality of encryption

Maybe encryption isn’t enough

Consulting help

Uncovering names masked with stars

Related posts

When is less data less private?

Related posts

Frequency analysis

Related posts

How much metadata is in a photo?

More metadata posts

Related posts

Your PDF may reveal more than you intend

Inspecting metadata

Time and location

Microsoft Word files

LaTeX files

Operating system

Related posts