Why do medical tests always have error rates?

Most people implicitly assume medical tests are infallible. If they test positive for X, they assume they have X. Or if they test negative for X, they’re confident they don’t have X. Neither is necessarily true.

Someone recently asked me why medical tests always have an error rate. It’s a good question.

A test is necessarily a proxy, a substitute for something else. You don’t run a test to determine whether someone has a gunshot wound: you just look.

A test for a disease is based on a pattern someone discovered. People who have a certain condition usually have a certain chemical in their blood, and people who do not have the condition usually do not have that chemical in their blood. But it’s not a direct observation of the disease.

“Usually” is as good as you can do in biology. It’s difficult to make universal statements. When I first started working with genetic data, I said something to a colleague about genetic sequences being made of A, C, G, and T. I was shocked when he replied “Usually. This is biology. Nothing always holds.” It turns out some viruses have Zs (aminoadenine) rather than As (adenine).

Error rates may be very low and still be misleading. A positive test for a rare disease is usually a false positive, even if the test is highly accurate. This is a canonical example of the need to use Bayes’ theorem. The details are written up here.
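To make this concrete, here is a minimal sketch of the calculation in Python, using made-up numbers: a disease with 0.1% prevalence and a test with 99% sensitivity and 99% specificity.

    # Bayes' theorem with illustrative, made-up numbers
    prevalence  = 0.001  # P(disease): 1 in 1,000 people
    sensitivity = 0.99   # P(test positive | disease)
    specificity = 0.99   # P(test negative | no disease)

    # P(positive), by the law of total probability
    p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

    # P(disease | positive), by Bayes' theorem
    p_disease_given_positive = sensitivity * prevalence / p_positive

    print(p_disease_given_positive)  # about 0.09

Even with a test that is 99% accurate in both directions, a positive result here means only about a 9% chance of having the disease: the false positives from the large healthy population swamp the true positives from the small sick population.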

The more often you test for something, the more often you’ll find it, correctly or incorrectly. The more conditions you test, the more conditions you find, correctly or incorrectly.

Wouldn’t it be great if your toilet had a built in lab that tested you for hundreds of diseases several times a day? No, it would not! The result would be frequent false positives, leading to unnecessary anxiety and expense.

Up to this point we’ve discussed medical tests, but the same applies to any kind of test. Surveillance is a kind of test, one that also has false positives and false negatives. The more often Big Brother reads your email, the more likely it is that he will find something justifying a knock on your door.

Related posts

Are guidance documents laws?

Are guidance documents laws? No, but they can have legal significance.

The people who generate regulatory guidance documents are not legislators. Legislators delegate to agencies to make rules, and agencies delegate to other organizations to make guidelines. For example [1],

Even HHS, which has express cybersecurity rulemaking authority under the Health Insurance Portability and Accountability Act (HIPAA), has put a lot of the details of what it considers adequate cybersecurity into non-binding guidelines.

I’m not a lawyer, so nothing I say should be considered legal advice. However, the authors of [1] are lawyers.

The legal status of guidance documents is contested. According to [2], Executive Order 13892 said that agencies

may not treat noncompliance with a standard of conduct announced solely in a guidance document as itself a violation of applicable statutes or regulations.

Makes sense to me, but EO 13992 revoked EO 13892.

Again according to [3],

Under the common law, it used to be that government advisories, guidelines, and other non-binding statements were non-binding hearsay [in private litigation]. However, in 1975, the Fifth Circuit held that advisory materials … are an exception to the hearsay rule … It’s not clear if this is now the majority rule.

In short, it’s fuzzy.

Update: On June 28, 2024 the US Supreme Court overruled the 1984 decision Chevron v. Natural Resources Defense Council. The Chevron doctrine held that courts should defer to regulatory agencies’ reasonable interpretations of ambiguous statutes. In a pair of decisions, the Supreme Court held that the Chevron doctrine was a violation of separation of powers, giving executive agencies judicial power.

 

[1] Jim Dempsey and John P. Carlin. Cybersecurity Law Fundamentals, Second Edition, page 245.

[2] Ibid., page 199.

[3] Ibid., page 200.

 

Breach Safe Harbor

In the context of medical data, Safe Harbor typically refers to the Safe Harbor provisions of the HIPAA Privacy Rule explained here. Breach Safe Harbor is a little different. It basically means you’re off the hook if health data were encrypted when they were breached. (But not necessarily. More on that below.)

I’m not a lawyer, so this isn’t legal advice. Even HHS, which coined the term “Breach Safe Harbor” in its guidance portal, weasels out of saying it’s giving legal guidance by saying “The contents of this database lack the force and effect of law, except as authorized by law …”

Quality of encryption

You can’t just say that data were encrypted before they were breached. Weak encryption won’t cut it. You have to use acceptable algorithms and procedures.

How can you know whether you’ve encrypted data well enough to be covered by Breach Safe Harbor? HHS cites four NIST publications for further guidance. (Not that I’m giving legal advice. I’m merely citing HHS, which is also not giving legal advice.)

Here are the four publications.

  1. NIST SP 800-111, Guide to Storage Encryption Technologies for End User Devices
  2. NIST SP 800-52, Guidelines for the Selection, Configuration, and Use of Transport Layer Security (TLS) Implementations
  3. NIST SP 800-77, Guide to IPsec VPNs
  4. NIST SP 800-113, Guide to SSL VPNs

Maybe encryption isn’t enough

At one point Tennessee law said a breach of encrypted data was still a breach. According to Dempsey and Carlin [1],

In 2016, Tennessee repealed its encryption safe harbor, requiring notice of breach of even encrypted data, but then in 2017, after criticism, the state restored a safe harbor for “information that has been encrypted in accordance with the current version of the Federal Information Processing Standard (FIPS) 140-2 if the encryption key has not been acquired by an unauthorized person.”

This is interesting for a couple reasons. First, there is precedent for requiring notification of breaches of encrypted data. Second, it underscores the point above that encryption in general is not sufficient to avoid having to give notice of a breach: only standard-compliant encryption is sufficient.

Consulting help

If you would like technical or statistical advice on how to prevent or prepare for a data breach, or how to respond after a data breach, we can help.

 

[1] Jim Dempsey and John P. Carlin. Cybersecurity Law Fundamentals, Second Edition.

Uncovering names masked with stars

Sometimes I’ll see things like my name partially concealed as J*** C*** and think “a lot of good that does.”

Masking letters reveals more than people realize. For example, when you see that someone’s first name is four letters and begins with J, there’s about a 70% chance they’re male and a 44% chance they’re named John. If you know this person is male, there’s a 63% chance their name is John.

If you know a man’s name has the form J***, his name isn’t necessarily John, though that’s the most likely possibility. There’s an 8% chance his name is Jack and a 6% chance his name is Joel.

All these numbers depend on the data set you’re looking at, but these are roughly accurate numbers for looking at any representative sample of American names.
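If you want to experiment with this kind of calculation, here is a minimal sketch in Python. The frequency table is a made-up stand-in; real numbers would come from data such as Social Security Administration name counts.

    import re
    from collections import Counter

    # Hypothetical male first-name counts; real data would come from, e.g., SSA files
    name_counts = Counter({"John": 6300, "Jose": 1900, "Juan": 1500,
                           "Jack": 800, "Joel": 600, "Joey": 300, "Mark": 5000})

    def pattern_probabilities(pattern, counts):
        # Probability of each name, conditional on matching a masked pattern like J***
        regex = re.compile("^" + pattern.replace("*", ".") + "$")
        matches = {name: n for name, n in counts.items() if regex.match(name)}
        total = sum(matches.values())
        return {name: n / total for name, n in
                sorted(matches.items(), key=lambda kv: -kv[1])}

    print(pattern_probabilities("J***", name_counts))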

Some names stand out more than others. If I tell you someone’s name is E********, there’s a 90% chance the name is Elizabeth.

If I tell you someone’s name is B*****, there’s a 77% chance this person is female, but it’s harder to guess which name is hers. The most likely possibility is Brenda, but there are several other possibilities that are fairly likely: Bonnie, Brooke, Brandy, etc.

We could go through a similar exercise with last names. You can probably guess who S**** is, though C***** is not so clear.

In short, replacing letters with stars doesn’t do much to conceal someone’s name. It usually doesn’t let you infer someone’s name with certainty, but it definitely improves your chances of guessing correctly. If you have a few good guesses as to someone’s name, and some good guesses on a handful of other attributes, together you have a good chance of identifying someone.

Related posts

When is less data less private?

If I give you a database, I give you every row in the database. So if you delete some rows from the database, you have less information, not more, right?

This seems very simple, and it mostly is, but there are a couple subtleties.

A common measure in data privacy is k-anonymity. The idea is that if at least k individuals in a data set share some set of data values, and k is large enough, then the privacy of those individuals is protected.
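As an illustration, here is a minimal sketch in Python of computing k for a data set. The quasi-identifier columns are hypothetical.

    from collections import Counter

    # Each record is (ZIP code, year of birth, sex): hypothetical quasi-identifiers
    records = [
        ("77005", 1970, "M"),
        ("77005", 1970, "M"),
        ("77005", 1970, "M"),
        ("77030", 1985, "F"),
        ("77030", 1985, "F"),
    ]

    # k is the size of the smallest group of records sharing the same values
    k = min(Counter(records).values())
    print(k)  # 2, because one combination of values appears only twice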

Now suppose you randomly select a single record from a database that was deemed deidentified because it satisfied k-anonymity with k = 10. Your new dataset, consisting of only one record, is k-anonymous with k = 1: every record is unique because there’s only one record. But how is this person’s data any less private than it was before?

Note that I said above that you selected a record at random. If you selected the row using information that you know but which isn’t in the database, you might have implicitly added information. But if you select a subset of data, using only information explicit in that data, you haven’t added information.

Here’s where k-anonymity breaks down. The important measure is k-anonymity in the general population, not k-anonymity in a data set, unless you know that someone is in the data set.

If you find someone named John Cook in a data set, you probably haven’t found my information, even if there is only one person by that name in the data set. My name may or may not be common in that particular data set, but my name is common in general.

The number of times a combination of data fields appears in a data set gives a lower bound on how often the combination appears in general, so k-anonymity in a data set is a good sign for privacy. But the lack of k-anonymity is not necessarily a bad sign; it could just be an artifact of having a small data set.

Related posts

Frequency analysis

Suppose you have a list of encrypted surnames of US citizens. If the list is long enough, the encrypted name that occurs most often probably corresponds to Smith. The second most common encrypted name probably corresponds to Johnson, and so forth. This kind of inference is analogous to solving a cryptogram puzzle by counting letter frequencies.

The probability of correctly guessing the most common names by frequency analysis depends critically on the sample size. In a small sample, there may be no Smiths. In a larger sample, the name Smith may be common, but not the most common.

I did some simulations to estimate how well frequency analysis would work at identifying the 10 most common names as a function of the sample size N. For each N, I simulated 100 data sets using surname frequencies derived from US Census Bureau data.
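Here is a minimal sketch of such a simulation in Python. The frequency list is truncated and the numbers are illustrative; the actual simulation used the full Census table.

    import numpy as np

    # Illustrative probabilities for the most common surnames; everyone else
    # is lumped into "Other". Real values come from Census Bureau data.
    names = ["Smith", "Johnson", "Williams", "Brown", "Jones", "Other"]
    probs = [0.0083, 0.0066, 0.0055, 0.0051, 0.0050, 0.9695]

    rng = np.random.default_rng(20240213)

    def smith_on_top(N, trials=100):
        # Fraction of samples of size N whose most common surname is Smith
        hits = 0
        for _ in range(trials):
            sample = rng.choice(names, size=N, p=probs)
            values, counts = np.unique(sample, return_counts=True)
            observed = dict(zip(values, counts))
            top = max(names[:-1], key=lambda n: observed.get(n, 0))
            hits += (top == "Smith")
        return hits / trials

    print(smith_on_top(1_000))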

When N = 1,000, there was a 53% chance that the most common name in the population, Smith, would be the most common name in the sample. The second most common name in the population, Johnson, was the second most common name in the sample only 14% of the time.

When N = 10,000, there was a 94% chance of identifying Smith, and each of the five most common names was identified at least 30% of the time.

When N = 1,000,000, the three most common names were identified every time in the simulation, and each of the 10 most common names was correctly identified most of the time. In fact, each of the 18 most common names was correctly identified most of the time.

A consequence of this analysis is that hashing names does not protect privacy if the sample size is large. Hashing names along with other information, so that the combined data has a more uniform distribution, may protect privacy.
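To see why hashing alone doesn’t help, here is a minimal sketch. The names and record IDs are made up.

    import hashlib
    from collections import Counter

    names = ["Smith", "Johnson", "Smith", "Williams", "Smith"]

    # Deterministic hashing preserves the frequency distribution:
    # the most common digest is still Smith in disguise
    hashed = [hashlib.sha256(n.encode()).hexdigest() for n in names]
    print(Counter(hashed).most_common(1))

    # Hashing the name together with a distinct field (here hypothetical
    # record IDs) flattens the distribution, since equal names no longer collide
    ids = ["a1", "b2", "c3", "d4", "e5"]
    hashed2 = [hashlib.sha256((n + i).encode()).hexdigest() for n, i in zip(names, ids)]
    print(len(set(hashed2)))  # 5: all digests distinct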

Related posts

How much metadata is in a photo?

A few days ago I wrote about the privacy implications of metadata in a PDF. This post will do the same for photos.

Dalek on a Seattle train

You can see the metadata in a photo using exiftool. By default cameras include time and location data. I ran this tool on a photo I took in Seattle a few years ago when I was doing some work for Amazon. The tool reported 114 fields, some of which are redundant. Here is some of the information contained in the metadata.

    GPS Altitude  : 72.5 m Above Sea Level
    GPS Date/Time : 2017:05:05 17:47:33.31Z
    GPS Position  : 47 deg 36' 39.71" N, 122 deg 19' 59.40" W
    Lens ID       : iPhone SE back camera 4.15mm f/2.2
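You can also read EXIF data from Python. Here is a minimal sketch using the Pillow library; it assumes a file named photo.jpg and a reasonably recent version of Pillow.

    from PIL import Image, ExifTags

    exif = Image.open("photo.jpg").getexif()

    # Top-level EXIF tags, with numeric tag IDs translated to names
    for tag_id, value in exif.items():
        print(ExifTags.TAGS.get(tag_id, tag_id), ":", value)

    # GPS data is stored in a separate sub-IFD
    gps = exif.get_ifd(ExifTags.IFD.GPSInfo)
    for tag_id, value in gps.items():
        print(ExifTags.GPSTAGS.get(tag_id, tag_id), ":", value)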

How finely does this specify the location? The coordinates are given to 1/100 of a second, so 1/360000 of a degree. A degree of latitude is 111 km, so the implied accuracy is on the order of 30 cm or one foot, whether that’s correct or not.

You can look up the ground elevation at that location: 46 meters above sea level, which would imply the photo was taken on the 8th floor of a building. (It clearly wasn’t. Either the published ground elevation or the elevation recorded by the phone is incorrect.)

When I cropped the image, the edited image contained the software and operating system that was used to edit it.

    Platform    : Linux
    Software    : GIMP 2.10.30
    Modify Date : 2024:02:13 08:39:49

This shows that I edited the image this morning using GIMP installed on a Linux box.

You can change your phone’s settings to not include location data in photos. If you do, the photos may still include the time zone, which is a weak form of location data. You can remove some or all the metadata later using image editing software, but by default a photo reveals more than you may intend.

Related posts

Your PDF may reveal more than you intend

When you create a PDF file, what you see is not all you get. There is metadata embedded in the file that might be useful. It also might reveal information you’d rather not reveal.

The previous post looked at just the time stamp on a file. This post will look at more metadata, focusing on privacy implications.

Inspecting metadata

Here’s a little Python script we’ll use to inspect some of the metadata in a PDF. I say some because this does not pick out everything in every PDF.

    from pypdf import PdfReader

    def print_metadata(filename):
        # Print each metadata key/value pair found in the PDF
        print("File: ", filename, "\n")
        reader = PdfReader(filename)
        meta = reader.metadata
        for m in meta:
            print(m, meta[m])

Let’s run this on the “Hello world” example from the previous post.

    File:  humpty.pdf

    /Creator Writer
    /Producer LibreOffice 7.5
    /CreationDate D:20240208064322-06'00'

OK, so this shows that the file was created with LibreOffice Writer, version 7.5.

Time and location

It also shows when the file was written. As I discussed in the previous post, the file was written today at 6:43:22. But what I didn’t comment on before was the -6'00' at the end. This is my time zone, six hours behind GMT, i.e. US Central Standard Time.

Note that the time zone isn’t just time information, it’s also location information. It’s no secret that I live in Houston, but if I didn’t want to reveal my location, this time stamp would partially give away where I live. (Probably. Strictly speaking it reveals the time zone setting on my computer.)

Microsoft Word files

I repeated my “Hello world” file experiment with Microsoft Word on an old laptop. When I exported to PDF I got the following.

    /Author John Cook
    /Creator Microsoft® Word 2016
    /CreationDate D:20240208101055-06'00'
    /ModDate D:20240208101055-06'00'
    /Producer Microsoft® Word 2016

So this includes my name. The installation program for Microsoft Office asks for your name, and I must have provided it. Either LibreOffice doesn’t ask or I didn’t enter it.

When I print to PDF rather than export to PDF I get slightly different output.

    /Author John
    /CreationDate D:20240208101220-06'00'
    /ModDate D:20240208101220-06'00'
    /Producer Microsoft: Print To PDF
    /Title Microsoft Word - Document1

LaTeX files

Now let’s look at a PDF created from a LaTeX file. I created a file foo.tex with the following content

    \documentclass{article}
    \begin{document}
    Hello world.
    \end{document}

then compiled it with pdflatex foo.tex. Let’s see what metadata our Python code can find.

    /Producer pdfTeX-1.40.25
    /Creator TeX
    /CreationDate D:20240208075059-06'00'
    /ModDate D:20240208075059-06'00'
    /Trapped /False
    /PTEX.Fullbanner This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023/MacPorts 2023.66589_1) kpathsea version 6.3.5

Obviously the file was created with TeX [1]. You can usually identify TeX files by their appearance. You can make a TeX file look less distinctive by changing the default font and a few other things. But if you did so without changing the metadata, someone could still determine that the file was made using TeX.

I’m not trying to conceal that I use LaTeX. But if you create a PDF with an obscure program, maybe that reveals more than you’d like to reveal.

Operating system

You can see that the file was produced on a Mac. When I compiled the same file on my Linux desktop, it showed the operating system as Debian but was not any more specific.

When you see that a file was created using Microsoft Word, it was probably created on Windows. I don’t have Word on my Mac, but I wouldn’t be surprised if the application was reported to be something like Office for MacOS rather than just Word.

I created a document with Microsoft 365 online and it reported the following.

    /Author John Cook
    /Creator Microsoft Word
    /CreationDate D:20240208084209-08'00'
    /ModDate D:20240208084209-08'00'

The lack of an operating system in the Creator field may indicate that the document was created online. Note that the time zone is −8, i.e. Pacific Standard Time. This isn’t my time zone but the time zone of the server, perhaps in Seattle.

Related posts

[1] LaTeX is written on top of TeX. The metadata says the file was created with TeX, because ultimately it really was.

The Five Safes data privacy framework

The Five Safes decision framework was created a couple decades ago by Felix Ritchie at the UK Office for National Statistics. It is a framework for evaluating the safe use of confidential data, particularly by government agencies. You can find a description of the Five Safes, for example, in NIST SP 800-188.

The Five Safes are

  1. Safe projects
  2. Safe people
  3. Safe settings
  4. Safe data
  5. Safe outputs

Safe projects asks whether the use of the data is appropriate. It doesn’t matter how safe the access controls and so forth are if the project itself is inappropriate.

Safe people asks whether the users can be trusted to use the data in an appropriate manner. For health care data, for example, one could ask whether users have had HIPAA training.

Safe settings applies to physical access. Does the facility hosting the data limit unauthorized access?

Safe data asks about statistical disclosure control, whether the data itself poses a disclosure risk. For example, have the data been adequately deidentified?

Safe outputs asks whether the output of the project poses a privacy risk.

Various approaches to data privacy have different trade-offs between the Five Safes. Differential privacy focuses on safe outputs. There are mathematical guarantees that the outputs satisfy a certain definition of privacy. The data itself is regarded as unsafe, and so it is important that the people and settings are safe.

HIPAA expert determination focuses on safe data. Often there is a sort of firewall with data considered safe on one side for one set of reasons (patient consent, a BAA contract, etc.) and considered safe on the other side of the wall because the data itself is safe, i.e. properly deidentified.

Safe Harbor is unrelated to the Five Safes. Safe Harbor is a provision under the HIPAA Privacy Rule for deeming certain data safe. Whether the Safe Harbor rules actually result in safe data depends on context. Data may comply with the letter of the law by appealing to Safe Harbor, and yet not protect individuals in the data from being identified.

If you would like help evaluating the privacy aspects of a data analysis project, let’s talk.

Weak encryption and surveillance

Two of the first things you learn in cryptography are that simple substitution ciphers are very easy to break, and that security by obscurity is a bad idea. This post will revisit both of these ideas.

Security depends on your threat model. If the threat you want to protect against is a human reading your encrypted messages, then simple substitution ciphers are indeed weak. Anyone capable of working a cryptogram in a puzzle book is capable of breaking your encryption.

But if your threat model is impersonal surveillance, things are different. It might not take much to thwart a corporation scanning your email for ideas of what to advertise to you. Even something as simple as rot13 might be enough.

The point of this article is not to recommend rot13, or any other simple substitution cipher, but to elaborate on the idea that evading commercial surveillance is different from evading intelligence agencies. It’s easier to evade bots than spooks.

If you’re the target of a federal investigation, the most sophisticated encryption at your disposal may be inadequate. But if you’re the target of an advertising company, things are much easier.

Rot13

Rot13 works by moving letters ahead 13 positions in the alphabet. The nth letter goes to the (n + 13)th letter mod 26. So A’s become N’s, and N’s become A’s, etc. Rot13 offers no security at all, but it is used to prevent text from being immediately recognizable. A common application is to publish the punchline of jokes.

Rot13 converts the word “pyrex” to “clerk.” If you use the word “pyrex” in an email, but rot13 encrypt your message, you’re unlikely to see advertisements for glassware unless you also talk about a clerk.
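Here is a minimal sketch of rot13 in Python, both by hand and via the rot13 codec in the standard library:

    import codecs
    import string

    def rot13(text):
        # The nth letter goes to the (n + 13)th letter mod 26
        a, A = string.ascii_lowercase, string.ascii_uppercase
        table = str.maketrans(a + A, a[13:] + a[:13] + A[13:] + A[:13])
        return text.translate(table)

    print(rot13("pyrex"))                   # clerk
    print(codecs.encode("clerk", "rot13"))  # pyrex: rot13 is its own inverse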

It is conceivable, but doubtful, that impersonal surveillance software would try to detect rot13 encoding, even though this would be easy to do. But detecting and decrypting simple substitution ciphers in general, while certainly possible, would not be worth the effort. Security by obscurity could protect you from surveillance because it’s not profitable for mass surveillance to pursue anything obscure. For example, the Playfair cipher, broken over a century ago, would presumably throw off bots.

Modern but deprecated encryption

Simple substitution ciphers are ancient. Modern encryption methods, even deprecated methods like DES, are far more secure. For example, DES (Data Encryption Standard) is considered obsolete because it can be broken by a multiprocessor machine in under 24 hours. However, commercial surveillance is unwilling to spend processor-days to decrypt one person’s communication.

This is not to recommend DES. If it’s just as easy to use much better algorithms like AES (Advanced Encryption Standard) then why not do that? My purpose in bringing up DES is to say that squabbles over the merits of various modern encryption methods are beside the point if your goal is to evade impersonal surveillance.

Everyone agrees DES should be put out to pasture, but it doesn’t matter if you’re just trying to avoid surveillance. Things that not everyone agrees on matter even less.

What matters more than the algorithms is who holds the keys. A system in which you alone hold a DES key may give you more privacy than a system that holds an AES key for you. Companies may want to protect your private data from competitors but have no qualms about using it themselves.

Why surveillance matters

Why go to any effort to evade corporate surveillance?

Companies share data, and companies get hacked. One innocuous piece of data may be the link that connects two databases together. Things that you don’t mind revealing may unlock things you don’t want to reveal.

Related: A statistical problem with nothing to hide