How much metadata is in a photo?

Posted on 13 February 2024 by John

A few days ago I wrote about the privacy implications of metadata in a PDF. This post will do the same for photos.

Dalek on a Seattle train

You can see the metadata in a photo using exiftool. By default cameras include time and location data. I ran this tool on a photo I took in Seattle a few years ago when I was doing some work for Amazon. The tool reported 114 fields, some of which are redundant. Here is some of the information contained in the metadata.
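If you'd rather collect the fields programmatically than read them off the terminal, something like the following Python sketch would work. It assumes the exiftool command is installed and on your PATH, and it relies on exiftool's `-json` option, which prints a JSON list with one object per file. The function names and the file name are placeholders of mine.

```python
import json
import subprocess

def parse_exiftool_json(text):
    """Return the tag dictionary for the first (only) file in the output."""
    return json.loads(text)[0]

def read_metadata(path):
    """Run exiftool on one file and return its tags as a dict.

    Assumes exiftool is installed and on the PATH.
    """
    result = subprocess.run(
        ["exiftool", "-json", path],
        capture_output=True, text=True, check=True,
    )
    return parse_exiftool_json(result.stdout)

# Example use (photo.jpg is a placeholder file name):
# tags = read_metadata("photo.jpg")
# print(len(tags), "fields")
```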

GPS Altitude  : 72.5 m Above Sea Level
GPS Date/Time : 2017:05:05 17:47:33.31Z
GPS Position  : 47 deg 36' 39.71" N, 122 deg 19' 59.40" W
Lens ID       : iPhone SE back camera 4.15mm f/2.2

How finely does this specify the location? The coordinates are given to 1/100 of a second, so 1/360000 of a degree. A degree of latitude is 111 km, so the implied accuracy is on the order of 30 cm or one foot, whether that’s correct or not.
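As a rough check on that arithmetic, here is a small Python sketch that converts the position above to decimal degrees and computes the distance corresponding to 0.01 arc second of latitude. The function name `dms_to_decimal` is mine, and the 111 km per degree figure is the same round number used above.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees, minutes, seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative by convention.
    return -value if hemisphere in ("S", "W") else value

lat = dms_to_decimal(47, 36, 39.71, "N")
lon = dms_to_decimal(122, 19, 59.40, "W")

KM_PER_DEGREE_LAT = 111             # round figure from the text
steps_per_degree = 60 * 60 * 100    # hundredths of an arc second per degree
resolution_m = KM_PER_DEGREE_LAT * 1000 / steps_per_degree

print(f"{lat:.6f}, {lon:.6f}")
print(f"{resolution_m:.2f} m per 0.01 arc second of latitude")
```

East to west the same step is shorter by a factor of the cosine of the latitude, roughly two thirds at Seattle's latitude.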

If you look up the ground elevation at that location, you find it is 46 meters above sea level, which would imply the photo was taken on the 8th floor of a building. (It clearly wasn’t. Either the elevation of ground level or the elevation recorded in the phone isn’t correct.)
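The floor estimate is just a difference and a division. The sketch below assumes a floor-to-floor height of about 3.3 meters, which is my assumption and not something in the metadata.

```python
gps_altitude_m = 72.5    # from the photo's metadata
ground_level_m = 46.0    # looked-up ground elevation quoted above
floor_height_m = 3.3     # assumption: typical floor-to-floor height

height_above_ground = gps_altitude_m - ground_level_m
floors_up = height_above_ground / floor_height_m

print(f"{height_above_ground:.1f} m above ground, about {floors_up:.0f} floors up")
```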

When I cropped the image, the edited image contained the software and operating system that was used to edit it.

Platform    : Linux
Software    : GIMP 2.10.30
Modify Date : 2024:02:13 08:39:49

This shows that I edited the image this morning using GIMP installed on a Linux box.

You can change your phone’s settings to not include location data in photos. If you do, the photos may still include the time zone, which is a weak form of location data. You can remove some or all of the metadata later using image editing software, but by default a photo reveals more than you may intend.
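exiftool can also do the removing. Here is a hedged Python sketch that builds the command: `-all=` clears every tag exiftool is able to write, and `-gps:all=` clears only the GPS group. The function names are mine, and the sketch assumes exiftool is installed. By default exiftool keeps a backup of the original file with `_original` appended to its name.

```python
import subprocess

def strip_command(path, gps_only=False):
    """Build the exiftool command line for removing metadata from one file."""
    tag = "-gps:all=" if gps_only else "-all="
    return ["exiftool", tag, path]

def strip_metadata(path, gps_only=False):
    """Run exiftool to remove metadata. Assumes exiftool is on the PATH."""
    subprocess.run(strip_command(path, gps_only), check=True)

# Example use (photo.jpg is a placeholder file name):
# strip_metadata("photo.jpg", gps_only=True)
```

It is worth running exiftool on the result afterward to confirm what is left.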

Your PDF may reveal more than you intend

Posted on 8 February 2024 by John

When you create a PDF file, what you see is not all you get. There is metadata embedded in the file that might be useful. It also might reveal information you’d rather not reveal.

The previous post looked at just the time stamp on a file. This post will look at more metadata, focusing on privacy implications.

Inspecting metadata

Here’s a little Python script we’ll use to inspect some of the metadata in a PDF. I say some because this does not pick out everything in every PDF.

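    from pypdf import PdfReader

    def print_metadata(filename):
        """Print each entry in the PDF's document information dictionary."""
        print("File: ", filename, "\n")
        reader = PdfReader(filename)
        # reader.metadata is None when the PDF has no information dictionary
        meta = reader.metadata or {}
        for m in meta:
            print(m, meta[m])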

Let’s run this on the “Hello world” example from the previous post.

    File:  humpty.pdf

    /Creator Writer
    /Producer LibreOffice 7.5
    /CreationDate D:20240208064322-06'00'

OK, so this shows that the file was created with LibreOffice Writer, version 7.5.

Time and location

It also shows when the file was written. As I discussed in the previous post, the file was written today at 6:43:22. But what I didn’t comment on before was the -06'00' at the end. This is my time zone, six hours behind GMT, i.e. US Central Standard Time.

Note that the time zone isn’t just time information, it’s also location information. It’s no secret that I live in Houston, but if I didn’t want to reveal my location, this time stamp would partially give away where I live. (Probably. Strictly speaking it reveals the time zone setting on my computer.)
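To make the time zone explicit, here is a minimal Python sketch that parses a date string like the one above into a timezone-aware `datetime`. The function name `parse_pdf_date` is mine. The sketch only handles the full form with an explicit offset, as in these examples; PDF dates can also be abbreviated or end in `Z`, and those cases are not covered.

```python
import re
from datetime import datetime, timedelta, timezone

PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"   # date and time
    r"([+-])(\d{2})'(\d{2})'?"                        # offset from UTC
)

def parse_pdf_date(text):
    """Parse a full PDF date string that carries an explicit UTC offset."""
    m = PDF_DATE.fullmatch(text)
    if m is None:
        raise ValueError(f"unsupported PDF date: {text!r}")
    year, month, day, hour, minute, second = (int(g) for g in m.groups()[:6])
    sign = 1 if m.group(7) == "+" else -1
    offset = sign * timedelta(hours=int(m.group(8)), minutes=int(m.group(9)))
    return datetime(year, month, day, hour, minute, second,
                    tzinfo=timezone(offset))

stamp = parse_pdf_date("D:20240208064322-06'00'")
print(stamp.isoformat())
print(stamp.utcoffset().total_seconds() / 3600, "hours from UTC")
```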

Microsoft Word files

I repeated my “Hello world” file experiment with Microsoft Word on an old laptop. When I exported to PDF I got the following.

    /Author John Cook
    /Creator Microsoft® Word 2016
    /CreationDate D:20240208101055-06'00'
    /ModDate D:20240208101055-06'00'
    /Producer Microsoft® Word 2016

So this includes my name. The installation program for Microsoft Office asks for your name, and I must have provided it. Either LibreOffice doesn’t ask or I didn’t enter it.

When I print to PDF rather than export to PDF I get slightly different output.

    /Author John
    /CreationDate D:20240208101220-06'00'
    /ModDate D:20240208101220-06'00'
    /Producer Microsoft: Print To PDF
    /Title Microsoft Word - Document1

LaTeX files

Now let’s look at a PDF created from a LaTeX file. I created a file foo.tex with the following content

    \documentclass{article}
    \begin{document}
    Hello world.
    \end{document}

then compiled it with pdflatex foo.tex. Let’s see what metadata our Python code can find.

    /Producer pdfTeX-1.40.25
    /Creator TeX
    /CreationDate D:20240208075059-06'00'
    /ModDate D:20240208075059-06'00'
    /Trapped /False
    /PTEX.Fullbanner This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023/MacPorts 2023.66589_1) kpathsea version 6.3.5

Obviously the file was created with TeX [1]. You can usually identify TeX files by their appearance. You can make a TeX file look less distinctive by changing the default font and a few other things. But if you did so without changing the metadata, someone could still determine that the file was made using TeX.

I’m not trying to conceal that I use LaTeX. But if you create a PDF with an obscure program, maybe that reveals more than you’d like to reveal.

Operating system

You can see that the file was produced on a Mac. When I compiled the same file on my Linux desktop, it showed the operating system as Debian but was not any more specific.

When you see that a file was created using Microsoft Word, it was probably created on Windows. I don’t have Word on my Mac, but I wouldn’t be surprised if the application was reported to be something like Office for MacOS rather than just Word.

I created a document with Microsoft 365 online and it reported the following.

    /Author John Cook
    /Creator Microsoft Word
    /CreationDate D:20240208084209-08'00'
    /ModDate D:20240208084209-08'00'

The lack of an operating system in the Creator field may indicate that the document was created online. Note that the time zone is −8, i.e. Pacific Standard Time. This isn’t my time zone but the time zone of the server, perhaps in Seattle.

[1] LaTeX is written on top of TeX. The metadata says the file was created with TeX, because ultimately it really was.

Personal information in digital photos

Posted on 17 November 2023 by John

Is it possible to identify the people in the photo above? Maybe. Digital images potentially contain a large amount of metadata that could reveal the photographer’s identity and location. There may also be a surprising number of clues in the photo itself.

EXIF metadata

The standard format for image metadata is EXIF, Exchangeable Image File Format. Some of this information is obviously identifiable, such as fields called CameraOwnerName, Photographer, and ImageEditor. A camera may or may not include such information, and someone may remove this information from photos after they are taken, but it may well be inside the photo.

Similarly, the photo may include information regarding where the photo was taken, such as in the GPSLatitude, GPSLongitude, and GPSAltitude fields. There are also fields for recording when the photo was taken or edited.

A recurring theme in data privacy is that information that is not obviously identifiable may still be used to identify someone. If this data doesn’t do the whole job, it narrows down possibilities to the point that other known information may complete the identification.

For example, the highly technical fields contained in an image could identify the camera equipment. The camera serial number directly identifies the camera, but other fields may indirectly identify the camera.

Similarly, an image without GPS data may still contain indirect location information. For example, there are fields for recording temperature, humidity, and atmospheric pressure. These fields used in combination with timestamps could identify a location, or at least narrow down the set of possible locations.

There are many EXIF fields that are allowed to be arbitrarily long ASCII or Unicode (UTF-8) sequences. A program for editing EXIF data would allow someone to copy the contents of Moby Dick into one of these fields.

The next post describes a similar situation for medical images.

Clues in the photo itself

Stripping EXIF data from an image before making it public is a good idea both for privacy and for size. If a free text field does contain Moby Dick, you could make your image 1.2 MB smaller by removing it.

However, it’s often possible to detect from the photo itself where the photo was taken. I stumbled on a YouTube channel of someone who identifies photos as a hobby. No doubt there are many such people. The host invites people to send in photos and he uses openly available information to track down where they are.

If you strip the precise time and location information from the metadata, someone may be able to infer approximate replacements from clues in the photo itself such as shadows or seasonal vegetation.

Ordinary people have no idea how much location information can be inferred from a photo. Neither do some people who ought to know better. There was a story a few months ago about a photo at a secret military location whose position was inferred from, among other clues, stars that faintly appeared in the sky near dusk.

Update: As noted in the comments, Facebook has a patent on a way to identify people from the pattern of dust on their camera lenses.

Photo by Evgeniy Prokofiev on Unsplash

What can you learn from a phone number?

Posted on 17 November 2023 by John

What can someone learn about you from your phone number?

The answer depends on what other information someone has. Identifiers always depend on context. To a naked man in a tree [1] the phone number doesn’t carry any information. But to someone with a list of names and phone numbers, or some sort of reverse phone number lookup, it might reveal your name.

A while back I wrote about area codes and how they are distributed among states. NANPA publicly posts data that goes into greater detail with central office codes. Using this data, you can look up the first six digits of a phone number and find more specifically where the central office associated with the number is located geographically.

For example, take the phone number 469 863 7090. This is a business phone number, and so you could type it into a search engine and find out exactly whose number it is. But if that weren’t possible, you could look up 469-863 in the NANPA database to find that the number is located in Frisco, Texas. In fact, the number belongs to Sky Rocket Burger. Recommended.
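The lookup itself is simple once you have the data. In the Python sketch below, `CENTRAL_OFFICES` is a hypothetical stand-in for the NANPA data, holding only the one entry from the example above; the function name is mine as well.

```python
import re

# Hypothetical stand-in for the NANPA central office code data.
# The single entry is the example from the text.
CENTRAL_OFFICES = {("469", "863"): "Frisco, Texas"}

def split_nanp_number(number):
    """Return (area code, central office code, line number)."""
    digits = re.sub(r"\D", "", number)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]    # drop the leading country code
    if len(digits) != 10:
        raise ValueError("expected a 10-digit North American number")
    return digits[:3], digits[3:6], digits[6:]

area, office, line = split_nanp_number("469 863 7090")
print(area, office, CENTRAL_OFFICES.get((area, office), "unknown"))
```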

Now people can move around and keep their mobile phone numbers, so any kind of phone lookup may tell you about where someone used to be rather than where they are. That could be even more useful.

[1] A lawyer once told me that his law school professor said that the only thing the interstate commerce clause of the US Constitution doesn’t apply to is a naked man in a tree.

What can you learn from a credit card number?

Posted on 17 April 2023 by John

The first 4 to 6 digits of a credit card number are the bank identification number or BIN. The information needed to decode a BIN is publicly available, with some effort, and so anyone could tell from a credit card number what institution issued it, what bank it draws on, whether it’s a personal or business card, etc.

Suppose your credit card number was exposed in a data breach. Someone makes a suspicious purchase with your card, the issuer contacts you, you cancel the card, and you get a new card from the same source. The number can no longer be used to make purchases on your account, but what information did it leave behind?

The cancelled number might tell someone where you used to bank, which is probably where you still bank. And it may tell them the first few digits of your new card since the new card is issued by the same institution [1]. If the old BIN doesn’t directly reveal your new BIN, it at least narrows down the possibilities.

The information in your BIN, by itself, will not identify you, but it does provide clues that might lead to identifying you when combined with other information.
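The last digit of a card number is a check digit computed from the others with the Luhn algorithm, so changing any single digit elsewhere in the number forces the check digit to change too. Here is a short Python sketch. It uses 4111111111111111, a widely used test number rather than a real account, and treats the first six digits as the BIN.

```python
def luhn_check_digit(payload):
    """Return the Luhn check digit for a digit string lacking its check digit."""
    total = 0
    # Walk right to left, doubling every other digit starting with the rightmost.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9    # same as adding the two digits of the product
        total += d
    return (10 - total % 10) % 10

def is_valid_luhn(number):
    """Check that the last digit matches the Luhn digit of the rest."""
    return luhn_check_digit(number[:-1]) == int(number[-1])

test_number = "4111111111111111"
print("BIN:", test_number[:6])
print("valid:", is_valid_luhn(test_number))

# Change one digit of the payload and recompute the check digit.
changed_payload = "411111111111112"
print("new check digit:", luhn_check_digit(changed_payload))
```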

[1] According to Andrew in the comments, American Express often changes credit card numbers as little as possible when issuing a replacement, changing only one content digit and the checksum.

Computed IDs and privacy implications

Posted on 31 October 2019 by John

Thirty years ago, a lot of US states thought it would be a good idea to compute someone’s driver’s license number (DLN) from their personal information [1]. In 1991, fifteen states simply used your Social Security Number as your DLN. Eleven other states computed DLNs by applying a hash function to personal information such as name, birth date, and sex. A few other states based DLNs in part but not entirely on personal information.

Presumably things have changed a lot since then. If you know of any states that still do this, please let me know in the comments. Even if states have stopped computing DLNs from personal data, I’m sure many organizations still compute IDs this way.

The article I stumbled on from 1991 gave no hint that encoding personal information into an ID number could be a problem. And at the time it wasn’t as much of a problem as it would be now.

Why is it a problem if IDs are computed from personal data? People don’t realize what information they’re giving away. Maybe they would be willing to give someone their personal information, but not their DLN, or vice versa, not realizing that the two are equivalent. They also don’t realize what information about them someone may already have; a little bit more info may be all an attacker needs. And they don’t realize the potential consequences of their loss of privacy.

In some cases the hashing functions were complicated, but not too complicated to carry out by hand. And even if states were applying a cryptographic hash function, which they certainly were not, this would still be a problem for reasons explained here. If you have a database of personal information, say from voter registration records, you could compute the hash value of everyone in the state, or at least a large enough portion that you stand a good chance of being able to reverse a hashed value.
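Here is a minimal Python sketch of that attack. The ID scheme (a truncated SHA-256 of name, birth date, and sex) is hypothetical and is not any state's actual algorithm, and the records are made up. The point is only that when the inputs come from a list you can enumerate, hashing everyone once gives you a reverse lookup table.

```python
import hashlib

def id_from_record(name, birth_date, sex):
    """Hypothetical ID scheme: a truncated SHA-256 of personal data."""
    data = f"{name}|{birth_date}|{sex}".encode("utf-8")
    return hashlib.sha256(data).hexdigest()[:12]

# Made-up records standing in for something like a voter registration file.
known_people = [
    ("Alice Example", "1970-01-01", "F"),
    ("Bob Example", "1985-06-15", "M"),
]

# Hash everyone once, then look IDs up in reverse.
reverse_lookup = {id_from_record(*person): person for person in known_people}

leaked_id = id_from_record("Bob Example", "1985-06-15", "M")
print(reverse_lookup.get(leaked_id))
```

A stronger hash function doesn't help here, because the attacker never inverts the hash; he only recomputes it over plausible inputs.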

[1] Joseph A. Gallian. Assigning Driver’s License Numbers. Mathematics Magazine, Vol. 64, No. 1 (Feb., 1991), pp. 13-22.

No funding for uncomfortable results

Posted on 7 December 2018 by John

In 1997 Latanya Sweeney dramatically demonstrated that supposedly anonymized data was not anonymous. The state of Massachusetts had released data on 135,000 state employees and their families with obvious identifiers removed. However, the data contained zip code, birth date, and sex for each individual. Sweeney was able to cross reference this data with publicly available voter registration data to find the medical records of then Massachusetts governor William Weld.

An estimated 87% of Americans can be identified by the combination of zip code, birth date, and sex. A back-of-the-envelope calculation shows that this should not be surprising, but Sweeney appears to be the first to do this calculation and pursue the results. (Update: See such a calculation in the next post.)
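One way such a back-of-the-envelope calculation might go, as a Python sketch: divide the population evenly among zip codes, count the possible (birth date, sex) combinations, and estimate the chance that nobody else in your zip code shares yours. All the numbers are round assumptions of mine (about 330 million people, about 42,000 zip codes, an 80-year range of ages), and real populations are far from uniform, so treat this as an order-of-magnitude check and not necessarily the calculation referred to above.

```python
import math

population = 330_000_000    # assumption: rough US population
zip_codes = 42_000          # assumption: rough number of US zip codes
years_of_ages = 80          # assumption: birth dates spread over 80 years

people_per_zip = population / zip_codes
bins_per_zip = 365 * years_of_ages * 2    # birth dates times two sexes

# Expected number of other people sharing a given person's bin.
expected_others = (people_per_zip - 1) / bins_per_zip

# Poisson approximation to the chance that the bin holds nobody else.
p_unique = math.exp(-expected_others)

print(f"{people_per_zip:.0f} people per zip code, {bins_per_zip} bins")
print(f"estimated fraction unique: {p_unique:.2f}")
```

With these inputs there are several times more bins than people in a typical zip code, which is why most people end up alone in theirs.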

In her paper “Only You, Your Doctor, and Many Others May Know,” Sweeney says that her research was unwelcome. Over 20 journals turned down her paper on the Weld study, and nobody wanted to fund privacy research that might reach uncomfortable conclusions.

A decade ago, funding sources refused to fund re-identification experiments unless there was a promise that results would likely show that no risk existed or that all problems could be solved by some promising new theoretical technology under development. Financial resources were unavailable to support rigorous scientific studies otherwise.

There’s a perennial debate over whether it is best to make security and privacy flaws public or to suppress them. The consensus, as much as there is a consensus, is that one should reveal flaws discreetly at first and then err on the side of openness. For example, a security researcher finding a vulnerability in Windows would notify Microsoft first and give the company a chance to fix the problem before announcing the vulnerability publicly. In Sweeney’s case, however, there was no single responsible party who could quietly fix the world’s privacy vulnerabilities. Calling attention to the problem was the only way to make things better.
