Uncategorized

Your PDF may reveal more than you intend

When you create a PDF file, what you see is not all you get. Metadata embedded in the file might be useful, but it might also reveal information you’d rather keep to yourself.

The previous post looked at just the time stamp on a file. This post will look at more metadata, focusing on privacy implications.

Inspecting metadata

Here’s a little Python script we’ll use to inspect some of the metadata in a PDF. I say some because this does not pick out everything in every PDF.

    from pypdf import PdfReader

    def print_metadata(filename):
        """Print the document information dictionary of a PDF."""
        print("File: ", filename, "\n")
        reader = PdfReader(filename)
        meta = reader.metadata
        for m in meta:
            print(m, meta[m])

Let’s run this on the “Hello world” example from the previous post.

    File:  humpty.pdf

    /Creator Writer
    /Producer LibreOffice 7.5
    /CreationDate D:20240208064322-06'00'

OK, so this shows that the file was created with LibreOffice Writer, version 7.5.

Time and location

It also shows when the file was written. As I discussed in the previous post, the file was written today at 6:43:22. But what I didn’t comment on before was the -6'00' at the end. This is my time zone, six hours behind GMT, i.e. US Central Standard Time.

Note that the time zone isn’t just time information, it’s also location information. It’s no secret that I live in Houston, but if I didn’t want to reveal my location, this time stamp would partially give away where I live. (Probably. Strictly speaking it reveals the time zone setting on my computer.)
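If you want to work with these date strings programmatically, here’s one way you might parse them, time zone offset and all, using the Python standard library. The helper name is just for illustration, and it assumes the offset is present in the form shown above.

    from datetime import datetime

    def parse_pdf_date(s):
        # Example input: "D:20240208064322-06'00'"
        s = s.replace("D:", "", 1).replace("'", "")
        return datetime.strptime(s, "%Y%m%d%H%M%S%z")

    d = parse_pdf_date("D:20240208064322-06'00'")
    print(d)         # 2024-02-08 06:43:22-06:00
    print(d.tzinfo)  # UTC-06:00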

Microsoft Word files

I repeated my “Hello world” file experiment with Microsoft Word on an old laptop. When I exported to PDF I got the following.

    /Author John Cook
    /Creator Microsoft® Word 2016
    /CreationDate D:20240208101055-06'00'
    /ModDate D:20240208101055-06'00'
    /Producer Microsoft® Word 2016

So this includes my name. The installation program for Microsoft Office asks for your name, and I must have provided it. Either LibreOffice doesn’t ask or I didn’t enter it.

When I print to PDF rather than export to PDF I get slightly different output.

    /Author John
    /CreationDate D:20240208101220-06'00'
    /ModDate D:20240208101220-06'00'
    /Producer Microsoft: Print To PDF
    /Title Microsoft Word - Document1

LaTeX files

Now let’s look at a PDF created from a LaTeX file. I created a file foo.tex with the following content

    \documentclass{article}
    \begin{document}
    Hello world.
    \end{document}

then compiled it with pdflatex foo.tex. Let’s see what metadata our Python code can find.

    /Producer pdfTeX-1.40.25
    /Creator TeX
    /CreationDate D:20240208075059-06'00'
    /ModDate D:20240208075059-06'00'
    /Trapped /False
    /PTEX.Fullbanner This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023/MacPorts 2023.66589_1) kpathsea version 6.3.5

Obviously the file was created with TeX [1]. You can usually identify TeX files by their appearance. You can make a TeX file look less distinctive by changing the default font and a few other things. But if you did so without changing the metadata, someone could still determine that the file was made using TeX.

I’m not trying to conceal that I use LaTeX. But if you create a PDF with an obscure program, maybe that reveals more than you’d like to reveal.

Operating system

You can see that the file was produced on a Mac. When I compiled the same file on my Linux desktop, it showed the operating system as Debian but was not any more specific.

When you see that a file was created using Microsoft Word, it was probably created on Windows. I don’t have Word on my Mac, but I wouldn’t be surprised if the application was reported to be something like Office for MacOS rather than just Word.

I created a document with Microsoft 365 online and it reported the following.

    /Author John Cook
    /Creator Microsoft Word
    /CreationDate D:20240208084209-08'00'
    /ModDate D:20240208084209-08'00'

The lack of an operating system in the Creator field may indicate that the document was created online. Note that the time zone is −8, i.e. Pacific Standard Time. This isn’t my time zone but the time zone of the server, perhaps in Seattle.


[1] LaTeX is written on top of TeX. The metadata says the file was created with TeX, because ultimately it really was.

Bad takes on chaos theory

I just finished reading The Three-Body Problem. At the end of the book is a preview of Cixin Liu’s book Supernova Era. A bit of dialog in that preview stood out to me because it touches on themes I’ve written about before.

“I’ve heard about that. When a butterfly flaps its wings, there’s a hurricane on the other side of the world.”

“That’s right,” Specs said, nodding. “A chaotic system.”

Huahua said, “I want to be that butterfly.”

Specs shook his head again. “You don’t understand at all. We’re all butterflies, just like every butterfly. Every grain of sand and every drop of rain is a butterfly. That’s why the world is unpredictable.”

Most popular interpretations of chaos theory are misguided. Two such misguided interpretations are illustrated in the passage above.

When hearing of chaos theory, many jump to the same conclusion as Huahua in the excerpt above, who wants to be the butterfly that starts a hurricane. They think chaos theory implies that butterfly effects can be engineered. This is the optimistic fallacy.

Sometimes a small deliberate effort can lead to a large intended effect. But chaos theory would not predict this. In fact, reasoning by analogy from chaos theory would suggest this is impossible. More on that here and here.

Another misguided interpretation of chaos theory is the pessimistic fallacy that “the world is unpredictable,” as Specs says above. But we know that’s not true. Some aspects of the world are very predictable. As Orphan Annie says, the sun will come out tomorrow.

Even people who say the world is unpredictable don’t live as if the world were unpredictable. Deep down they know full well that in important ways the world is predictable.

[Image: bell curve meme]

It’s true that not everything is as predictable as we may have imagined, weather being a famous example. Chaos theory was born out of the surprising observation that weather simulations are very sensitive to changes in initial conditions.

We do not live in a world in which we can tickle a particular butterfly in order to deliberately direct the course of the future. But neither do we live in a world without discernible causes and effects.

Straddling checkerboard encryption

Introduction

Computers fundamentally changed cryptography, opening up new possibilities for making and breaking codes. At first it may not have been clear which side benefited most, but now it’s clear that computers gave more power to code makers than code breakers.

We now have cryptographic primitives that cannot be attacked more efficiently than by brute force, as far as we know. The weak link is how these primitives are implemented and combined, not the primitives themselves.

Before computers there was more of a cat and mouse game between encryption and cryptanalysis. Encryption schemes that were convenient to carry out by hand could usually be broken by hand eventually. But if you only needed secrecy briefly, a simple scheme might provide that secrecy for long enough. This post will look at one such scheme, the straddling checkerboard.

Checkerboards

Perhaps the most obvious way to conveniently turn letters into numbers is to arrange the letters into a 5 × 5 grid. This has to leave out one letter, and in practice this meant combining I and J. Or if you needed digits, you could use a 6 × 6 grid and put J back in. You’d scramble the alphabet in the grid according to some key, then encrypt each letter by its coordinates.

       12345
      +-----
     1|EBISP
     2|XWLVN
     3|AOYZQ
     4|MDCKH
     5|RTUFG

This is no better than a simple substitution cipher because someone intercepting a message encrypted this way would easily guess that pairs of digits represent letters. However, if you then permuted the digits with a transposition cipher, you’d have something more formidable. This is essentially what the ADFGVX cipher did, which stumped cryptanalysts for a while.

The straddling checkerboard is a variation on the method above. Letters would be arranged in a 3 × 10 grid rather than 5 × 5. Some letters would be encrypted as a single digit and some as a pair of digits.

       1234567890
      +----------
      |  EBISPXWL
     1|VNAOYZQMDC
     2|KHRTUFGJ./

In the example above, E would be encrypted as 3, N would be encrypted as 12, and so on. This is an instance of a prefix code. In order to be able to decode the digits unambiguously, no letter could be encoded as 1 or 2; these digits always signaled the beginning of a pair.

Prefix codes are often used in non-secret codes, such as country codes for telephone numbers. More examples of prefix codes in this post.

Because 1 and 2 could not be used to encode single letters, there were 28 slots to fill. These could be filled with other symbols, and in practice period and slash were added [1].
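To make the mechanics concrete, here’s a rough Python sketch of encoding and decoding with the example grid above. It’s only an illustration of the scheme, not a historical implementation.

    # Rows of the straddling checkerboard above. A blank row label means the
    # letter is encoded by a single digit; rows 1 and 2 give two-digit codes.
    rows = {
        "":  "  EBISPXWL",
        "1": "VNAOYZQMDC",
        "2": "KHRTUFGJ./",
    }

    encode_table = {}
    for prefix, letters in rows.items():
        for col, ch in zip("1234567890", letters):
            if ch != " ":
                encode_table[ch] = prefix + col
    decode_table = {v: k for k, v in encode_table.items()}

    def encode(text):
        return "".join(encode_table[c] for c in text.upper() if c in encode_table)

    def decode(digits):
        out, i = [], 0
        while i < len(digits):
            width = 2 if digits[i] in "12" else 1  # 1 and 2 always start a pair
            out.append(decode_table[digits[i:i + width]])
            i += width
        return "".join(out)

    print(encode("HELLO"))          # 2230014
    print(decode(encode("HELLO")))  # HELLO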

Efficiency

The straddling checkerboard gives a more efficient encoding than does the checkerboard since typically fewer digits will be required. If efficiency were the only concern, we’d put the eight most frequent letters on the top row, something like the following [2].

       1234567890
      +----------
      |  ETAOINSR
     1|BCDFGHJKLM
     2|PQUVWXYZ./

This would be more efficient but less secure since the arrangement of the letters would be more predictable.

Security

The straddling checkerboard presents a bit of a challenge to the cryptanalyst since it’s not known a priori whether a digit is part of a pair (if the vertical coordinates are not always 1 and 2).

The straddling checkerboard didn’t offer much security even in its day. It would have been better if there had been some further processing done on the digits, such as how the ADFGVX cipher permuted its coordinates.

The message, interpreted as a number N, could have been further encrypted as aN + b where a and b were randomly chosen numbers that were part of the key. As far as I know, nothing like this was ever done. This would have provided more security but would also have required more effort and increased the chance of introducing errors.
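Here’s a toy sketch of that idea, purely for illustration. The leading 1 is a detail added so that a digit string beginning with 0 survives the round trip, and the key values in the example are arbitrary.

    def affine_encrypt(digit_string, a, b):
        # Prepend a 1 so leading zeros in the digit string aren't lost.
        return str(a * int("1" + digit_string) + b)

    def affine_decrypt(number_string, a, b):
        return str((int(number_string) - b) // a)[1:]

    c = affine_encrypt("2230014", a=8675309, b=867)
    print(affine_decrypt(c, a=8675309, b=867))  # 2230014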


[1] David Kahn. The Codebreakers. Chapter 18.

[2] You may have expected the last letter on the first row to be H, going by the printer’s order ETAOIN SHRDLU. Peter Norvig discovered a slightly different order of letter frequencies based on the Google corpus.

Email subscription changes

I will soon be discontinuing the email subscription option for this blog. I recommend that email subscribers switch over to subscribing to the RSS feed for the blog. If you’re unfamiliar with RSS, here is an article on how to get started.

(I recommend RSS in general, and not just for subscribing to this blog. RSS lets you decide what sources you want to receive rather than leave the choice up to someone else’s algorithm.)

Update: I have replaced my email subscription service with a Substack account.

You can also find me on Twitter and on Mastodon at johndcook@mathstodon.xyz.

The Million Dollar Matrix Multiply

The following post is by Wayne Joubert, the newest member of our consulting team. Wayne recently retired from his position as a Senior Computational Scientist at Oak Ridge National Laboratory. — John

Training large language models like GPT-4 costs many millions of dollars in server expenses. These costs are expected to trend to billions of dollars over the next few years [1]. One of the biggest computational expenses of LLM training is multiplying matrices. These are simple operations of the form C = AB. Matrix multiplies are common not only in AI model training but also in many high performance computing applications from diverse science domains.

Eking out more speed from matrix multiplies could reduce AI model training costs by millions of dollars. More routinely, such improvements could reduce training runtime by hours on a single GPU-powered workstation or cut down cloud service provider expenses significantly.

What is less well known is that matrix multiplies run on the graphics processing units (GPUs) typically used for model training exhibit many exotic performance behaviors that can reduce matrix multiply efficiency by a wide margin.

Two recent works [2], [3] examine these phenomena in considerable depth. Factors such as matrix size, alignment of data in memory, power throttling, math library versions, chip-level manufacturing variability, and even the values of the matrix entries can significantly affect performance. At the same time, much of this variability can be modeled by machine learning methods such as decision trees and random forests [2].

Use of these methods can be the first step toward implementing autotuning techniques to minimize costs. Using such methods or carefully applying rules of thumb for performance optimization can make a huge performance difference for matrix multiply-heavy GPU software.
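As a rough illustration of the kind of measurement involved, here’s a minimal sketch that times square matrix multiplies of a few nearby sizes. It uses NumPy on the CPU purely for convenience; the papers cited above use GPU libraries and far more careful methodology, with warm-up runs and many repetitions.

    import time
    import numpy as np

    for n in [1024, 1040, 2048, 2064]:  # nearby sizes can perform quite differently
        A = np.random.rand(n, n).astype(np.float32)
        B = np.random.rand(n, n).astype(np.float32)
        t0 = time.perf_counter()
        C = A @ B
        elapsed = time.perf_counter() - t0
        print(f"n = {n}: {2 * n**3 / elapsed / 1e9:.1f} GFLOP/s")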


[1] What large models cost you—there is no free AI lunch

[2] Wayne Joubert, Eric Palmer and Verónica G. Melesse Vergara, “Matrix Multiply Performance of GPUs on Exascale-class HPE/Cray Systems,” Proceedings of the Cray User Group Meeting (CUG) 2022, https://www.osti.gov/biblio/2224210.

[3] P. Sinha, A. Guliani, R. Jain, B. Tran, M. D. Sinclair and S. Venkataraman, “Not All GPUs Are Created Equal: Characterizing Variability in Large-Scale, Accelerator-Rich Systems,” SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, Dallas, TX, USA, 2022, pp. 01-15, doi: 10.1109/SC41404.2022.00070.

Square root factorial

What factorial is closest to the square root of 2024 factorial?

A good guess would be 1012, based on the idea that √(n!) might be near (n/2)!.

This isn’t correct—the actual answer is 1112—but it’s not wildly off.

Could it be that (2n)! is asymptotically (n!)²?

No, Gauss’ duplication formula

\Gamma(2z) = \frac{1}{\sqrt{2\pi}} 2^{2z - 1/2} \Gamma(z) \Gamma\left(z + \frac{1}{2}\right)

shows that the ratio of (2n)! to (n!)² grows exponentially as a function of n.

However, the ratio only grows exponentially, and factorials grow faster than exponentially. I believe that the value of m minimizing

| √(n!) – m! |

asymptotically approaches n/2. In other words, the inverse factorial of √(n!) approaches n/2. And more generally the inverse factorial of (n!)^(1/k) asymptotically approaches n/k. I haven’t written out a proof, but the plot below shows numerical evidence.
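Here’s a small sketch of one way to gather such numerical evidence with exact integer arithmetic: find the m minimizing |√(n!) − m!| and look at the ratio m/n.

    from math import factorial, isqrt

    def inverse_factorial_of_sqrt(n):
        target = isqrt(factorial(n))  # floor of sqrt(n!), exact
        m, f = 1, 1
        while f < target:             # find the first m with m! >= target
            m += 1
            f *= m
        # Return whichever of m and m - 1 has factorial closer to the target
        return m if f - target < target - f // m else m - 1

    for n in [100, 500, 1000, 2024]:
        m = inverse_factorial_of_sqrt(n)
        print(n, m, m / n)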

So √(n!) is not asymptotically (n/2)!, but the inverse factorial of √(n!) is asymptotically n/2.

See the next post for a way to compute inverse factorials.

Randomize, then humanize

Yesterday I wrote about a way to memorize a random 256-bit encryption key. This isn’t trivial, but it’s doable using memory techniques.

There’s a much easier way to create a memorable encryption key: start with something memorable, then apply a hash function. Why not just do that?

There are two conflicting criteria to satisfy: cryptographic strength and ease of memorization. Any password is a compromise between these two goals.

You get better security if you generate a random password, then try to make it memorable through some technique for making it more palatable to a human mind. You get something easier to remember if you start with something human-friendly and apply some process to make it appear more random.

In a nutshell, you can either randomize then humanize, or humanize then randomize.

Humanize then randomize, or randomize then humanize

You get better ease of use if you humanize then randomize. You get better security if you randomize then humanize.

This morning I ran across a paper by Arnold Reinhold suggesting that people generate 10-letter passwords by first generating 10 random letters, then creating a mnemonic sentence with each word starting with one of the letters. Reinhold says that this leads to a greater variety of passwords than if you were to start with a mnemonic sentence and somehow reduce it to 10 letters. This is an example of the randomize-then-humanize pattern.
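A sketch of Reinhold’s first step might look like the following; making up the mnemonic sentence is still left to the human.

    import secrets
    import string

    # Draw ten random lowercase letters, then invent a sentence whose words
    # start with those letters, in order.
    letters = "".join(secrets.choice(string.ascii_lowercase) for _ in range(10))
    print(letters)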

Why?

There are more possibilities for an attacker to have to explore if you start with random input.

For example, suppose a site requires an 8-character password and you choose an 8-letter word. There are only about 30,000 English words with eight letters [1], and people are far more likely to choose some of these words than others. If you randomly choose a 3-character password using digits and letters (upper case and lower case) there are 62³ = 238,328 possibilities. A three-character random password is far better than an 8-character word.

In Reinhold’s example, there are 26¹⁰ possible passwords made of 10 lowercase letters. That’s over 100 trillion possibilities. There are certainly fewer than 100 trillion pass phrases that humans are likely to come up with. Say you want to use your favorite sentence from a famous book. Suppose there are 1,000 famous books and each has 10,000 sentences. That’s only 10 million possibilities.

Human-generated randomness

People are not that good at generating randomness. Here’s a passage from David Kahn’s book The Codebreakers about the results of asking typists to create pages of random numbers for use in one-time pads.

Interestingly, some pads seem to be produced by typists and not by machines. They show strike-overs and erasures — neither likely to be made by machines. More significant are statistical analyses of the digits. One such pad, for example, has seven times as many groups in which digits in the 1-to-5 group alternate with digits in the 6-to-0 group, like 18293, as a purely random arrangement would have. This suggests that the typist is striking alternately with her left hand (which would type the 1-to-5 group on a Continental machine) and her right (which would type the 6-to-0 group). Again, instead of just half the groups beginning with a low number, which would be expected in a random selection, three quarters of them do, possibly because the typist is spacing with her right hand, then starting a new group with her left. Fewer doubles and triples appear than chance expects.

How hacks work

Websites implicitly use the humanize-then-randomize approach. When you create a password, the site hashes what you type and stores the hashed value. (A naive site might store the actual password.) Then the next time you log in, the site hashes your password input and compares it to the stored value.

If the site is hacked, and the site’s hashing algorithm is known, then many of these passwords can be recovered. This happens routinely. If you apply a hash function to a list of 10,000 common passwords, there are only 10,000 hash values, and you simply search the hacked list for these values. And since people often reuse passwords, someone who knows your password on one site can try that password on another site.
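Here’s a minimal sketch of that kind of attack on unsalted hashes. The tiny wordlist and the choice of SHA-256 are stand-ins, not any particular site’s scheme.

    import hashlib

    common_passwords = ["123456", "password", "qwerty", "letmein"]  # stand-in wordlist

    # Hash the wordlist once, then look up leaked hash values in the table.
    table = {hashlib.sha256(p.encode()).hexdigest(): p for p in common_passwords}

    leaked_hash = hashlib.sha256(b"letmein").hexdigest()  # pretend this came from a breach
    print(table.get(leaked_hash))                         # recovers "letmein"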

If you use a randomly generated password for each site, it’s less likely any individual password will be exposed. And if a password is exposed, a hacker cannot use it on another site.


[1] On my Macbook, grep '^........$' words | wc -l returned 30,001. You’d get different results from searching different word lists, but your results wouldn’t vary too much.

Example of memorizing a 256-bit private key

There are techniques that can enable anyone to memorize much more than may seem possible. This post will show how I generated and memorized a 256-bit encryption key this morning using the approach explained here.

TANSTAAFL

There ain’t no such thing as a free lunch. This saying is abbreviated TANSTAAFL in Heinlein’s novel The Moon is a Harsh Mistress. It takes effort to save effort.

Memorization techniques make it easier to remember new things if you invest effort into the techniques. The more you invest into the techniques, the easier memorizing new things is.

Whether this investment is practical depends on the person. I find it more interesting than practical; I rarely need to memorize anything. But I know of people who have used these techniques to great advantage.

Generating a key

First, I’ll generate a 256-bit number using Python.

    >>> import secrets
    >>> s = secrets.randbits(256)

This produced the following:

    14769232028620221959695310712392700269168526908419649910136349315042507303581

If you were to run the same code you would not get the same number, which is the point of the secrets module. It seeds a random number generator with entropy extracted from the state of the computer it is running on.

Parsing digits

The number above has 77 digits, so I split it into a two-digit number, 14, and 25 three-digit numbers: 769, 232, …, 581. Then I encode each number as a word using the Major mnemonic system and peg the words, in order, to the letters of the NATO phonetic alphabet.

I encode 14 as “tar”, and the NATO word for the letter A is “alpha,” so I imagine an alpha male covered in tar.

I encode 769 as “ketchup,” and the NATO word for B is “bravo,” so I imagine a brave bottle of ketchup, arms akimbo, and wearing a superhero cape.

I encode 581 as “elevator” [1], and the phonetic word for Z is Zulu, and so I imagine Shaka Zulu riding in an elevator.

Using this process I memorized the random number above in a few minutes.
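Here’s a small sketch of the bookkeeping step: split the 77 digits into one 2-digit group and 25 3-digit groups, then peg the groups, in order, to the words of the NATO alphabet.

    nato = ("alpha bravo charlie delta echo foxtrot golf hotel india juliett "
            "kilo lima mike november oscar papa quebec romeo sierra tango "
            "uniform victor whiskey xray yankee zulu").split()

    digits = "14769232028620221959695310712392700269168526908419649910136349315042507303581"

    groups = [digits[:2]] + [digits[i:i + 3] for i in range(2, len(digits), 3)]
    for word, group in zip(nato, groups):
        print(word, group)  # alpha 14, bravo 769, ..., zulu 581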

Just for fun I asked DALL-E 2 to produce an image of “Shaka Zulu, spear in hand, riding in an elevator.” The image below is one of the ones it created.

[Image: Shaka Zulu, spear in hand, riding in an elevator]


[1] Strictly speaking, “elevator” decodes as 5814. But I use a common convention of only considering the first three consonants in a word. This is a good trade-off because it’s not likely you could encode a 4-digit number in a single word.

Top twelve posts of 2023

[Image: Illustrating Euclid Proposition XIII.10]

These were the most popular posts I wrote in 2023.

Privacy and encryption

Geometry

Number theory

Linear algebra

Miscellaneous