If you save a file as PDF twice, you get two different files

If you save a file as a PDF twice, you won’t get exactly the same file both times. To illustrate this, I created an LibreOffice document containing “Hello world.” and saved it twice, first as humpty.pdf then as dumpty.pdf. Then I compared the two files.

    % diff humpty.pdf dumpty.pdf
    Binary files humpty.pdf and dumpty.pdf differ

That’s curious. Both files are the same size, 7260 bytes, but something is different inside.

Next I dumped both files to hexadecimal and compared the output.

.   % diff  <(xxd humpty.pdf) <(xxd dumpty.pdf)

This produced two ranges of differences. Here's the first:

    < 00001a60: ... 3232 ...  064322-06'00')>>
    ---
    > 00001a60: ... 3339 ...  064339-06'00')>>

The files differ in two consecutive bytes. The ASCII representation at the end of the lines shows what these bytes mean. Apparently these two bytes are part of a time stamp. The first was produced at 6:43:22 this morning and the second was produced 17 seconds later at 6:43:39.

There's another block of differences further down the file. I'll leave out the hex representation of the bytes to save space and just include the positions and the ASCII representation.

    < 00001bc0: ...  13 0 R./ID [ <CB
    < 00001bd0: ...  4185E1FB366E0C64
    < 00001be0: ...  D65ADF317ACB6A>.
    < 00001bf0: ...  <CB4185E1FB366E0
    < 00001c00: ...  C64D65ADF317ACB6
    < 00001c10: ...  A> ]./DocChecksu
    < 00001c20: ...  m /59EF0E5B9A2CC
    < 00001c30: ...  4AEC9FD90E7BBE23
    < 00001c40: ...  0CC.>>.startxref
    ---
    > 00001bc0: ...  13 0 R./ID [ <7D
    > 00001bd0: ...  1441609E44A5446A
    > 00001be0: ...  8A0F9A4E96FF49>.
    > 00001bf0: ...  <7D1441609E44A54
    > 00001c00: ...  46A8A0F9A4E96FF4
    > 00001c10: ...  9> ]./DocChecksu
    > 00001c20: ...  m /A7A3CD305537E
    > 00001c30: ...  B3DC35BA5EB4678F
    > 00001c40: ...  EDA.>>.startxref

The text DocChecksum jumps out. This looks like a 32-bit check sum. If I had to guess, I'd say it's probably CRC-32. And apparently there's some sort of 32-bit hash before the checksum: CB4 ...B6A in humpty.pdf and 7D1...F49 in dumpty.pdf. This must be some sort of hash. The hash is repeated twice in each file. Maybe this is some sort of versioning information, and the hash is repeated because the initial and final versions of the file are the same.

The fact that the files were saved 17 seconds apart changed two bytes in the timestamps. But changing these two bytes caused the two 32-byte hash codes to change.

Related posts

2 thoughts on “If you save a file as PDF twice, you get two different files

  1. No man ever creates the same pdf twice, for it’s not the same pdf and he’s not the same man.
    — Heraclitus

Leave a Reply

Your email address will not be published. Required fields are marked *