If you save a file as a PDF twice, you won’t get exactly the same file both times. To illustrate this, I created a LibreOffice document containing “Hello world.” and saved it twice, first as humpty.pdf then as dumpty.pdf. Then I compared the two files.
% diff humpty.pdf dumpty.pdf
Binary files humpty.pdf and dumpty.pdf differ
That’s curious. Both files are the same size, 7260 bytes, but something is different inside.
Next I dumped both files to hexadecimal and compared the output.
% diff <(xxd humpty.pdf) <(xxd dumpty.pdf)
This produced two ranges of differences. Here's the first:
< 00001a60: ... 3232 ... 064322-06'00')>>
---
> 00001a60: ... 3339 ... 064339-06'00')>>
The files differ in two consecutive bytes. The ASCII representation at the end of the lines shows what these bytes mean. Apparently these two bytes are part of a time stamp. The first was produced at 6:43:22 this morning and the second was produced 17 seconds later at 6:43:39.
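The same comparison can be sketched in Python. This is only an illustration, not what was run above: diff_offsets and hhmmss_to_seconds are hypothetical helper names, and the two byte strings are just the ASCII tails visible in the diff output rather than the full files.

```python
def diff_offsets(a: bytes, b: bytes) -> list[int]:
    """Return the positions at which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must be the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def hhmmss_to_seconds(hhmmss: bytes) -> int:
    """Convert an ASCII HHMMSS field to seconds since midnight."""
    text = hhmmss.decode("ascii")
    h, m, s = int(text[0:2]), int(text[2:4]), int(text[4:6])
    return 3600 * h + 60 * m + s


# ASCII tails of the two differing lines, copied from the diff output.
humpty_tail = b"064322-06'00')>>"
dumpty_tail = b"064339-06'00')>>"

offsets = diff_offsets(humpty_tail, dumpty_tail)
elapsed = hhmmss_to_seconds(dumpty_tail[:6]) - hhmmss_to_seconds(humpty_tail[:6])

print(offsets)  # positions within the tail, not file offsets
print(elapsed)  # seconds between the two saves
```

To run this on the real files, one could pass the results of pathlib.Path("humpty.pdf").read_bytes() and pathlib.Path("dumpty.pdf").read_bytes() to diff_offsets, which works here because both files are the same size.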
There's another block of differences further down the file. I'll leave out the hex representation of the bytes to save space and just include the positions and the ASCII representation.
< 00001bc0: ... 13 0 R./ID [ <CB
< 00001bd0: ... 4185E1FB366E0C64
< 00001be0: ... D65ADF317ACB6A>.
< 00001bf0: ... <CB4185E1FB366E0
< 00001c00: ... C64D65ADF317ACB6
< 00001c10: ... A> ]./DocChecksu
< 00001c20: ... m /59EF0E5B9A2CC
< 00001c30: ... 4AEC9FD90E7BBE23
< 00001c40: ... 0CC.>>.startxref
---
> 00001bc0: ... 13 0 R./ID [ <7D
> 00001bd0: ... 1441609E44A5446A
> 00001be0: ... 8A0F9A4E96FF49>.
> 00001bf0: ... <7D1441609E44A54
> 00001c00: ... 46A8A0F9A4E96FF4
> 00001c10: ... 9> ]./DocChecksu
> 00001c20: ... m /A7A3CD305537E
> 00001c30: ... B3DC35BA5EB4678F
> 00001c40: ... EDA.>>.startxref
The text DocChecksum jumps out. Its value is 32 hexadecimal digits long, so this looks like a 128-bit checksum. If I had to guess, I'd say it's probably MD5. And apparently there's another 128-bit value before the checksum, in the /ID entry: CB4...B6A in humpty.pdf and 7D1...F49 in dumpty.pdf. This must be some sort of hash. It appears twice in each file. Maybe this is some sort of versioning information, and the hash is repeated because the initial and final versions of the file are the same.
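Counting hex digits is a quick way to sanity-check a guess about which checksum is in play. The sketch below takes the two values for humpty.pdf as they appear in the dump and compares their size with the output sizes of MD5 and CRC-32 from Python's standard library. It only shows which sizes are consistent. It does not establish which algorithm LibreOffice actually uses.

```python
import hashlib
import zlib

# Values copied from the humpty.pdf side of the dump above.
doc_id = "CB4185E1FB366E0C64D65ADF317ACB6A"
doc_checksum = "59EF0E5B9A2CC4AEC9FD90E7BBE230CC"

# Each hex digit carries 4 bits.
id_bits = 4 * len(doc_id)
checksum_bits = 4 * len(doc_checksum)

# Output sizes of two common algorithms, for comparison.
md5_bits = 8 * hashlib.md5().digest_size
crc32_bits = 4 * len(f"{zlib.crc32(b'Hello world.'):08x}")

print(id_bits, checksum_bits)  # 128 128
print(md5_bits, crc32_bits)    # 128 32
```

Both values are 128 bits wide, which matches the output size of MD5 and is four times the width of a CRC-32.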
The fact that the files were saved 17 seconds apart changed two bytes in the timestamps. But changing these two bytes caused both 128-bit hash codes, each written as 32 hex digits, to change completely.
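A small change in the input scrambling the whole output is typical of cryptographic hash functions. As an illustration only, the sketch below hashes two strings that differ in the same two timestamp digits, using MD5 as a stand-in. Both the choice of MD5 and the input strings are assumptions. The data LibreOffice actually feeds into its hashes is not visible in the files.

```python
import hashlib

# Hypothetical inputs differing only in the two seconds digits.
first = b"Hello world. 064322-06'00'"
second = b"Hello world. 064339-06'00'"

digest1 = hashlib.md5(first).hexdigest()
digest2 = hashlib.md5(second).hexdigest()

# Count the hex digit positions at which the two digests disagree.
changed = sum(c1 != c2 for c1, c2 in zip(digest1, digest2))

print(digest1)
print(digest2)
print(f"{changed} of {len(digest1)} hex digits differ")
```

With a well-behaved hash, most of the 32 hex digits should come out different even though only two input bytes changed.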
I wonder if Pandoc can create identical PDFs – https://pandoc.org/MANUAL.html#reproducible-builds
I’ll have to try it…
No man ever creates the same pdf twice, for it’s not the same pdf and he’s not the same man.
— Heraclitus