Do comments in a LaTeX file change the output?

When you add a comment to a LaTeX file, it makes no visible change to the output. The comment is ignored as far as the appearance of the output is concerned. But is that comment somehow included in the file anyway?

If you compile a LaTeX file to PDF, then edit it by throwing in a comment, and compile again, your two files will differ. As I wrote about earlier, the time that a file is created is embedded in a PDF. That time stamp is also included in two or three hashes, so the files will differ by more than just the bits in the time stamp.

But even if you compile two files at the same time (within the resolution of the time stamp, which is one second), the PDF files will still differ. Apparently some kind of hash of the source file is included in the PDF.

So suppose you have two files. The content of foo.tex is

    \documentclass{article}
    \begin{document}
    Hello world.
    \end{document}

and the content of bar.tex is

    \documentclass{article}
    \begin{document}
    Hello world. % comment
    \end{document}

The output of running pdflatex on both files will look the same.

Suppose you compile the files at the same time so that the time stamps are the same.

    pdflatex foo.tex && pdflatex bar.tex

It’s possible that the two time stamps could be different, one file compiling a little before the tick of a new second and one compiling a little after. But if your computer is fast enough and you don’t get unlucky, the time stamps will be the same.

Then you can compare hex dumps of the two PDF files with

    diff  <(xxd foo.pdf) <(xxd bar.pdf)

This produces the following:

    < ...  ./ID [<F12AF1442
    < ...  E03CC6B3AB64A5D9
    < ... 8DEE2FE> <F12AF1
    < ...  442E03CC6B3AB64A
    < ... 5D98DEE2FE>]./Le
    ---
    > ...  ./ID [<4FAA0E9F1
    > ...  CC6EFCC5068F481E
    > ...  0419AD6> <4FAA0E
    > ...  9F1CC6EFCC5068F4
    > ...  81E0419AD6>]./Le

You can’t recover the comment from the binary dump, but you can tell that the files differ.
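To look at the identifiers directly rather than through a hex dump, here is a minimal Python sketch that pulls the /ID array out of a PDF's raw bytes. It assumes the array appears uncompressed in the file, as it does in the dump above. The names `trailer_ids` and `read_trailer_ids` and the regular expression are illustrative, not anything pdflatex provides.

```python
import re

# Matches an /ID array of two hex strings, e.g. /ID [<F12A...> <F12A...>]
ID_PATTERN = re.compile(
    rb"/ID\s*\[\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*\]"
)

def trailer_ids(pdf_bytes):
    """Return the first /ID pair found in raw PDF bytes, or None if absent."""
    match = ID_PATTERN.search(pdf_bytes)
    if match is None:
        return None
    return match.group(1).decode("ascii"), match.group(2).decode("ascii")

def read_trailer_ids(path):
    """Read a PDF from disk and return its /ID pair (hypothetical helper)."""
    with open(path, "rb") as f:
        return trailer_ids(f.read())
```

Comparing `read_trailer_ids("foo.pdf")` with `read_trailer_ids("bar.pdf")` then shows whether the two identifiers differ without reading the diff by eye.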

I don’t know what hash is being used. My first guess was MD5, but that’s not it. It’s a 128-bit hash, so that rules out newer hashes like SHA256. I tried searching for it but didn’t find anything. If you know what hash pdflatex uses, please let me know.
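One way to test a guess like this is to hash a candidate input and compare the result with the identifier. The sketch below assumes the guess is that the ID is the MD5 digest of the raw bytes of some file, such as the .tex source. That is only one of many possible inputs, and `md5_matches_id` is a hypothetical helper name.

```python
import hashlib

def md5_hex(data):
    """Return the MD5 digest of a bytes object as uppercase hex (32 digits)."""
    return hashlib.md5(data).hexdigest().upper()

def md5_matches_id(candidate, pdf_id_hex):
    """Check whether MD5(candidate bytes) equals an /ID hex string from a PDF."""
    return md5_hex(candidate) == pdf_id_hex.upper()
```

For example, passing the bytes of foo.tex together with `"F12AF1442E03CC6B3AB64A5D98DEE2FE"` tests the first identifier in the dump above. A mismatch only rules out that particular input, not MD5 itself.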

LaTeX will also let you add text at the end of the file, after the \end{document} command. This too changes the hash but not the appearance of the output.


3 thoughts on “Do comments in a LaTeX file change the output?”

  1. In case you’re curious, this is from the PDF specification (1.7, document ID PDF 32000-1:2008):

    File identifiers shall be defined by the optional ID entry in a PDF file’s trailer dictionary (see 7.5.5, “File Trailer”). The ID entry is optional but should be used. The value of this entry shall be an array of two byte strings. The first byte string shall be a permanent identifier based on the contents of the file at the time it was originally created and shall not change when the file is incrementally updated. The second byte string shall be a changing identifier based on the file’s contents at the time it was last updated. When a file is first written, both identifiers shall be set to the same value. If both identifiers match when a file reference is resolved, it is very likely that the correct and unchanged file has been found. If only the first identifier matches, a different version of the correct file has been found.

    To help ensure the uniqueness of file identifiers, they should be computed by means of a message digest algorithm such as MD5…using the following information:
    • The current time
    • A string representation of the file’s location, usually a pathname
    • The size of the file in bytes
    • The values of all entries in the file’s document information dictionary…
    NOTE The calculation of the file identifier need not be reproducible; all that matters is that the identifier is likely to be unique. For example, two implementations of the preceding algorithm might use different formats for the current time, causing them to produce different file identifiers for the same file created at the same time, but the uniqueness of the identifier is not affected.

    I think it likely that pdflatex is indeed using MD5, but you got different results because of how you invoked it vs. how pdflatex is invoking it.

  2. The comments at https://tex.stackexchange.com/questions/229605/reproducible-latex-builds-compile-to-a-file-which-always-hashes-to-the-same-va seem relevant. That suggests you look at \pdftrailerid, which defaults to a value based on input file name and starting date/time.

    It points to the PDF spec 10.4 at https://web.archive.org/web/20140726133949/https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf#G15.2261661 “The value of this entry is an array of two byte strings. The first byte string is a permanent identifier based on the contents of the file at the time it was originally created and does not change when the file is incrementally updated. The second byte string is a changing identifier based on the file’s contents at the time it was last updated.”

    It recommends an MD5 hash of the current time; a string representation of the file’s location, usually a pathname; the size of the file in bytes; and the values of all entries in the file’s document information dictionary.

    The pdf privacy package documentation says “The pdf trailer id is kept by default because it is optional but strongly recommended by the pdf standard. Not including this entry could break some workflows that rely on the trailer ID to uniquely identify files.”

  3. As it turns out, by specifying “\pdftrailerid{}”, any comments added in the tex file no longer impact the content of the resulting PDF file.

    Regarding the hash used by pdflatex, I believe it’s tied to the “trailer id” generated by pdftex. The relevant code resides within the “printID” function, accessible at https://tug.org/svn/texlive/trunk/Build/source/texk/web2c/pdftexdir/utils.c?revision=57915&view=markup#l673.

    Given that Hàn Thế Thành (https://www.tug.org/interviews/thanh.html) is the mastermind behind pdftex, reaching out to him directly might provide insights. I found his email hanthethanh@gmail.com in the github mirror repository named texlive-source.
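The recipe quoted from the PDF specification in the comments above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not what pdftex does: the function name `sketch_file_id`, the string formats, and the order in which the pieces are fed to the hash are all arbitrary choices, and the printID function linked above may combine its inputs differently.

```python
import hashlib

def sketch_file_id(timestamp, pathname, size_bytes, info):
    """Hash the four ingredients the PDF spec suggests into a 128-bit hex ID.

    The serialization here is an assumption made for illustration; the spec
    does not fix a format and says the result need not be reproducible.
    """
    h = hashlib.md5()
    h.update(timestamp.encode("utf-8"))        # the current time
    h.update(pathname.encode("utf-8"))         # the file's location
    h.update(str(size_bytes).encode("ascii"))  # the size of the file in bytes
    for key in sorted(info):                   # document information entries
        h.update(key.encode("utf-8"))
        h.update(info[key].encode("utf-8"))
    return h.hexdigest().upper()
```

With the same time, size, and information entries, changing only the pathname from foo.pdf to bar.pdf gives a different identifier under this sketch.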
