Choosing a Computer Language for a Project

Julia. Scala. Lua. TypeScript. Haskell. Go. Dart. Various computer languages new and old are sometimes proposed as better alternatives to mainstream languages. But compared to mainstream choices like Python, C, C++ and Java (cf. Tiobe Index)—are they worth using?

Certainly it depends a lot on the planned use: is it a one-off small project, or a large industrial-scale software application?

Yet even a one-off project can quickly grow to production-scale, with accompanying growing pains. Startups sometimes face a growth crisis when the nascent code base becomes unwieldy and must be refactored or fully rewritten (or you could do what Facebook/Meta did and just write a new compiler to make your existing code base run better).

The scope of different types of software projects and their requirements is so incredibly diverse that any single viewpoint from experience runs a risk of being myopic and thus inaccurate for other kinds of projects. With this caveat, I’ll share some of my own experience from observing projects for many dozens of production-scale software applications written for leadership-scale high performance computing systems. These are generally on a scale of 20,000 to 500,000 lines of code and often require support of mathematical and scientific libraries and middleware for build support, parallelism, visualization, I/O, data management and machine learning.

Here are some of the main issues regarding choice of programming languages and compilers for these codes:

1. Language and compiler sustainability. While the lifetime of computing systems is measured in years, the lifetime of an application code base can sometimes be measured in decades. Is the language it is written in likely to survive and be well-supported long into the future? For example, Fortran, though still used and widely supported, is a less common language, thus requiring special effort from vendors and drawing on fewer developer resources than more popular languages. Is there a diversity of multiple compilers from different providers to mitigate risk? A single provider means a single point of failure, a high risk; what happens if the supplier loses funding? Are the language and compilers likely to be adaptable for future computer hardware trends (though sometimes this is hard to predict)? Is there a large customer base to help ensure future support? Similarly, is there an adequate pool of available programmers deeply skilled in the language? Does the language have a well-featured standard library ecosystem and good support for third-party libraries and frameworks? Is there good tool support (debuggers, profilers, build tools)?

2. Related to this is the question of language governance. How are decisions about future updates to the language made? Is there broad participation from the user community and responsiveness to their needs? I’ve known members of the C++ language committee; from my limited experience they seem very reasonable and thoughtful about future directions of the language. On the other hand, some standards have introduced features that scarcely anyone ever uses—a waste of time and more clutter for the standard.

3. Productivity. It is said that programmer productivity is limited by the ability of a few lines of code to express high-level abstractions that do a lot of work. Does the language permit this? Is the language standard coherent and cohesive, and does it follow the principle of least surprise? At the same time, the language should not engulf what might better be handled modularly by a library. For example, a matrix-matrix product built into the language might be highly productive for simple cases but have difficulty supporting the many variants of matrix-matrix product provided, for example, by the NVIDIA CUTLASS library. Likewise, in-language support for CUDA GPU operations would make it hard for the language not to lag behind the frequent new releases of CUDA.

4. Strategic advantage. The 10X improvement rule states that an innovation is only worth adopting if it offers a 10X improvement over existing practice by some metric. If switching to a given new language doesn't bring significant improvement, it may not be worth doing, particularly if there is an existing code base of some size. A related question is whether the new language offers an incremental transition path for an existing code base (in many cases this is difficult or impossible).

5. Performance (execution speed). Does the language allow one to get down to bare-metal performance rather than going through costly abstractions or an interpreter layer? Are the features of the underlying hardware exposed for the user to access? Is performance predictable? Can one get a sense of the performance of each line of code just by inspection, or is this occluded by abstractions or a complex compilation process? Do just-in-time compilation or garbage collection kick in unpredictably? This can be a problem for parallel computing, where one process unexpectedly pausing for compilation or collection can cause the others to hang. Do the compiler developers provide good support for effective and accurate code optimization options? Have results from standardized, non-cherry-picked benchmarks been published (kernel benchmarks, proxy apps, full applications)?

Early adopters provide a vibrant “early alert” system for new language and tool developments that are useful for small projects and may be broadly impactful. Python was recognized early in the scientific computing community for its potential complementary use with standard languages for large scientific computations. When it comes to planning large-scale software projects, however, a range of factors based on project requirements must be considered to ensure highest likelihood of success.

 

More is less

When I first started using Unix, I used a program called “more” to read files. The name makes sense because each time you press the space bar, more will show you more of your file, one screen at a time.

Now everyone uses less, and more is all but forgotten.

Daniel Halbert wrote more in 1978. Mark Nudelman wrote a similar program with more functionality in 1984, which he named less. The name was a pun on the aphorism “less is more” [1]. Soon less completely replaced more.

I’m curious why I ever used more, since less had taken over before I touched Unix. One possibility is that someone who was accustomed to more showed me that command. Another possibility is that I learned it from reading The Unix Programming Environment which came out in November 1983. It includes more but not less.

My laptop contains executables for more and less in /usr/bin. The command

    diff less more

returns nothing, indicating that the binaries are identical: less literally is more.

My desktop has distinct binaries for less and more. The more binary is much smaller, and so I assume it is limited to the original functionality of more, more or less.


[1] I don’t know who coined the phrase “less is more,” but it is associated with architect Ludwig Mies van der Rohe (1886–1969) who often quoted it. He did not apply the principle to his own name, however. He was born Ludwig Mies and later appended van der Rohe.

Distance from a point to a line

Eric Lengyel’s new book Projective Geometric Algebra Illuminated arrived yesterday and I’m enjoying reading it. Imagine if someone started with ideas like dot products, cross products, and determinants that you might see in your first year of college, then thought deeply about those things for years. That’s kinda what the book is.

Early in the book is the example of finding the distance from a point q to a line of the form p + tv.

If you define u = q − p then a straightforward derivation shows that the distance d from q to the line is given by

d = \sqrt{\mathbf{u}^2 - \frac{(\mathbf{u}\cdot \mathbf{v})^2}{\mathbf{v}^2}}

But as the author explains, it is better to calculate d by

d = \sqrt{\frac{(\mathbf{u}\times \mathbf{v})^2}{\mathbf{v}^2}}

Why is that? The two expressions are algebraically equal, but the latter is better suited for numerical calculation.

The cardinal rule of numerical calculation is to avoid subtracting nearly equal floating point numbers. If two numbers agree to b bits, you may lose up to b bits of significance when computing their difference.

If u and v are vectors with large magnitude, but q is close to the line, then the first equation subtracts two large, nearly equal numbers under the square root.
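Here is a small C++ sketch (my example, not from the book) that makes the loss concrete. The magnitudes are powers of two so the rounding is reproducible, and the true distance is 0.001.

#include <cmath>
#include <cstdio>

struct Vec3 { double x, y, z; };

double dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

Vec3 cross(Vec3 a, Vec3 b) {
    return { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x };
}

int main() {
    Vec3 v = { 134217728.0, 134217728.0, 0.0 };  // line direction, magnitude ~2^27
    Vec3 u = {  67108864.0,  67108864.0, 1e-3 }; // u = q - p, nearly parallel to v

    // First formula: subtracts two large, nearly equal numbers.
    double d1 = std::sqrt(dot(u, u) - dot(u, v)*dot(u, v)/dot(v, v));

    // Second formula: the cross product removes the parallel component first.
    Vec3 w = cross(u, v);
    double d2 = std::sqrt(dot(w, w)/dot(v, v));

    std::printf("d1 = %g\nd2 = %g\n", d1, d2);  // prints d1 = 0, d2 = 0.001
}

The first formula returns exactly zero: u·u rounds to the same double as (u·v)²/v², and the subtraction wipes out every significant bit. The second formula returns the distance to nearly full precision.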

The second equation involves subtraction too, but it’s less obvious. This is a common theme in numerical computing. Imagine this dialog.

[Student produces first equation.]

Mentor: Avoid subtracting nearly equal numbers.

[Student produces second equation.]

Student: OK, did it.

Mentor: That’s much better, though it could still have problems.

Where is there a subtraction in the second equation? We started with a subtraction in defining u. More subtly, the definition of cross product involves subtractions. But these subtractions involve smaller numbers than the first formula, because the first formula subtracts squared values. Eric Lengyel points this out in his book.

None of this may matter in practice, until it does matter, which is a common pattern in numerical computing. You implement something like the first formula, something that can be derived directly. You implicitly have in mind vectors whose magnitude is comparable to d and this guides your choice of unit tests, which all pass.

Some time goes by and a colleague tells you your code is failing. Impossible! You checked your derivation by hand and in Mathematica. Your unit tests all pass. Must be your colleague’s fault. But it’s not. Your code would be correct in infinite precision, but in an actual computer it fails on inputs that violate your implicit assumptions.

This can be frustrating, but it can also be fun. Implementing equations from a freshman textbook accurately, efficiently, and robustly is not a freshman-level exercise.


Experiences with Thread Programming in Microsoft Windows

Lately I’ve been helping a colleague to add worker threads to his GUI-based Windows application.

Thread programming can be tricky. Here are a few things I’ve learned along the way.

Performance. This app does compute-intensive work. Offloading that work to a worker thread frees the main thread to service GUI requests more promptly.

Thread libraries. Windows has multiple thread libraries, for example Microsoft Foundation Class (MFC) library threads and C++ standard library threads. It is hazardous to use different thread libraries in the same app. In the extreme case, different threading runtimes, such as GOMP and LOMP in the GCC and LLVM compiler families respectively, keep track of threads in different ways, and mixing them in the same code can cause hazardous silent errors.

Memory fences are a thing. Different threads can run on different processor cores and hold variables in different respective L1 caches that are not flushed (to maintain high performance). An update to a variable by one thread is not guaranteed to be visible to other threads without special measures. For example, one can safely transfer information using ::PostMessage coupled with a handler function on the receiving thread. Or one can signal an MFC CEvent on one thread and wait on it with Lock on the other. Also, a thread launch implicitly performs a memory fence, so at least at that point the new thread is guaranteed to see the state of all memory locations correctly.

GUI access should be done from the master thread only, not a worker thread. Accessing the GUI from a worker thread can result in deadlock. A worker thread can instead use ::PostMessage to ask the master thread to do a GUI action.
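For instance, here is a minimal sketch of the pattern; WM_WORK_DONE, DoHeavyComputation, and UpdateProgressDisplay are hypothetical names, and the handler would be wired up with an ON_MESSAGE entry in the main window’s message map:

#define WM_WORK_DONE (WM_APP + 1)  // hypothetical user-defined message

// Worker thread: compute, then post the result to the GUI thread.
UINT myFunc(LPVOID pParam)
{
    HWND hMainWnd = static_cast<HWND>(pParam);  // main window handle, passed at launch
    int result = DoHeavyComputation();          // hypothetical compute routine
    ::PostMessage(hMainWnd, WM_WORK_DONE, static_cast<WPARAM>(result), 0);
    return 0;
}

// Main (GUI) thread: the handler runs here, so touching the GUI is safe.
LRESULT CMainFrame::OnWorkDone(WPARAM wParam, LPARAM)
{
    UpdateProgressDisplay(static_cast<int>(wParam));  // hypothetical GUI update
    return 0;
}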

Thread launch. By default AfxBeginThread returns a pointer to a CWinThread object which MFC takes care of deleting when no longer needed. If you want to manage the object’s life cycle yourself, you can do something like:

myWorkerThreadHandle = AfxBeginThread(myFunc, myParams,
  THREAD_PRIORITY_NORMAL, 0, CREATE_SUSPENDED);  // create suspended
myWorkerThreadHandle->m_bAutoDelete = false;  // take over life cycle management
myWorkerThreadHandle->ResumeThread();  // now start the thread running

Joint use of a shared library like the DAO database library has hazards. Beware of using the library to allocate something in one thread and deallocate it in another: the allocation will likely come from a thread-local heap or stack rather than a shared thread-safe heap, resulting in a crash.

Initialization. One should call CoInitializeEx(NULL, COINIT_APARTMENTTHREADED) and AfxDaoInit() (if using DAO) at thread initialization on both master and worker threads, and correspondingly CoUninitialize() and AfxDaoTerm() at completion of the thread.
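In outline, each thread body brackets its work like this (a sketch of the calls above):

CoInitializeEx(NULL, COINIT_APARTMENTTHREADED);  // per-thread COM initialization
AfxDaoInit();                                    // if using DAO
// ... the thread's actual work ...
AfxDaoTerm();      // tear down DAO for this thread
CoUninitialize();  // tear down COM for this thread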

Monitoring of thread state can be done with WaitForSingleObject: pass a timeout of 0 to check whether the thread has completed, or INFINITE for a blocking wait until it completes.
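For example:

// Non-blocking poll: has the worker finished yet?
if (::WaitForSingleObject(myWorkerThreadHandle->m_hThread, 0) == WAIT_OBJECT_0) {
    // thread has completed
}

// Blocking wait: don't proceed until the worker finishes.
::WaitForSingleObject(myWorkerThreadHandle->m_hThread, INFINITE);

(This is one reason to set m_bAutoDelete to false above: it keeps the CWinThread object, and hence m_hThread, valid even after the thread exits.)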

Race conditions are always a risk but can be avoided by careful reasoning about execution. Someone once said up to 90% of code errors can be found by desk checking [1]. Race conditions are notoriously hard to debug, partly because they can occur nondeterministically. There are tools for trying to find race condition errors, though I’ve never tried them.

So far I have found no rigorous specification of the MFC threading model online that touches on all these concerns. I hope this post is useful to someone else working through these issues.

References

[1] Aristides Dasso and Ana Funes, Verification, Validation and Testing in Software Engineering, Idea Group Pub., 2007, p. 163.

Archiving data on paper

This is a guest post by Ondřej Čertík. Ondřej formerly worked at Los Alamos National Labs and now works for GSI Technologies. He is known in the Python community for starting the SymPy project and in the Fortran community for starting LFortran. — John

 

Ondřej Čertík

I finally got to experiment a bit with archiving data on a regular A4 or US Letter page, using a regular printer to write it and a phone camera to read it. The question has been bothering me for about 10 years: what is the maximum amount of data that we can store on a sheet of paper and reliably retrieve?

My setup

It seems the limit is set by my camera, not my printer. The test image stores 30KB of data. I printed it, took a photo with my iPhone, and then decoded and unpacked it using tar and bzip2. It correctly unpacks into 360KB of C++ files, 6305 lines in all.

I am using the Twibright Optar software.

It uses Golay codes (used by Voyager) that can fix 3 bad bits in each 24-bit codeword (which contains 12 payload bits and 12 parity bits). If there are 4 bad bits, the errors are detected but cannot be corrected. In my experiment there were 6 codewords with 3 bad bits and no codewords with more, so there was no data loss. Most of the bad bits were in the area where my phone cast a shadow on the paper, so retaking the picture in broad daylight might help.

Bad bits

Here are the stats from the optar code:

7305 bits bad from 483840, bit error rate 1.5098%.

49.1855% black dirt, 50.8145% white dirt and 0 (0%) irreparable. Golay stats
===========
0 bad bits      13164
1 bad bit       6693
2 bad bits      297
3 bad bits      6
4 bad bits      0
total codewords 20160
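As a sanity check on the capacity: 20160 codewords × 24 bits = 483840 bits, matching the total above, and the payload is 20160 × 12 bits = 241920 bits = 30240 bytes, which is the 30KB I stored.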

The original setup can store 200KB of data. I tried it, and it prints fine. But the dots are really tiny, and my iPhone camera can’t read them well enough, so nothing is recovered. So I used larger pixels. The 30KB figure is what I was able to store and retrieve, and from the stats above it seems this is about the limit.

Other possibilities

Competing products, such as this one, only store 3-4KB/page, so my experiment above achieves 10x that.

One idea is to use a different error correction scheme. The Golay code above uses only 50% of the bits to store data. If we could use 75% for data, that would give us 50% more capacity.

Another idea is to use colors. With 8 colors we get 3 bits per dot instead of 1, tripling the capacity; with 16 colors we get 4x. That would give us around 100KB/page. Can we do better?

Comparison with other storage techniques

I think the closest to compare is floppy disks. The 5¼-inch floppy disk that I used as a kid could store 180 KB single side, 360KB both sides. The 3½-inch floppy disk could store 720 KB single side or 1.44MB both sides. We can also print on both sides of the paper, so let’s just compare single side for both:

  • Optar original (requires a good scanner): 200 KB
  • My experiment above (iPhone): 30 KB
  • Estimate with 8 colors: 120 KB
  • 5¼-inch floppy disk: 180 KB
  • 3½-inch floppy disk: 720 KB

It seems that the 5¼-inch floppy disk is a good target.

Applications

One application is to store data at the end of a book, so that you don’t need to distribute CDs or floppy disks (as used to be the custom); you just put 10-20 pages in the appendix. It should be possible to decode these quite reliably even 100 years from now.

Another application is archiving any projects that I care about. I still have some floppy disks in my parents’ attic, and I am quite sure they are completely unreadable by now. Meanwhile I also have some paper printouts from the same era, 30 years ago, and they are perfectly readable.

I think the requirement to use a regular iPhone is a good one (mine is a few years old; perhaps the newest one can do better?). If we allow scanners, then of course we can do better, but not many people have a high-quality scanner at home, and there is no limit in that direction: GitHub’s Arctic Vault uses microfilm and high-quality scanners. I want something that can be put into a book or an article on paper, something that anyone can decode with any phone, and that doesn’t require any special treatment to store.

What’s the Best Code Editor?

Emacs, vi, TextEdit, nano, Sublime, Notepad, Wordpad, Visual Studio, Eclipse, etc., etc.—everyone’s got a favorite.

I used Visual Studio previously and liked the integrated debugger. Recently I started using VS again and found the code editing windows rather cluttered. Thankfully you can tone this down, if you can locate the right options.

Eclipse for Java has instantaneous checking for syntax errors. I have mixed feelings about this; I’d rather be able to type a little more code before getting a glaring error message.

Concerning IDEs (integrated development environments) like this—I’ve met people who think that a full GUI-based IDE is the only way to go. Maybe so. However, there’s another view.

You’d think if anyone would know how to write code quickly, accurately and effectively, it would be world-class competitive programmers. They’re the best, right?

One of the very top people is Gennady Korotkevich. He’s won many international competitions.

What does he use? Far Manager, a text-based user interface tool with a mere two panels and a command prompt. It’s modeled on the 1980s pre-GUI file managers that ran under DOS.

It reminds me of a conversation I had with our admin when I was in grad school. I asked, “Why do you use vi instead of MS Word for editing documents?” Answer: “I like vi because it’s faster—your fingers never need to leave the keyboard.”

Admittedly, not all developer workflows would necessarily find this approach optimal. But still it makes you think. Sometimes the conventional answer is not the best one.

Do you have a favorite code editor? Please let us know in the comments.

Natural one-liners

I learned to use Unix in college—this was before Linux—but it felt a little mysterious. Clearly it was developed by really smart people, but what were the problems that motivated their design choices?

Some of the answers are widely repeated. For example, commands have terse names because you may have had to transmit them over a glacial 300 baud connection.

OK, but why are there so many tools for munging text files, for example? That’s great if your job requires munging text files, but what about everything else? What I didn’t realize at the time was that nearly everything involves munging text files, or can be turned into a problem involving munging text files.

Working with data at the command line

There’s an old joke that Unix is user friendly, it’s just picky about who its friends are. I’d rephrase to say Unix makes more sense when you’re doing the kind of work the Unix developers were doing.

I was writing programs when I learned Unix, so some things about Unix made sense at the time. But I didn’t see the motivation for many of the standard command line tools until I started analyzing datasets years later. I thought awk was cool—it was the first scripting language I encountered—but it wasn’t until years later that I realized awk is essentially a command line spreadsheet. It was written for manipulating tabular data stored in text files.

Mythological scripts

Unix one-liners are impressive, but they can seem like a rabbit out of a hat. How would anyone think to do that?

When you develop your own one-liners, one piece at a time, they seem much more natural. You get a feel for how the impressive one-liners you see on display were developed incrementally. They almost certainly did not pop into the world fully formed like Athena springing from the head of Zeus.

Example: Voter registration data

Here’s an example. I was looking at Washington state voter registration data. There’s a file 20240201_VRDB_Extract.txt. What’s in there?

The first line of a data file often contains column headers. Looking at just the first few lines of a file is a perennial task, so there’s a tool for that: head. By default it shows the first 10 lines of a file. We just want to see the first line, and there’s an option for that: -n 1.

> head -n 1 20240201_VRDB_Extract.txt

StateVoterID|FName|MName|LName|NameSuffix|Birthyear|Gender|RegStNum|RegStFrac|RegStName|RegStType|RegUnitType|RegStPreDirection|RegStPostDirection|RegStUnitNum|RegCity|RegState|RegZipCode|CountyCode|PrecinctCode|PrecinctPart|LegislativeDistrict|CongressionalDistrict|Mail1|Mail2|Mail3|MailCity|MailZip|MailState|MailCountry|Registrationdate|LastVoted|StatusCode

Inserting line breaks

OK, those look like column headers, but they’re hard to read. It would be nice if we could replace all the pipe characters used as field separators with line breaks. There’s a command for that too. The sed tool lets you, among other things, replace one string with another. The tiny sed program

s/|/\n/g

does just what we want. It may look cryptic, but it’s very straightforward. The “s” stands for substitute. The program s/foo/bar/ substitutes the first instance of foo with bar. If you want to replace all instances, you tack a “g” on the end for “global.”

Eliminating temporary files

We could save our list of column headings to a file, and then run sed on the output, but that creates an unnecessary temporary file. If you do this very much, you get a lot of temporary files cluttering your working area, say with names like temp1 and temp2. Then after a while you start to forget what you named each intermediate file.

It would be nice if you could connect your processing steps together without having to create intermediate files. And that’s just what pipes do. So instead of saving our list of column headers to a file, we pipe it through to sed.

> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g'

StateVoterID
FName
MName
LName
NameSuffix
Birthyear
...

Scrolling and searching

This is much better. But it produces more output than you may be able to see in your terminal. You can view the list one screenful at a time by piping the output to less.

> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g' | less

This file only has 33 columns, but it’s not uncommon for a data file to have hundreds of columns. Suppose there were more columns than you wanted to scan through, and you wanted to know whether one of the columns contained a zip code. You could do that by piping the output through grep to look for “zip.”

> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g' | grep -i zip

There are no column headings containing “zip,” but there are a couple containing “Zip.” The -i option (for case-insensitive matching) finds the zip code columns.

RegZipCode
MailZip

Our modest little one-liner now has three segments separated by pipes. It might look impressive to someone new to working this way, but it’s really just stringing common commands together in a common way.

A famous one-liner

When you see a more complicated one-liner like

tr -cs A-Za-z '
' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

you can imagine how it grew incrementally. Incidentally, the one-liner above is somewhat famous: it is Doug McIlroy’s word-frequency pipeline, written in response to Donald Knuth’s literate programming solution to the same problem.


Avoiding Multiprocessing Errors in Bash Shell

 

Suppose you have two Linux processes trying to modify a file at the same time and you don’t want them stepping on each other’s work and making a mess.  A common solution is to use a “lock” mechanism (a.k.a. “mutex”). One process “locks the lock” and by this action has sole ownership of a resource in order to make updates, until it unlocks the lock to allow other processes access.

Writing a custom lock in Linux bash shell is tricky. Here’s an example that DOESN’T work right:

#!/bin/bash
let is_locked=1  # helper variable to denote the locked state
mylockvariable=$(cat mylockfile 2>/dev/null)  # read the lock
while [ "$mylockvariable" = "$is_locked" ]  # loop while locked
do
    sleep 5  # wait 5 seconds to try again
    mylockvariable=$(cat mylockfile 2>/dev/null)  # read again
done
echo $is_locked > mylockfile  # lock the lock
# >>> do critical work safely here <<<
# >>> ERROR: NOT SAFE <<<
rm mylockfile  # unlock the lock

Here the lock value is stored in a shared resource, the file “mylockfile”. If the file exists and contains the character “1”, the lock is considered locked; otherwise, it is considered unlocked.  The code will loop until the lock is unlocked, then acquire the lock, do the required single-process work, and then release the lock.

However, this code can fail without warning. Suppose two processes A and B execute the code concurrently, with the lock initially unlocked. Process A reads the lockfile. Suppose that immediately afterward Process A is interrupted, perhaps to give CPU cycles to Process B. Process B begins, reads the lock, locks the lock and starts its critical work. Now suppose Process B is put into a wait state and Process A is restarted. Process A, based on its earlier read of the lockfile, wrongly believes the lock is unlocked, so it too locks the lock and starts the critical work—resulting in a mess.

This is an example of a classic race condition, in which the order of execution of threads or processes can affect the final outcome of execution.

A solution to this conundrum is found in the excellent book Unix Power Tools [1,2]. It is a hefty tome but very accessibly written, and for some people well worth a read-through to pick up a slew of time-saving tips.

The problem with the example code is the need to both read and set the lock in a single, indivisible (atomic) operation. Here’s a trick to do it:

#!/bin/bash
until (umask 222; echo > mylockfile) 2>/dev/null  # check and lock
do  # keep trying if failed
    sleep 5 # wait 5 seconds to try again 
done
# >>> do critical work safely here <<<
rm -f mylockfile  # unlock the lock

Here, the existence of the lockfile itself is the indicator that the lock is set. Setting the umask to 222 creates the lockfile without write permission, so the redirection fails if the file already exists, triggering the loop to keep trying. This works because the check for the file and its creation happen in a single atomic operation guaranteed by the OS and the filesystem. Thus, assuming the system is working correctly, this code is guaranteed to produce the desired behavior.

Race conditions can be a nuisance to find since their occurrence is nondeterministic and can be rare but devastating. Writing correct code for multiple threads of execution can be confusing to those who haven’t done it before. But with experience it becomes easier to reason about correctness and spot such errors.

References:

[1] Peek, Jerry D., Shelley Powers, Tim O’Reilly and Mike Loukides. “Unix Power Tools, Third Edition.” (2002), https://learning.oreilly.com/library/view/unix-power-tools/0596003307/

[2] https://learning.oreilly.com/library/view/unix-power-tools/0596003307/ch36.html#:~:text=Shell%20Lockfile

Do comments in a LaTeX file change the output?

When you add a comment to a LaTeX file, it makes no visible change to the output; the comment is ignored as far as the appearance of the document is concerned. But is the comment somehow included in the output file anyway?

If you compile a LaTeX file to PDF, then edit it by throwing in a comment, and compile again, your two files will differ. As I wrote about earlier, the time that a file is created is embedded in a PDF. That time stamp is also included in two or three hashes, so the files will differ by more than just the bits in the time stamp.

But even if you compile two files at the same time (within the resolution of the time stamp, which is one second), the PDF files will still differ. Apparently some kind of hash of the source file is included in the PDF.

So suppose you have two files. The content of foo.tex is

    \documentclass{article}
    \begin{document}
    Hello world.
    \end{document}

and the content of bar.tex is

    \documentclass{article}
    \begin{document}
    Hello world. % comment
    \end{document}

then the output of running pdflatex on both files will look the same.

Suppose you compile the files at the same time so that the time stamps are the same.

    pdflatex foo.tex && pdflatex bar.tex

It’s possible that the two time stamps could be different, one file compiling a little before the tick of a new second and one compiling a little after. But if your computer is fast enough and you don’t get unlucky, the time stamps will be the same.

Then you can compare hex dumps of the two PDF files with

    diff  <(xxd foo.pdf) <(xxd bar.pdf)

This produces the following

    < ...  ./ID [<F12AF1442
    < ...  E03CC6B3AB64A5D9
    < ... 8DEE2FE> <F12AF1
    < ...  442E03CC6B3AB64A
    < ... 5D98DEE2FE>]./Le
    --
    > ...  ./ID [<4FAA0E9F1
    > ...  CC6EFCC5068F481E
    > ...  0419AD6> <4FAA0E
    > ...  9F1CC6EFCC5068F4
    > ...  81E0419AD6>]./Le

You can’t recover the comment from the binary dump, but you can tell that the files differ.

I don’t know what hash is being used. My first guess was MD5, but that’s not it. It’s a 128-bit hash, so that rules out newer hashes like SHA256. I tried searching for it but didn’t find anything. If you know what hash pdflatex uses, please let me know.

LaTeX will also let you add text at the end of the file, after the \end{document} command. This also will change the hash code but will not change the appearance of the output.


If you save a file as PDF twice, you get two different files

If you save a file as a PDF twice, you won’t get exactly the same file both times. To illustrate this, I created a LibreOffice document containing “Hello world.” and saved it twice, first as humpty.pdf then as dumpty.pdf. Then I compared the two files.

    % diff humpty.pdf dumpty.pdf
    Binary files humpty.pdf and dumpty.pdf differ

That’s curious. Both files are the same size, 7260 bytes, but something is different inside.

Next I dumped both files to hexadecimal and compared the output.

    % diff <(xxd humpty.pdf) <(xxd dumpty.pdf)

This produced two ranges of differences. Here's the first:

    < 00001a60: ... 3232 ...  064322-06'00')>>
    ---
    > 00001a60: ... 3339 ...  064339-06'00')>>

The files differ in two consecutive bytes. The ASCII representation at the end of the lines shows what these bytes mean. Apparently these two bytes are part of a time stamp. The first was produced at 6:43:22 this morning and the second was produced 17 seconds later at 6:43:39.

There's another block of differences further down the file. I'll leave out the hex representation of the bytes to save space and just include the positions and the ASCII representation.

    < 00001bc0: ...  13 0 R./ID [ <CB
    < 00001bd0: ...  4185E1FB366E0C64
    < 00001be0: ...  D65ADF317ACB6A>.
    < 00001bf0: ...  <CB4185E1FB366E0
    < 00001c00: ...  C64D65ADF317ACB6
    < 00001c10: ...  A> ]./DocChecksu
    < 00001c20: ...  m /59EF0E5B9A2CC
    < 00001c30: ...  4AEC9FD90E7BBE23
    < 00001c40: ...  0CC.>>.startxref
    ---
    > 00001bc0: ...  13 0 R./ID [ <7D
    > 00001bd0: ...  1441609E44A5446A
    > 00001be0: ...  8A0F9A4E96FF49>.
    > 00001bf0: ...  <7D1441609E44A54
    > 00001c00: ...  46A8A0F9A4E96FF4
    > 00001c10: ...  9> ]./DocChecksu
    > 00001c20: ...  m /A7A3CD305537E
    > 00001c30: ...  B3DC35BA5EB4678F
    > 00001c40: ...  EDA.>>.startxref

The text DocChecksum jumps out. The checksum is 32 hex characters, i.e. 128 bits, so despite my first instinct it can't be a CRC-32; if I had to guess, I'd say it's probably MD5. And apparently there's another 128-bit hash before the checksum: CB4...B6A in humpty.pdf and 7D1...F49 in dumpty.pdf. That hash is repeated twice in each file. Maybe this is some sort of versioning information, and the hash is repeated because the initial and final versions of the file are the same.

The fact that the files were saved 17 seconds apart changed two bytes in the timestamps. But changing those two bytes caused the two 128-bit hash codes to change.
