Distance from a point to a line

Eric Lengyel’s new book Projective Geometric Algebra Illuminated arrived yesterday and I’m enjoying reading it. Imagine if someone started with ideas like dot products, cross products, and determinants that you might see in your first year of college, then thought deeply about those things for years. That’s kinda what the book is.

Early in the book is the example of finding the distance from a point q to a line of the form p + tv.

If you define u = q − p then a straight-forward derivation shows that the distance d from q to the line is given by

d = \sqrt{{\bold u}^2 - \frac{({\bold u}\cdot {\bold v})^2}{{\bold v}^2}}

But as the author explains, it is better to calculate d by

d = \sqrt{\frac{({\bold u}\times {\bold v})^2}{{\bold v}^2}}

Why is that? The two expressions are algebraically equal, but the latter is better suited for numerical calculation.

The cardinal rule of numerical calculation is to avoid subtracting nearly equal floating point numbers. If two numbers agree to b bits, you may lose up to b bits of significance when computing their difference.

If u and v are vectors with large magnitude, but q is close to the line, then the first equation subtracts two large, nearly equal numbers under the square root.

The second equation involves subtraction too, but it’s less obvious. This is a common theme in numerical computing. Imagine this dialog.

[Student produces first equation.]

Mentor: Avoid subtracting nearly equal numbers.

[Student produces section equation.]

Student: OK, did it.

Mentor: That’s much better, though it could still have problems.

Where is there a subtraction in the second equation? We started with a subtraction in defining u. More subtly, the definition of cross product involves subtractions. But these subtractions involve smaller numbers than the first formula, because the first formula subtracts squared values. Eric Lengyel points this out in his book.

None of this may matter in practice, until it does matter, which is a common pattern in numerical computing. You implement something like the first formula, something that can be derived directly. You implicitly have in mind vectors whose magnitude is comparable to d and this guides your choice of unit tests, which all pass.

Some time goes by and a colleague tells you your code is failing. Impossible! You checked your derivation by hand and in Mathematica. Your unit tests all pass. Must be your colleague’s fault. But it’s not. Your code would be correct in infinite precision, but in an actual computer it fails on inputs that violate your implicit assumptions.

This can be frustrating, but it can also be fun. Implementing equations from a freshman textbook accurately, efficiently, and robustly is not a freshman-level exercise.

Related posts

Experiences with Thread Programming in Microsoft Windows

Lately I’ve been helping a colleague to add worker threads to his GUI-based Windows application.

Thread programming can be tricky. Here are a few things I’ve learned along the way.

Performance. This app does compute-intensive work. It is helpful to offload this very compute-heavy work to a worker thread. Doing this frees the main thread to service GUI requests better.

Thread libraries. Windows has multiple thread libraries, for example Microsoft Foundation Class library threads and C++ standard library threads. It is hazardous to use different thread libraries in the same app. In the extreme case, different thread libraries, such as GOMP  vs. LOMP, used in resp. the GCC and LLVM compiler families, have different threading runtimes which keep track of threads in different ways. Mixing them in the same code can cause hazardous silent errors.

Memory fences are a thing. Different threads can run on different processor cores and hold variables in different respective L1 caches that are not flushed (this to maintain high performance). An update to a variable by one thread is not guaranteed to be visible to other threads without special measures. For example, one could safely transfer information using ::PostMessage coupled with a handler function on the receiver thread. Or one could send a signal using an MFC CEvent on one thread and read its Lock on the other. Also, a thread launch implicitly does a memory fence, so that, at least then, the new thread is guaranteed to correctly see the state of all memory locations.

GUI access should be done from the master thread only, not a worker thread. Doing so can result in deadlock. A worker thread can instead ::PostMessage to ask the master thread to do a GUI action.

Thread launch. By default AfxBeginThread returns a thread handle which MFC takes care of deleting when no longer needed. If you want to manage the life cycle of the handle yourself, you can do something like:

myWorkerThreadHandle = AfxBeginThread(myFunc, myParams,
  THREAD_PRIORITY_NORMAL, 0, CREATE_SUSPENDED);
myWorkerThreadHandle->m_bAutoDelete = false;
myWorkerThreadHandle->ResumeThread();

Joint use of a shared library like the DAO database library has hazards. One should beware of using the library to allocate something in one thread and deallocating in another, as this will likely allocate in a thread-local heap or stack instead of a shared thread-safe heap, this resulting in a crash.

Initialization. One should call CoInitializeEx(NULL, COINIT_APARTMENTTHREADED) and AfxDaoInit() (if using DAO) at thread initialization on both master and worker threads, and correspondingly CoUninitialize() and AfxDaoTerm() at completion of the thread.

Monitoring of thread state can be done with
WaitForSingleObject(myWorkerThreadHandle->m_hThread, 0) to determine if the thread has completed or WaitForSingleObject(myWorkerThreadHandle->m_hThread, INFINITE) for a blocking wait until completion.

Race conditions are always a risk but can be avoided by careful reasoning about execution. Someone once said up to 90% of code errors can be found by desk checking [1]. Race conditions are notoriously hard to debug, partly because they can occur nondeterministically. There are tools for trying to find race condition errors, though I’ve never tried them.

So far I find no rigorous specification of the MFC threading model online that touches on all these concerns. Hopefully this post is useful to someone else working through these issues.

References

[1] Dasso, Aristides., Funes, Ana. Verification, Validation and Testing in Software Engineering. United Kingdom: Idea Group Pub., 2007, p. 163.

Archiving data on paper

This is a guest post by Ondřej Čertík. Ondřej formerly worked at Los Alamos National Labs and now works for GSI Technologies. He is known in the Python community for starting the SymPy project and in the Fortran community for starting LFortran. — John

 

Ondřej Čertík

I finally got to experiment a bit with archiving data on a regular A4 or US Letter page using a regular printer and a phone camera to read it. It’s been bothering me for about 10 years. What is the maximum data size that we can store on the paper and reliably retrieve?

My setup

It seems it is limited by my camera, not my printer. This image stores 30KB of data. I printed it, took a photo with my iPhone, and then decoded and unpacked it using tar and bzip2. It correctly unpacks into 360KB of C++ files, 6305 lines.

I am using the Twibright Optar software.

It uses Golay codes (used by Voyager) that can fix 3 bad bits in each 24-bit code word (which contains 12-bit payload, and 12-bit parity bits). If there are 4 bad bits, the errors are detected, but cannot be corrected. In my experiment, there were 6 codewords which had 3 bad bits, and no codewords with more bad bits, so there was no data loss. It seems most of the bad bits were in the area where my phone cast a shadow on the paper, so possibly retaking the picture in broad daylight might help.

Bad bits

Here are the stats from the optar code:

7305 bits bad from 483840, bit error rate 1.5098%.

49.1855% black dirt, 50.8145% white dirt and 0 (0%) irreparable. Golay stats
===========
0 bad bits      13164
1 bad bit       6693
2 bad bits      297
3 bad bits      6
4 bad bits      0
total codewords 20160

The original setup can store 200KB of data. I tried it, it seems to print fine. But it’s really tiny, and my iPhone camera can’t read it well enough, so nothing is recovered. So I used larger pixels. The 30KB is what I was able to store and retrieve and from the stats above, it seems this is the limit.

Other possibilities

Competing products, such as this one only store 3-4KB/page, so my experiment above is 10x that.

One idea is to use a different error correction scheme. The Golay above only uses 50% to store data. If we could use 75% to store data, that gives us 50% more capacity.

Another idea to improve is to use colors, with 8 colors we get 3x larger capacity, with 16 colors we get 4x more. That would give us around 100KB/page with colors. Can we do better?

Comparison with other storage techniques

I think the closest to compare is floppy disks. The 5¼-inch floppy disk that I used as a kid could store 180 KB single side, 360KB both sides. The 3½-inch floppy disk could store 720 KB single side or 1.44MB both sides. We can also print on both sides of the paper, so let’s just compare single side for both:

  • Optar original (requires a good scanner): 200 KB
  • My experiment above (iPhone): 30 KB
  • Estimate with 8 colors: 120 KB
  • 5¼-inch floppy disk: 180 KB
  • 3½-inch floppy disk: 720 KB

It seems that the 5¼-inch floppy disk is a good target.

Applications

One application is to store this at the end of a book, so that you don’t need to distribute CDs or floppy disks (as used to be the habit), but you just put 10-20 pages to the appendix, this should be possible to decode even 100 years from now quite reliably.

Another application is just archiving any projects that I care about. I still have some floppy disks in my parents’ attic, and I am quite sure they are completely unreadable by now. While I also have some printouts on paper from the same era 30 years ago, and they are perfectly readable.

I think the requirement to use a regular iPhone is good (mine is a few years old, perhaps the newest one can do better?). If we allow scanners, then of course we can do better, but not many people have a high quality scanner at home, and there is no limit to it: GitHub’s Arctic Vault uses microfilm and high quality scanner. I want something that can be put into a book, or article on paper, something that anyone can decode with any phone, and that doesn’t require any special treatment to store.

What’s the Best Code Editor?

Emacs, vi, TextEdit, nano, Sublime, Notepad, Wordpad, Visual Studio, Eclipse, etc., etc.—everyone’s got a favorite.

I used Visual Studio previously and liked the integrated debugger. Recently I started using VS again and found the code editing windows rather cluttered. Thankfully you can tone this down, if you can locate the right options.

Eclipse for Java has instantaneous checking for syntax errors. I have mixed feelings on this. Perhaps you could type a little more code before getting a glaring error message?

Concerning IDEs (integrated development environments) like this—I’ve met people who think that a full GUI-based IDE is the only way to go. Maybe so. However , there’s another view.

You’d think if anyone would know how to write code quickly, accurately and effectively, it would be world-class competitive programmers. They’re the best, right?

One of the very top people is Gennady Korotkevich. He’s won many international competitions.

What does he use? Far Manager, a text-based user interface tool with a mere two panels and command prompt. It’s based on 1980s pre-GUI file manager methodologies that were implemented under DOS.

It reminds me of a conversation I had with our admin when I was in grad school. I asked, “Why do you use vi instead of MS Word for editing documents?” Answer: “I like vi because it’s faster—your fingers never need to leave the keyboard.”

Admittedly, not all developer workflows would necessarily find this approach optimal. But still it makes you think. Sometimes the conventional answer is not the best one.

Do you have a favorite code editor? Please let us know in the comments.

Natural one-liners

I learned to use Unix in college—this was before Linux—but it felt a little mysterious. Clearly it was developed by really smart people, but what were the problems that motivated their design choices?

Some of these are widely repeated. For example, commands have terse names because you may have to transmit commands over a glacial 300 baud network connection.

OK, but why are there so many tools for munging text files, for example? That’s great if your job requires munging text files, but what about everything else. What I didn’t realize at the time was that nearly everything involves munging text files, or can be turned into a problem involving munging text files.

Working with data at the command line

There’s an old joke that Unix is user friendly, it’s just picky about who its friends are. I’d rephrase to say Unix makes more sense when you’re doing the kind of work the Unix developers were doing.

I was writing programs when I learned Unix, so some things about Unix made sense at the time. But I didn’t see the motivation for many of the standard command line tools until I started analyzing datasets years later. I thought awk was cool—it was the first scripting language I encountered—but it wasn’t until years later that I realized awk is essentially a command line spreadsheet. It was written for manipulating tabular data stored in text files.

Mythological scripts

Unix one-liners are impressive, but they can seem like a rabbit out of a hat. How would anyone think to do that?

When you develop your own one liners, one piece at a time, they seem much more natural. You get a feel for how the impressive one-liners you see on display were developed incrementally. They almost certainly did not pop into the world fully formed like Athena springing from the head of Zeus.

Example: Voter registration data

Here’s an example. I was looking at Washington state voter registration data. There’s a file 20240201_VRDB_Extract.txt. What’s in there?

The first line of a data file often contains column headers. Looking at just the first few lines of a file is a perennial task, so there’s a tool for that: head. By default it shows the first 10 lines of a file. We just want to see the first line, and there’s an option for that: -n 1.

> head -n 1 20240201_VRDB_Extract.txt

StateVoterID|FName|MName|LName|NameSuffix|Birthyear|Gender|RegStNum|RegStFrac|RegStName|RegStType|RegUnitType|RegStPreDirection|RegStPostDirection|RegStUnitNum|RegCity|RegState|RegZipCode|CountyCode|PrecinctCode|PrecinctPart|LegislativeDistrict|CongressionalDistrict|Mail1|Mail2|Mail3|MailCity|MailZip|MailState|MailCountry|Registrationdate|LastVoted|StatusCode

Inserting line breaks

OK, those look like column headers, but they’re hard to read. It would be nice if we could replace all the pipe characters used as field separators with line breaks. There’s a command for that too. The sed tool let’s you, among other things, replace one string with another. The tiny sed program

s/|/\n/g

does just what we want. It may look cryptic, but it’s very straight forward. The “s” stands for substitute. The program s/foo/bar/ substitutes the first instance of foo with bar. If you want to replace all instances, you tack on a “g” on the end for “global.”

Eliminating temporary files

We could save our list of column headings to a file, and then run sed on the output, but that creates an unnecessary temporary file. If you do this very much, you get a lot of temporary files cluttering your working area, say with names like temp1 and temp2. Then after a while you start to forget what you named each intermediary file.

It would be nice if you could connect your processing steps together without having to create intermediary files. And that’s just what pipes do. So instead of saving our list of column headers to a file, we pipe it through to sed.

> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g'

FName
MName
LName
NameSuffix
Birthyear
...

Scrolling and searching

This is much better. But it produces more output than you may be able to see in your terminal. You could see the list, one terminal window at a time, by piping the output to less.

> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g' | less

This file only has 33 columns, but it’s not uncommon for a data file to have hundreds of columns. Suppose there were more columns than you wanted to scan through, and you wanted to know whether one of the columns contained a zip code. You could do that by piping the output through grep to look for “zip.”

> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g' | grep -i zip

There are no column headings containing “zip”, but there are a couple containing “Zip.” Adding -i (for case insensitive) finds the zip code columns.

RegZipCode
MailZip

Our modest little one-liner now has three segments separated by pipes. It might look impressive to someone new to working this way, but it’s really just stringing common commands together in a common way.

A famous one-liner

When you see a more complicated one-liner like

tr -cs A-Za-z '
' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

you can imagine how it grew incrementally. Incidentally, the one-liner above is somewhat famous. You can find the story behind it here.

Related posts

Avoiding Multiprocessing Errors in Bash Shell

 

Suppose you have two Linux processes trying to modify a file at the same time and you don’t want them stepping on each other’s work and making a mess.  A common solution is to use a “lock” mechanism (a.k.a. “mutex”). One process “locks the lock” and by this action has sole ownership of a resource in order to make updates, until it unlocks the lock to allow other processes access.

Writing a custom lock in Linux bash shell is tricky. Here’s an example that DOESN’T work right:

#/bin/bash
let is_locked=1 # helper variable to denote locked state
mylockvariable=$(cat mylockfile 2>/dev/null)  # read the lock
while [ "$mylockvariable" != $is_locked ]  # loop until unlocked
do
    sleep 5 # wait 5 seconds to try again 
    mylockvariable=$(cat mylockfile 2>/dev/null)  # read again
done
echo $is_locked > mylockfile  # lock the lock
# >>> do critical work safely here <<<
# >>> ERROR: NOT SAFE <<<
rm mylockfile  # unlock the lock

Here the lock value is stored in a shared resource, the file “mylockfile”. If the file exists and contains the character “1”, the lock is considered locked; otherwise, it is considered unlocked.  The code will loop until the lock is unlocked, then acquire the lock, do the required single-process work, and then release the lock.

However, this code can fail without warning: suppose two processes A and B execute this code concurrently. Initially the lock is in an unlocked state. Process A reads the lockfile. Then suppose immediately after this, Process A is temporarily interrupted, perhaps to give CPU cycles to run Process B. Then, suppose Process B begins, reads the lock, locks the lock and starts doing its critical work. Suppose now Process B is put into wait state and Process A is restarted. Process A, since it previously read the lockfile, wrongly believes the lock is unlocked, thus proceeds to also lock the lock and do the critical work—resulting in a mess.

This is an example of a classic race condition, in which the order of execution of threads or processes can affect the final outcome of execution.

A solution to this conundrum is found in the excellent book, Unix Power Tools [1,2]. This is a hefty tome but very accessibly written, for some people well worth a read-through to pick up a slew of time-saving tips.

The problem with the example code is the need to both read and set the lock in a single, indivisible (atomic) operation. Here’s a trick to do it:

#/bin/bash
until (umask 222; echo > mylockfile) 2>/dev/null  # check and lock
do  # keep trying if failed
    sleep 5 # wait 5 seconds to try again 
done
# >>> do critical work safely here <<<
rm -f mylockfile  # unlock the lock

Here, the existence of the lockfile itself is the indicator that the lock is set. Setting the umask makes this file creation fail if the file already exists, triggering the loop to activate to keep trying. This works because the existence of a file can either be true or false and nothing else; the existence of a file is guaranteed atomicity by the OS and the filesystem. Thus, assuming the system is working correctly, this code is guaranteed to produce the desired behavior.

Race conditions can be a nuisance to find since their occurrence is nondeterministic and can be rare but devastating. Writing correct code for multiple threads of execution can be confusing to those who haven’t done it before. But with experience it becomes easier to reason about correctness and spot such errors.

References:

[1] Peek, Jerry D., Shelley Powers, Tim O’Reilly and Mike Loukides. “Unix Power Tools, Third Edition.” (2002), https://learning.oreilly.com/library/view/unix-power-tools/0596003307/

[2] https://learning.oreilly.com/library/view/unix-power-tools/0596003307/ch36.html#:-:text=Shell%20Lockfile

Do comments in a LaTeX file change the output?

When you add a comment to a LaTeX file, it makes no visible change to the output. The comment is ignored as far as the appearance of the file. But is that comment somehow included in the file anyway?

If you compile a LaTeX file to PDF, then edit it by throwing in a comment, and compile again, your two files will differ. As I wrote about earlier, the time that a file is created is embedded in a PDF. That time stamp is also included in two or three hashes, so the files will differ by more than just the bits in the time stamp.

But even if you compile two files at the same time (within the resolution of the time stamp, which is one second), the PDF files will still differ. Apparently some kind of hash of the source file is included in the PDF.

So suppose you have two files. The content of foo.tex is

    \documentclass{article}
    \begin{document}
    Hello world.
    \end{document}

and the content of bar.tex is

    \documentclass{article}
    \begin{document}
    Hello world. % comment
    \end{document}

then the output of running pdflatex on both files will look the same.

Suppose you compile the files at the same time so that the time stamps are the same.

    pdflatex foo.tex && pdflatex bar.tex

It’s possible that the two time stamps could be different, one file compiling a little before the tick of a new second and one compiling a little after. But if your computer is fast enough and you don’t get unlucky, the time stamps will be the same.

Then you can compare hex dumps of the two PDF files with

    diff  <(xxd foo.pdf) <(xxd bar.pdf)

This produces the following

    < ...  ./ID [<F12AF1442
    < ...  E03CC6B3AB64A5D9
    < ... 8DEE2FE> <F12AF1
    < ...  442E03CC6B3AB64A
    < ... 5D98DEE2FE>]./Le
    --
    > ...  ./ID [<4FAA0E9F1
    > ...  CC6EFCC5068F481E
    > ...  0419AD6> <4FAA0E
    > ...  9F1CC6EFCC5068F4
    > ...  81E0419AD6>]./Le

You can’t recover the comment from the binary dump, but you can tell that the files differ.

I don’t know what hash is being used. My first guess was MD5, but that’s not it. It’s a 128-bit hash, so that rules out newer hashes like SHA256. I tried searching for it but didn’t find anything. If you know what hash pdflatex uses, please let me know.

LaTeX will also let you add text at the end of the file, after the \end{document} command. This also will change the hash code but will not change the appearance of the output.

Related posts

If you save a file as PDF twice, you get two different files

If you save a file as a PDF twice, you won’t get exactly the same file both times. To illustrate this, I created an LibreOffice document containing “Hello world.” and saved it twice, first as humpty.pdf then as dumpty.pdf. Then I compared the two files.

    % diff humpty.pdf dumpty.pdf
    Binary files humpty.pdf and dumpty.pdf differ

That’s curious. Both files are the same size, 7260 bytes, but something is different inside.

Next I dumped both files to hexadecimal and compared the output.

.   % diff  <(xxd humpty.pdf) <(xxd dumpty.pdf)

This produced two ranges of differences. Here's the first:

    < 00001a60: ... 3232 ...  064322-06'00')>>
    ---
    > 00001a60: ... 3339 ...  064339-06'00')>>

The files differ in two consecutive bytes. The ASCII representation at the end of the lines shows what these bytes mean. Apparently these two bytes are part of a time stamp. The first was produced at 6:43:22 this morning and the second was produced 17 seconds later at 6:43:39.

There's another block of differences further down the file. I'll leave out the hex representation of the bytes to save space and just include the positions and the ASCII representation.

    < 00001bc0: ...  13 0 R./ID [ <CB
    < 00001bd0: ...  4185E1FB366E0C64
    < 00001be0: ...  D65ADF317ACB6A>.
    < 00001bf0: ...  <CB4185E1FB366E0
    < 00001c00: ...  C64D65ADF317ACB6
    < 00001c10: ...  A> ]./DocChecksu
    < 00001c20: ...  m /59EF0E5B9A2CC
    < 00001c30: ...  4AEC9FD90E7BBE23
    < 00001c40: ...  0CC.>>.startxref
    ---
    > 00001bc0: ...  13 0 R./ID [ <7D
    > 00001bd0: ...  1441609E44A5446A
    > 00001be0: ...  8A0F9A4E96FF49>.
    > 00001bf0: ...  <7D1441609E44A54
    > 00001c00: ...  46A8A0F9A4E96FF4
    > 00001c10: ...  9> ]./DocChecksu
    > 00001c20: ...  m /A7A3CD305537E
    > 00001c30: ...  B3DC35BA5EB4678F
    > 00001c40: ...  EDA.>>.startxref

The text DocChecksum jumps out. This looks like a 32-bit check sum. If I had to guess, I'd say it's probably CRC-32. And apparently there's some sort of 32-bit hash before the checksum: CB4 ...B6A in humpty.pdf and 7D1...F49 in dumpty.pdf. This must be some sort of hash. The hash is repeated twice in each file. Maybe this is some sort of versioning information, and the hash is repeated because the initial and final versions of the file are the same.

The fact that the files were saved 17 seconds apart changed two bytes in the timestamps. But changing these two bytes caused the two 32-byte hash codes to change.

Related posts

Is Low Precision Arithmetic Safe?

The popularity of low precision arithmetic for computing has exploded since the 2017 release of the Nvidia Volta GPU. The half precision tensor cores of Volta offered a massive 16X performance gain over double precision for key operations. The “race to the bottom” for lower precision computations continues: some have even solved significant problems using 1-bit precision arithmetic hardware ([1], [2]). And hardware performance is getting even better: the Nvidia H100 tensor core-enabled FP16 is a full 58X faster than standard FP64, and 1-bit precision is yet another 16X faster than this, for total speedup of over 900X for algorithms that can use it [3].

This eye-popping speedup certainly draws attention. However, in scientific computing, low precision arithmetic has typically been seen as unsafe for modeling and simulation codes. Indeed, lower precision can sometimes be used to advantage [4], commonly in a “mixed precision” setting in which only parts of the calculation are done in low precision. However, in general anything less than double precision is considered inadequate to model complex physical phenomena with fidelity (see, e.g., [5]).

In response, developers have created tools to measure the safety of reduced precision arithmetic in application codes [6]. Some tools can even identify which variables or arrays can be safely demoted to lower precision without loss of accuracy in the final result. However, use of these tools in a blind fashion, not backed by some kind of reasoning process, can be hazardous.

An example will illustrate this. The conjugate gradient method for linear system solving and optimization [7] and the closely related Lanczos method for eigenvalue problem solving [8] showed great promise following their invention in the early 1950s. However, they were considered unsafe due to catastrophic roundoff errors under floating point arithmetic—even more pronounced as floating point precision is reduced. Nonetheless, Chris Paige showed in his pioneering work in the 1970s [9] that the roundoff error, though substantial, did not preclude the usefulness of the methods when properly used. The conjugate gradient method has gone on to become a mainstay in scientific computing.

Notice that no tool could possibly arrive at this finding, without a careful mathematical analysis of the methods. A tool would detect inaccuracy in the calculation but could not certify that these errors could cause no harm to the final result.

Some might propose instead a purely data-driven approach: just try low precision on some test cases, if it works then use low precision in production. This approach is fraught with peril, however: the test cases may not capture all situations that could be encountered in production.

For example, one might test an aerodynamics code only on smooth flow regimes, but production runs may encounter complex flows with steep gradients—that low precision arithmetic cannot correctly model. Academic papers that test low precision methods and tools must rigorously evaluate in challenging real-world scenarios like this.

Sadly, computational science teams frequently don’t have the time to evaluate their codes for potential use of lower precision arithmetic. Tools could certainly help. Also, libraries that encapsulate mixed precision methods can provide benefits to many users. A great success story here is mixed precision dense linear solvers, founded on the solid theoretical work of Nick Highnam and colleagues [10], which has found its way into libraries such as [11].

So the final answer is, “it depends.” Each new case must be looked at carefully, and a determination made based on some combination of analysis and testing.

References

[1] Zhang, Y., Garg, A., Cao, Y., Lew, Ł., Ghorbani, B., Zhang, Z. and Firat, O., 2023. Binarized Neural Machine Translation. arXiv preprint arXiv:2302.04907, https://openreview.net/forum?id=XAyPlfmWpu

[2] Lagergren, J., Cashman, M., Vergara, V.G.M., Eller, P.R., Gazolla, J.G.F.M., Chhetri, H.B., Streich, J., Climer, S., Thornton, P., Joubert, W. and Jacobson, D., 2023. Climatic clustering and longitudinal analysis with impacts on food, bioenergy, and pandemics. Phytobiomes Journal, 7(1), pp.65-77, https://apsjournals.apsnet.org/doi/10.1094/PBIOMES-02-22-0007-R.

[3] “NVIDIA H100 Tensor Core GPU Datasheet,” https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet.

[4] G. Alvarez et al., “New algorithm to enable 400+ TFlop/s sustained performance in simulations of disorder effects in high-Tc superconductors,” SC ’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, Austin, TX, USA, 2008, pp. 1-10, doi: 10.1109/SC.2008.5218119.

[5] Spafford, K., Meredith, J., Vetter, J., Chen, J., Grout, R., Sankaran, R. (2010). Accelerating S3D: A GPGPU Case Study. In: Lin, HX., et al. Euro-Par 2009 – Parallel Processing Workshops. Euro-Par 2009. Lecture Notes in Computer Science, vol 6043. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14122-5_16.

[6] “Mixed precision analysis tools,” https://scholar.google.com/scholar?q=mixed+precision+analysis+tools

[7] Hestenes, M.R. and Stiefel, E., 1952. Methods of conjugate gradients for solving linear systems. Journal of research of the National Bureau of Standards49(6), pp.409-436, https://nvlpubs.nist.gov/nistpubs/jres/049/jresv49n6p409_a1b.pdf.

[8] Cornelius Lanczos, An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators, Journal of Research of the National Bureau of Standards Vol. 45, No. 4, October 1950, https://nvlpubs.nist.gov/nistpubs/jres/045/4/V45.N04.A01.pdf.

[9] Paige, Christopher C.. “The computation of eigenvalues and eigenvectors of very large sparse matrices.” (1971), https://www.cs.mcgill.ca/~chris/pubClassic/PaigeThesis.pdf.

[10] Higham, N.J., Pranesh, S. and Zounon, M., 2019. Squeezing a matrix into half precision, with an application to solving linear systems. SIAM Journal on Scientific Computing41(4), pp.A2536-A2551, https://epubs.siam.org/doi/abs/10.1137/18M1229511.

[11] Lu, Hao; Matheson, Michael; Wang, Feiyi; Joubert, Wayne; Ellis, Austin; Oles, Vladyslav. “OpenMxP-Opensource Mixed Precision Computing,” https://www.osti.gov/biblio/1961398.

Regex to match SWIFT-BIC codes

A SWIFT-BIC number identifies a bank, not a particular bank account. The BIC part stands for Bank Identifier Code.

I had to look up the structure of SWIFT-BIC codes recently, and here it is:

  • Four letters to identify the bank
  • Two letters to identify the country
  • Two letters or digits to identify the location
  • Optionally, three letters or digits to identify a branch

Further details are given in the ISO 9362 standard.

We can use this as an example to illustrate several regular expression features, and how regular expressions are used in practice.

Regular expressions

If your regular expression flavor supports listing a number of repetitions in braces, you could write the above format as

    [A-Z]{6}[A-Z0-9]{2,5}

This would work, for example, with egrep but not with grep. YMMV.

That’s concise, but a little too permissive. It allows anywhere from 2 to 5 alphanumeric characters on the end. But the standard says 2 or 5 alphanumeric characters after the country code, not between 2 and 5. For example, 3 characters after the country code would no be valid. So we could reduce our false positive rate a little by changing the regex to

    [A-Z]{6}[A-Z0-9]{2}([A-Z0-9]{3})?$

Without the dollar sign on the end, ABCDEF12X would still match because the part of the regex up to the optional ([A-Z0-9]{3})? at the end would match at the beginning of the string. The dollar sign marks the end of the string, so it says the code has to end either after 8 or 11 characters and stop.

If your regex flavor does not support counts in braces, you could spell everything out:

    [A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z0-9][A-Z0-9]([A-Z0-9]{3})?$

Convenience versus accuracy

If you want to match only valid SWIFT-BIC codes, you can get perfect accuracy by checking against an exhaustive list of SWIFT-BIC codes. You could even write a regular expression that matches codes on this list and only codes on the list, but what would the point be? Regular expressions usually tradeoff convenience for accuracy.

I don’t have a list of all valid SWIFT-BIC codes. If I did, it might be out of date by the time I download it. But if I’m trying to pull bank codes out of a text file, the regex

    [A-Z]{6}[A-Z0-9]{2}([A-Z0-9]{3})?$

is likely to do a pretty good job. Regular expressions are usually used in a context where there’s some tolerance for error. Maybe you use a regular expression to do a first pass, then weed out the mismatches with a manual review.

Capturing parts

Maybe you want to do more than just find SWIFT codes. Maybe you want to look at their pieces.

For example, the fifth and sixth characters of a SWIFT code are the ISO 3166 two-letter abbreviation for the country the bank is in. (With one exception: XR represents Kosovo, which does not have an ISO 3166 code.)

You could replace

    [A-Z]{6}

at the front of the regular expression with

    [A-Z]{4}([A-Z]{2})

which will not change which strings match, but it will store the fifth and sixth characters as the first captured group. How you access captured group varies between various regular expression implementations.

Legibility

The first proposed regular expression

    [A-Z]{6}[A-Z0-9]{2,5}

is easy to read, at least in my opinion. It has grown over the course of this post to

    [A-Z]{4}([A-Z]{2})[A-Z0-9]{2}([A-Z0-9]{3})?$

which is not as easy to read. This is typical: you start with a quick-and-dirty regular expression, the refine it until it meets your needs. Regular expressions tend to get uglier as they become more precise.

There are ways to make regular expressions more readable by using something like the /x modifier in Perl, which lets you insert white space and comments inside a regular expression.

That’s nice, but it’s also a little odd. If you’re going to use a complicated regular expression in production code, then you should format it nicely and add comments. But then you have to ask why you’re using a complicated regular expression in production code. I’m not saying this is never appropriate, but it’s not the most common use case.

I could imagine using a simple regular expression when you want quick and dirty, and using an exhaustive list of SWIFT codes in production. A complex, well-commented regular expression seems to fall into a sort of no man’s land in between.