Regex to match SWIFT-BIC codes

A SWIFT-BIC number identifies a bank, not a particular bank account. The BIC part stands for Bank Identifier Code.

I had to look up the structure of SWIFT-BIC codes recently, and here it is:

  • Four letters to identify the bank
  • Two letters to identify the country
  • Two letters or digits to identify the location
  • Optionally, three letters or digits to identify a branch

Further details are given in the ISO 9362 standard.

We can use this as an example to illustrate several regular expression features, and how regular expressions are used in practice.

Regular expressions

If your regular expression flavor supports listing a number of repetitions in braces, you could write the above format as

    [A-Z]{6}[A-Z0-9]{2,5}

This would work, for example, with egrep but not with grep. YMMV.

That’s concise, but a little too permissive. It allows anywhere from 2 to 5 alphanumeric characters on the end. But the standard says 2 or 5 alphanumeric characters after the country code, not anything between 2 and 5. For example, 3 characters after the country code would not be valid. So we could reduce our false positive rate a little by changing the regex to

    [A-Z]{6}[A-Z0-9]{2}([A-Z0-9]{3})?$

Without the dollar sign on the end, ABCDEF12X would still match because the part of the regex up to the optional ([A-Z0-9]{3})? at the end would match at the beginning of the string. The dollar sign marks the end of the string, so it requires the code to end after either 8 or 11 characters.
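Here is a minimal Python sketch of using this regex to validate whole strings. The candidate codes below are made up for illustration, and a caret is added so the match has to start at the beginning of the string as well.

    import re

    # Anchored at both ends so the whole string must be a SWIFT-BIC code.
    swift_bic = re.compile(r"^[A-Z]{6}[A-Z0-9]{2}([A-Z0-9]{3})?$")

    # Made-up candidates: 8 and 11 characters match, 9 characters do not.
    for candidate in ["ABCDEF12", "ABCDEF12345", "ABCDEF12X"]:
        print(candidate, bool(swift_bic.match(candidate)))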

If your regex flavor does not support counts in braces, you could spell everything out:

    [A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z0-9][A-Z0-9]([A-Z0-9]{3})?$

Convenience versus accuracy

If you want to match only valid SWIFT-BIC codes, you could get perfect accuracy by checking against an exhaustive list of SWIFT-BIC codes. You could even write a regular expression that matches codes on this list and only codes on the list, but what would be the point? Regular expressions usually trade some accuracy for convenience.

I don’t have a list of all valid SWIFT-BIC codes. If I did, it might be out of date by the time I download it. But if I’m trying to pull bank codes out of a text file, the regex

    [A-Z]{6}[A-Z0-9]{2}([A-Z0-9]{3})?$

is likely to do a pretty good job. Regular expressions are usually used in a context where there’s some tolerance for error. Maybe you use a regular expression to do a first pass, then weed out the mismatches with a manual review.

Capturing parts

Maybe you want to do more than just find SWIFT codes. Maybe you want to look at their pieces.

For example, the fifth and sixth characters of a SWIFT code are the ISO 3166 two-letter abbreviation for the country the bank is in. (With one exception: XK represents Kosovo, which does not have an ISO 3166 code.)

You could replace

    [A-Z]{6}

at the front of the regular expression with

    [A-Z]{4}([A-Z]{2})

which will not change which strings match, but it will store the fifth and sixth characters as the first captured group. How you access captured groups varies between regular expression implementations.
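In Python, for example, you would read the group off the match object. Here’s a sketch; the string DEUTDEFF500 is just an example of the right shape.

    import re

    # The first captured group holds the two-letter country code.
    pattern = re.compile(r"^[A-Z]{4}([A-Z]{2})[A-Z0-9]{2}([A-Z0-9]{3})?$")

    m = pattern.match("DEUTDEFF500")   # example string of the right shape
    if m:
        print(m.group(1))              # prints DE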

Legibility

The first proposed regular expression

    [A-Z]{6}[A-Z0-9]{2,5}

is easy to read, at least in my opinion. It has grown over the course of this post to

    [A-Z]{4}([A-Z]{2})[A-Z0-9]{2}([A-Z0-9]{3})?$

which is not as easy to read. This is typical: you start with a quick-and-dirty regular expression, then refine it until it meets your needs. Regular expressions tend to get uglier as they become more precise.

There are ways to make regular expressions more readable by using something like the /x modifier in Perl, which lets you insert white space and comments inside a regular expression.
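Python’s analog of Perl’s /x is the re.VERBOSE flag. Here’s a sketch of what the final regex might look like written that way:

    import re

    pattern = re.compile(r"""
        [A-Z]{4}         # bank code
        ([A-Z]{2})       # country code (captured)
        [A-Z0-9]{2}      # location code
        ([A-Z0-9]{3})?   # optional branch code
        $
        """, re.VERBOSE)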

That’s nice, but it’s also a little odd. If you’re going to use a complicated regular expression in production code, then you should format it nicely and add comments. But then you have to ask why you’re using a complicated regular expression in production code. I’m not saying this is never appropriate, but it’s not the most common use case.

I could imagine using a simple regular expression when you want quick and dirty, and using an exhaustive list of SWIFT codes in production. A complex, well-commented regular expression seems to fall into a sort of no man’s land in between.

New Ways To Make Code Run Faster


The news from Meta last week is a vivid reminder of the importance of making code run faster and more power-efficiently. Meta intends to purchase 350,000 Nvidia H100 GPUs this year [1]. Assuming a 350 W TDP [2] and an average US energy cost of $0.1621 per kWh [3], one expects a figure of roughly $174 million per year in electricity expenses just to power the GPUs (a rough estimate of the actual cost). For this and many other datacenter-scale and real-time critical applications, every bit of increased performance can be impactful.
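As a sanity check, here is the arithmetic behind that figure as a few lines of Python, using the rough inputs cited above:

    # Back-of-envelope estimate of the annual electricity cost quoted above.
    gpus = 350_000          # H100 GPUs [1]
    kw_per_gpu = 0.350      # 350 W TDP [2]
    usd_per_kwh = 0.1621    # average US energy cost [3]
    hours_per_year = 24 * 365

    cost = gpus * kw_per_gpu * usd_per_kwh * hours_per_year
    print(f"${cost / 1e6:.0f} million per year")   # about $174 million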

Many approaches can be taken to making code run faster. The excellent book Hacker’s Delight contains many tricks for speeding up low-level code kernels [4]. Also, superoptimization techniques find the very fastest performing implementations of short, loop-free code kernels by exhaustive search [5].

Since exhaustive search scales exponentially with code size, it’s no surprise that other methods have been tried. Recent work with reinforcement learning reduces the number of scalar multiples needed for matrix products, discovering new Strassen-like algorithms [6]. Other work focuses more on hardware design; for example, PrefixRL optimizes chip design for targeted problems using reinforcement learning [7].

Finding the absolute fastest code for a given task is in general NP-complete [8], putting it out of reach of tractable exact solution by such methods. However, phenomenal progress in SAT/SMT solver efficiency for NP-complete problems has been made over the last 20 years (someone has even quipped, “NP is the new P” [9]). And indeed, these methods have been applied to superoptimization problems [10, 11]. Perhaps methodologies based on efficient SAT and SMT solvers will afford further opportunities for advancement.

The need for speed and power efficiency has become so acute that radically different non-von Neumann processors are being built [12], in some cases orders of magnitude more power efficient but restrictive in the problems they can solve. In the coming years we can expect to see a huge amount of activity and development on this important problem.

[1] “Meta will have 350,000 of Nvidia’s fastest AI GPUs by end of year, buying AMD’s MI300, too,” Tom’s Hardware, Jan. 19, 2024, https://www.tomshardware.com/tech-industry/meta-will-have-350000-of-nvidias-fastest-ai-gpus-by-end-of-year-buying-amds-mi300-too

[2] “NVIDIA H100 Tensor Core GPU Datasheet,” https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet

[3] “Electricity Rates for Every State in The U.S.,” https://www.energybot.com/electricity-rates/

[4] Henry Warren, “Hacker’s Delight,” 2nd Edition, 2012, https://web.archive.org/web/20190915025154/http://www.hackersdelight.org/

[5] “Superoptimization”, https://en.wikipedia.org/wiki/Superoptimization

[6] Fawzi, A., Balog, M., Huang, A. et al. “Discovering faster matrix multiplication algorithms with reinforcement learning,” Nature 610, 47–53 (2022). https://doi.org/10.1038/s41586-022-05172-4

[7] Rajarshi Roy et al., “PrefixRL: Optimization of Parallel Prefix Circuits using Deep Reinforcement Learning,” https://arxiv.org/abs/2205.07000

[8] Hennessy, John L., and Thomas Gross. “Postpass code optimization of pipeline constraints.” ACM Transactions on Programming Languages and Systems (TOPLAS) 5, no. 3 (1983): 422-448.

[9] Arie Gurfinkel, “Algorithms for SAT,” https://ece.uwaterloo.ca/~agurfink/ece750t29/assets/pdf/02_SAT.pdf

[10] Abhinav Jangda and Greta Yorsh. 2017. “Unbounded superoptimization,” in Proceedings of the 2017 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward! 2017). Association for Computing Machinery, New York, NY, USA, 78–88. https://doi.org/10.1145/3133850.3133856

[11] Phitchaya Mangpo Phothilimthana, Aditya Thakur, Rastislav Bodik, and Dinakar Dhurjati. 2016. “GreenThumb: superoptimizer construction framework,” in Proceedings of the 25th International Conference on Compiler Construction (CC 2016). Association for Computing Machinery, New York, NY, USA, 261–262. https://doi.org/10.1145/2892208.2892233

[12] “OpenAI Agreed to Buy $51 Million of AI Chips From a Startup Backed by CEO Sam Altman,” Wired, Dec. 3, 2023, https://www.wired.com/story/openai-buy-ai-chips-startup-sam-altman/

 

Brute force cryptanalysis

A naive view of simple substitution ciphers is that they are secure because there are 26! ways to permute the English alphabet, and so an attacker would have to try 26! ≈ 4 × 10²⁶ permutations. However, such brute force is not required. In practice, simple substitution ciphers are breakable by hand in a few minutes, and you can find software that automates the process.

However, for modern encryption, apparently brute force is required. If you encrypt a message using AES with a 128-bit key, for example, you can’t do much better than try 2¹²⁸ keys. You might be able to do a little better, but as far as is openly known, you can’t do orders of magnitude better.

Even for obsolete encryption methods such as DES it still takes a lot more effort to break encryption than to apply encryption. The basic problem with DES is that it used 56-bit keys, and trying 2⁵⁶ keys is feasible [1]. You won’t be able to do it on your laptop, but it can be done using many processors in parallel [2]. Still, you’d need more than a passing curiosity about a DES encrypted message before you’d go to the time and expense of breaking it.

If breaking a simple substitution cipher really did require brute force, it would offer 88-bit security. That is, 26! roughly equals 2⁸⁸. So any cipher offering b-bit security for b > 88 is more secure in practice than breaking simple substitution ciphers would be in naive theory. This would include AES, as well as many of its competitors that weren’t chosen for the standard, such as Twofish.
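That figure is easy to check in Python:

    from math import factorial, log2

    # 26! is about 4 x 10^26, or a little over 2^88.
    print(f"{factorial(26):.2e}")     # 4.03e+26
    print(log2(factorial(26)))        # about 88.4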

For all the block ciphers mentioned here, the number of bits of security they offer is equal to the size of the key in bits. This isn’t always the case. For example, the security level of an RSA key is much less than the size of the key, and the relation between key size and security level is nonlinear.

A 1024-bit RSA modulus is believed to offer on the order of 87 bits of security, which incidentally is comparable to 26! as mentioned above. NIST FIPS 186-5 recommends 2048 bits as the minimum RSA modulus size. This gives about 117 bits of security.

The security of RSA depends on the difficulty of factoring the product of large primes [3], and so you can compute the security level of a key based on the efficiency of the best known factoring algorithm, which is currently the General Number Field Sieve. More on this here.


[1] There are ways to do better than brute force against DES, if you have an enormous number of messages all encrypted with the same key.

[2] In 1998, the EFF built a machine called Deep Crack with 1,856 custom processors that could crack DES encrypted messages in at most nine days, four and a half days on average.

[3] Nobody has proved that breaking RSA requires factoring. There is a variation on RSA that is provably as hard as factoring but as far as I know it has never been widely used.

Base 64 encoding remainder problem

I’ve mentioned base 64 encoding a few times here, but I’ve left out a detail. This post fills in that detail.

Base 64 encoding comes up in multiple contexts in which you want to represent binary data in text form. I’ve mentioned base 64 encoding in the context of Gnu ASCII armor. A more common application is MIME (Multipurpose Internet Mail Extensions) encoding.

Base 64 encoding uses 64 characters (A…Z, a…z, 0…9, +, and /) to represent six bits at a time.

In the previous post I showed how ASCII armor encoded a 91,272 byte JPEG image into a text file and how it could convert the text back into binary. The number of bytes in the file was a multiple of 3, which you could quickly confirm by casting out nines.

If the number of bytes in a file is not a multiple of three, the number of bits is not a multiple of six, and so we have to do something with the remainder.

For an example, let’s start with a file containing the bits

000000 000001 000010 000011 000100 000101 000110 000111

If we run gpg --enarmor on this file, we get

ABCDEFGH
=/u99

and some extra text for human consumption. The base 64 encoding is ABCDEFGH, the equal sign is a separator, and /u99 is a checksum.

If we delete the last 8 bits from our file and run ASCII armor again, we get

ABCDEFE=
=/IPL

The second equal sign separates the base 64 encoding from the checksum, but the first equal sign is something new.

If we chop eight more bits off the end of the file we get

ABCDEA==
=izh9

What’s going on here? In each case, the first 30 bits are being encoded as ABCDE. The remaining bits are

000101 000110 000111

When we cut off the last 8 bits we were left with

000101 0001

The bits 000101 are encoded as F, and the last four bits 0001 are padded to 000100 and encoded as E. The trailing equal sign is a signal that two bits were added as padding.

When we cut off 8 more bits we were left with

00

which was padded to 000000 and encoded as A. Then two equal signs were added to signal that four bits were added as padding.

So the rule is add two or four 0s at the end to make the number of bits a multiple of six. Then add an equal sign for each pair of bits added.

Since file sizes are multiples of bytes, and a byte is 8 bits, the number of bits in a file is always even. This means the remainder when the number of bits is divided by 6 is 0, 2, or 4. So if we add padding, we only add two or four zero bits and never an odd number of bits.
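Here’s a sketch of the same behavior using Python’s base64 module rather than gpg. The six bytes below spell out the 48 bits from the example above.

    import base64

    # The 48 bits 000000 000001 ... 000111 packed into six bytes.
    data = bytes([0b00000000, 0b00010000, 0b10000011,
                  0b00010000, 0b01010001, 0b10000111])

    print(base64.b64encode(data))        # b'ABCDEFGH'
    print(base64.b64encode(data[:-1]))   # b'ABCDEFE='  one byte dropped, one =
    print(base64.b64encode(data[:-2]))   # b'ABCDEA=='  two bytes dropped, two ==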

Binary to text to binary

Gnu Privacy Guard includes a way to encode binary files as plain ASCII text files, and turn these text files back into binary. This is intended as a way to transmit encrypted data, but it can be used to convert any kind of file from binary to text and back to binary.

To illustrate this, I’ll use Albrecht Dürer’s Melencolia I as an example. (More on this image and its mathematical significance here.)

Albrecht Dürer’s engraving Melencolia I

This image is saved in the file Melencolia.jpg.

Binary to text

If we run

   gpg --enarmor Melencolia.jpg

at a command line, it produces a file Melencolia.jpg.asc, the asc suffix indicating an ASCII file.

We can look inside this file if we’d like.

-----BEGIN PGP ARMORED FILE-----
Comment: Use "gpg --dearmor" for unpacking

/9j/4AAQSkZJRgABAQEBLAEsAAD/4QBoRXhpZgAATU0AKgAAAAgABAEaAAUAAAAB
AAAAPgEbAAUAAAABAAAARgEoAAMAAAABAAIAAAExAAIAAAARAAAATgAAAAAAAAEs
AAAAAQAAASwAAAABUGFpbnQuTkVUIHYzLjUuOAAA/9sAQwACAQECAQECAgICAgIC
...
wJ/wkzRf8IN4LCCVVwNCtM8sDnPl5zkk565NbcH7Lnw01K0gtrn4feCriB4QfLl0
S2dV+bdgApwMknAwMk+tFFcktjqif//Z
=rpd1
-----END PGP ARMORED FILE-----

The cryptic text /9j/4A…//Z is a base 64 encoded representation of the binary file. Think of the file as one big binary number. Write that number in base 64, i.e. partition the bits into groups of six. Then represent the base 64 “digits” as alphanumeric characters, plus the symbols + and /. More on the details here.

The line =rpd1 is a 24-bit CRC checksum. The equal sign is a separator, and rpd1 is a base 64 encoding of the checksum.

The JPG file is 91,272 bytes and the ASCII file is 123,712 bytes. The ASCII file is about 1/3 larger because every six bits in the binary file corresponds to an eight-bit ASCII character. The ASCII file is a little bit more than 1/3 larger because of the human-friendly text above and below the base 64 encoding, the newline characters, and the checksum.
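A rough check of those numbers in Python:

    # Base 64 alone expands 3 bytes of binary into 4 characters of text.
    binary_bytes = 91_272
    base64_chars = 4 * binary_bytes / 3

    print(base64_chars)               # 121696.0 characters
    print(123_712 - base64_chars)     # about 2,000 bytes of headers, newlines, checksum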

Text to binary

If we run

    gpg --dearmor Melencolia.jpg.asc

at a command line, it produces a file Melencolia.jpg.asc.gpg. This file is bit-for-bit exactly the same as the original file, which we could confirm by running

    diff Melencolia.jpg Melencolia.jpg.asc.gpg


Why “a caret, euro, trademark” ’ in a file?


Why might you see ’ in the middle of an otherwise intelligible file? The reason is very similar to the reason you might see �, which I explained in the previous post. You might want to read that post first if you’re not familiar with Unicode and character encodings.

It all has to do with an encoding error, probably. Not necessarily, since, for example, I deliberately put ’ in the opening sentence. But assuming it is an error, it’s likely an encoding error.

But it’s the opposite of the � error. The � occurs when non-UTF-8 text has been declared as (or implicitly interpreted as) UTF-8. In particular, you can run into this error if text encoded in ISO 8859-1 is interpreted as UTF-8.

The ’ sequence is usually the opposite: UTF-8 encoded text is being interpreted as Windows-1252 (a.k.a. CP-1252) encoded text. In particular, a single quote (U+2019) encoded in UTF-8 has been interpreted as the Windows-1252 text ’.

Since Windows-1252 is a superset of ISO 8859-1, the error resulting in � could also be described as a Windows-1252 error. So a � means Windows-1252 text has been interpreted as UTF-8, and ’ means UTF-8 has been interpreted as Windows-1252. In the former case there is an invalid character. In the latter case all the characters are valid, though they’re not the characters you were supposed to see.
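Both failure modes are easy to reproduce in Python by encoding with one system and decoding with the other. Here’s a sketch:

    s = "it’s"

    # UTF-8 bytes read as Windows-1252: the curly quote turns into â€™.
    print(s.encode("utf-8").decode("cp1252"))                      # itâ€™s

    # Windows-1252 bytes read as UTF-8: the byte is invalid, so you get �.
    print(s.encode("cp1252").decode("utf-8", errors="replace"))    # it�s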

You can fix the error by making your content and your encoding match. Or remove the offending character, replacing the curly single quote with a plain ASCII apostrophe.

You can find more details in this Stack Overflow post.

A valid character to represent an invalid character

U+FFFD REPLACEMENT CHARACTER

You may have seen a web page with the symbol � scattered throughout the text, especially in older web pages. What is this symbol and why does it appear unexpectedly?

The symbol we’re discussing is a bit of a paradox. It’s the (valid) Unicode character to represent an invalid Unicode character. If you just read the first sentence of this post, without inspecting the code point values, you can’t tell whether the symbol appears because I’ve made a mistake, or whether I’ve deliberately included the symbol.

The symbol in question is U+FFFD, named REPLACEMENT CHARACTER, a perfectly valid Unicode character. But unlike this post, you’re most likely to see it when the author did not intend for you to see it. What’s going on?

It all has to do with character encoding.

If all you want to do is represent Roman letters, common punctuation marks, and control characters, ASCII is fine. There are 128 ASCII characters, so each fits into a single 8-bit byte with a bit to spare. But as soon as you want to write façade, jalapeño, or Gödel you have a problem. And of course you have a bigger problem if your language doesn’t use the Roman alphabet at all.

ASCII wastes one bit per byte, so naturally people wanted to take advantage of that extra bit to represent additional characters, such as the ç, ñ, and ö above. One popular way of doing this was described in the standard ISO 8859-1.

Of course there are other ways of encoding characters. If your language is Russian or Hebrew or Chinese, you’re no happier with ISO 8859-1 than you are with ASCII.

Enter Unicode. Let’s represent all the world’s alphabets (and ideograms and common symbols and …) in a single system. Great idea, though there are a mind-boggling number of details to work out. Even so, once you’ve assigned a number to every symbol that you care about, there’s still more work to do.

You could represent every character with two bytes. Now you can represent 65,536 characters. That’s too much and too little. If you want to represent text that is essentially made of Roman letters plus occasional exotic characters, using two bytes per letter makes the text 100% larger.

And while 65,536 sounds like a lot of characters, it’s not enough to represent every Chinese character, much less all the characters in other languages. Turns out we need four bytes to do what Unicode was intended to do.

So now we have to deal with encodings. UTF-8 is a brilliant solution to this problem. It can handle all Unicode characters, but if text is just ASCII, it won’t be any longer: ASCII text is already valid UTF-8, since UTF-8 encodes the ASCII subset of Unicode in single bytes.

But there were other systems before Unicode, like ISO 8859-1, and if your file is encoded as ISO 8859-1 but a web browser thinks it’s UTF-8 encoded Unicode, some byte sequences will be invalid. Browsers use the character � as a replacement for invalid text that they could not otherwise display. That’s probably what’s going on when you see �.
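Here’s a small Python sketch of what the browser is doing, decoding ISO 8859-1 bytes as if they were UTF-8 and substituting the replacement character where the bytes are invalid:

    # A byte string that is valid ISO 8859-1 but not valid UTF-8.
    raw = "Gödel".encode("latin-1")                 # b'G\xf6del'

    print(raw.decode("latin-1"))                    # Gödel
    print(raw.decode("utf-8", errors="replace"))    # G�del, as a browser would show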

See the article on How UTF-8 works to understand why some bytes are not just a different character than intended but are illegal UTF-8 byte sequences.

When High Performance Computing Is Not High Performance

Everybody cares about codes running fast on their computers. Hardware improvements over recent decades have made this possible. But how well are we taking advantage of hardware speedups?

Consider these two C++ code examples. Assume here n = 10000000.

// Snippet 1: two separate loops over the arrays
void sub(int* a, int* b) {
    for (int i=0; i<n; ++i)
        a[i] = i + 1;
    for (int i=0; i<n; ++i)
        b[i] = a[i];
}

// Snippet 2: the same work in a single fused loop
void sub(int* a, int* b) {
    for (int i=0; i<n; ++i) {
        const int j = i + 1;
        a[i] = j;
        b[i] = j;
    }
}

Which runs faster? Both are simple and give identical results (assuming no aliasing). However on modern architectures, depending on the compilation setup, one will generally run significantly faster than the other.

In particular, Snippet 2 would be expected to run faster than Snippet 1. In Snippet 1, elements of the array “a”, which is too large to fit in cache, must be re-read from memory in the second loop after being written in the first, but this is not required for Snippet 2. The trend for over two decades has been for compute speed of newly delivered systems to grow much faster than memory speed, and the disparity is extreme today. The performance of these kernels is bound almost entirely by memory bandwidth. Thus Snippet 2, a fused loop version of Snippet 1, improves speed by reducing main memory accesses.

Libraries like the C++ STL are unlikely to help, since this operation is too specialized to expect a library to support it (especially the fused loop version). Also, the compiler cannot safely fuse the loops automatically without specific assurances that the pointers are unaliased, and even then it is not guaranteed to do so.

Thankfully, high level computer languages since the 1950s have raised the programming abstraction level for all of us. Naturally, many of us would like to just implement the required business logic in our codes and let the compiler and the hardware do the rest. But sadly, one can’t always just throw the code on a computer and expect it to run fast. Increasingly, as hardware becomes more complex, giving attention to the underlying architecture is critical to getting high performance.

Previous digital signature standard expires next month

The Digital Signature Standard (DSS) FIPS 186-4, first published in 2013, expires a few days from now, on February 3, 2024. It is superseded by NIST FIPS 186-5. This new version was published on February 3, 2023, giving everyone a year to adopt the new standard before it became required.

The differences between the two standards are summarized in Appendix E of the new standard. Here I’ll point out three differences, then expand a little on the third difference.

First of all, the Digital Signature Algorithm (DSA) from earlier versions of FIPS 186 is withdrawn.

Second, RSA keys have gotten longer. The previous minimum key length was 1024 bits. Now it is 2048.

Third, elliptic curve cryptography has matured quite a bit since 2013.

The 2013 version of the standard gave users a lot more freedom in choosing elliptic curves and base points on those curves. Now it appears that much of this freedom hasn’t turned out so well.

The binary curves and Koblitz curves that were approved in 2013 are now deprecated. These are the curves whose names begin with B- or K- as described in this post on elliptic curve naming conventions. But the P- curves P-224, P-256, P-384, and P-521 are still recommended.

While the binary and Koblitz curves have been removed, Edwards curves were added. Specifically, Ed25519 and Ed448 have been added for the Edwards-curve Digital Signature Algorithm (EdDSA).


Computing square root floor

Given an integer n, suppose you want to compute ⌊√n⌋, the greatest integer less than or equal to the square root of n. One reason you may want to compute ⌊√n⌋ is to know when you can stop trial division when factoring n. Similarly, ⌊√n⌋ gives you a starting point for Fermat’s factorization algorithm.

With floating point arithmetic

The obvious approach to computing ⌊√n⌋ would be to compute the square root of n as a real number and then round the result down.

Suppose we go down that route. How would we compute √n as a real number? The usual way is to use Newton’s square root method. First you make an initial guess at the square root of n, say a. Then repeatedly update your guess to be the average of your previous guess a and n/a:

a ← (a + n/a) /2.

This is an ancient algorithm, going back to at least 1000 BC in Babylon. We call it Newton’s method because it is a special case of Newton’s root-finding method, applied to f(x) = x² − n.
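Here is a minimal floating point sketch of the iteration, just to make the update rule concrete:

    def newton_sqrt(n, tol=1e-12):
        """Approximate the square root of n > 0 by Newton's method."""
        a = max(n, 1.0)                 # crude initial guess
        while abs(a*a - n) > tol * n:
            a = (a + n/a) / 2           # average the guess with n / guess
        return a

    print(newton_sqrt(2))               # 1.4142135623730951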

Newton’s square root algorithm is implemented in computer hardware to this day. In some sense, we still compute square roots the same way the ancient Babylonians did.

Without floating point arithmetic

Our original problem was to find an integer, the largest integer no greater than the square root of the integer n. Can we take advantage of the fact that we’re working with integers?

A general approach to solving problems over the integers is to solve the same problem over the real numbers, then convert the result to an integer. That’s what we did above, but what if we incorporate the “integerification” into Newton’s algorithm itself so that we never work with real numbers?

This is a reasonable idea, but we don’t know a priori that it will work. Sometimes integer problems can be solved by integer analogs of real algorithms, but sometimes not. Integer optimization problems are harder than real number optimization problems because the optimal integer solution cannot in general be found by first finding the optimal real solution and rounding. And even when this approach works, it may not be the most efficient approach.

It turns out that the integer analog of Newton’s method does work. Bressoud [1] gives pseudocode for an algorithm which I write in Python below.

def sqrt_floor(n):
    a = n
    b = (n + 1) // 2
    while b < a:
        a = b
        b = (a*a + n) // (2*a)
    return a

Note that in Python the // operator carries out integer division, i.e. a // b computes ⌊a/b⌋.

The somewhat opaque line updating b inside the loop is just Newton’s method, averaging a and n/a, rounded down.

This algorithm could run on hardware with no floating point capability, only integer arithmetic. It will also work on extended precision integers where floating point would fail. For example, the following Python code

from math import factorial, sqrt
sqrt(factorial(1000))

produces an error “OverflowError: int too large to convert to float.” But

sqrt_floor(factorial(1000))

works just fine.
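As a quick sanity check, sqrt_floor should return the unique integer r with r² ≤ n < (r + 1)²:

    # Verify sqrt_floor against the defining property of the square root floor.
    for n in range(1, 100_000):
        r = sqrt_floor(n)
        assert r*r <= n < (r + 1)*(r + 1)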

Update: FIPS 186-5

After writing this post, I was looking through the NIST FIPS 186-5 Digital Signature Standard (DSS). In Appendix B.4 the standard gives an algorithm for determining whether a large integer is a square. The algorithm in the appendix is essentially the algorithm above. The line that carries out the Newton method update is not explained, but this post explains where this otherwise mysterious line comes from.

[1] David M. Bressoud. Factorization and Primality Testing. Springer, 1989.