How quantum computing would affect Bitcoin

Bitcoin relies on two kinds of cryptography: digital signatures and hash functions. Quantum computing would be devastating to the former, but not the latter.

To be more specific, the kind of digital signatures used in Bitcoin could in theory be broken by a quantum computer using Shor’s algorithm. Digital signatures could use quantum-resistant algorithms [1], but these algorithms are not used in Bitcoin (or much of anything else) at this time.

A quantum computer could use Grover’s algorithm to speed up reversing hash functions, but only by a quadratic factor, so this is a much smaller threat. For this post, we will simply assume a quantum computer could break digital signatures but not hash functions.

Digital signatures rely on public key cryptography, but Bitcoin uses something you might call not-so-public key cryptography. Public keys are not necessarily made public. You can receive funds sent to a hash of your public key, but to spend funds you have to reveal your unhashed public key.

This asymmetry means that presumably you could receive funds and be safe from quantum computers as long as you never spend the funds. But once you spend the funds, your public key really is public, published on the blockchain, and a quantum computer could derive your private key from your public key. This is because digital signatures are verified when money is spent, not when it is received.
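
To make the hashing step concrete, here is a minimal sketch (not wallet code) of the HASH160 construction used in legacy pay-to-public-key-hash addresses: the public key is run through SHA-256 and then RIPEMD-160, and the address encodes this 20-byte hash (plus a version byte and checksum). The public key below is a made-up placeholder, and RIPEMD-160 support in Python’s hashlib depends on the underlying OpenSSL build.

    import hashlib

    def hash160(pubkey: bytes) -> bytes:
        """RIPEMD-160 of SHA-256 of a public key (the 'public key hash')."""
        sha = hashlib.sha256(pubkey).digest()
        # ripemd160 is only available if the underlying OpenSSL provides it
        return hashlib.new("ripemd160", sha).digest()

    # hypothetical 33-byte compressed public key, for illustration only
    pubkey = bytes.fromhex("02" + "11" * 32)
    print(hash160(pubkey).hex())  # the 20-byte hash an address is built from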

It is widely believed that practical quantum computing is at least several years away. Maybe large-scale quantum computing is decades away, and maybe it will never happen. Nobody knows. But if quantum computing becomes practical before the world shifts to quantum-resistant cryptography, it would be a catastrophe. It’s hard to appreciate how thoroughly public key cryptography is woven into contemporary life.

Could there be a quantum computer sitting in the basement of the NSA right now? I doubt the NSA could be a decade ahead of public quantum computing efforts, but this is just speculation. Classified cryptography research has historically been ahead of open research. For example, differential cryptanalysis was discovered long before it was made public. I expect the gap between classified and open research is smaller now that there is much more public research in cryptography, but presumably there is still some gap.


[1] NIST recommends ML-DSA (Dilithium) and SLH-DSA (SPHINCS+) for post-quantum digital signature algorithms. See FIPS 204 and FIPS 205.

Storing data in images

This post will connect a couple posts from yesterday and explore storing data in images.

Connections

I only realized today that there’s a connection between two blog posts I wrote yesterday.

The first post was about the probability of sending money to a wrong Bitcoin address by mistyping. Checksums make it extremely unlikely that a typo would result in an invalid address. And if somehow you got past the checksum, you would very likely send money to an unused address rather than an unintended used address.

The second post was about QR codes that embed images. I give a couple examples in the post. The first is a QR code that looks like the symbol for a chess king and links to my Kingwood Data Privacy site. The second is a QR code that looks like the icon for @CompSciFact and links to that account’s page on X.

Someone responded to the first post saying that nobody types in Bitcoin addresses; they scan QR codes. The second post suggests QR codes are very robust. I wrote the post just having fun with the novelty of using QR codes in an unintended way, but there is a pragmatic point as well. The fact that you can play with the pixel sizes enough to create images means that there is a lot of safety margin in QR codes.

If you send Bitcoin by scanning a QR code, it’s very likely that you will scan the code correctly. The individual pixels are robust, and QR codes have checksums built in. If somehow you managed to incorrectly scan a QR code for an address, bypassing the QR code checksums, then the protections mentioned in the typo post come into play. The checksums in a Bitcoin address apply whether you mistype or mis-scan an address.

Storing data on paper

QR codes are a way to store a relatively small amount of data in an image. How much data could you store in an image the size of a sheet of paper?

Ondřej Čertík did an experiment to find out. He wanted to see how much data he could store on a piece of paper using a regular printer and reliably retrieve the data using a phone camera. You can find his results here.
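
For a rough sense of scale (my own back-of-the-envelope numbers, not Ondřej’s), you can bound the raw capacity by counting printed dots. At 600 DPI, a US Letter sheet has around 34 million dot positions, so even a perfect one-bit-per-dot scheme tops out near 4 MB, and any camera-readable scheme with redundancy and larger cells will store much less.

    dpi = 600
    dots = 8.5 * 11 * dpi**2  # printable dot positions on a US Letter sheet
    print(dots / 8 / 1e6)     # ≈ 4.2 MB if every dot carried one bit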

QR code internals

I’ve thought about blogging about how the error detection and correction in QR codes works, but it’s too complicated. I like writing fairly short articles, and QR error correction is too complicated to describe in, say, 300 words. Other people have written long articles on how it works, and you can find these articles if you’re interested.

It might be possible to write a short article about why it takes a long article to describe QR code error correction. That is, it might be possible to describe briefly the reasons why the algorithm is so complicated. I imagine part of the complexity is just historical contingency, but I also imagine the problem has some intrinsic complexity that isn’t obvious at first glance.

Ondřej used Golay codes for error correction, and it worked fairly well. It would be interesting to compare the robustness of QR codes to a design that simply used Golay codes.
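
As an aside on Golay codes (this check is mine, not part of Ondřej’s write-up): the classical binary Golay code is a [23, 12] code that corrects up to three bit errors, and it is perfect, meaning the Hamming balls of radius 3 around its 2^12 codewords exactly fill the space of 23-bit words. That’s quick to verify.

    from math import comb

    # volume of a Hamming ball of radius 3 in 23-bit space
    ball = sum(comb(23, i) for i in range(4))  # 1 + 23 + 253 + 1771 = 2048
    print(ball * 2**12 == 2**23)               # True: the balls tile the space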

From Bitcoin to Jupiter

James Burke hosted a BBC series called Connections in which he would trace unexpected lines of connections. For example, I recall watching one episode in which he traced the connection between a touchstone and nuclear weapons.

In this post we’ve connected Jupiter to Bitcoin. How? Bitcoin addresses led to a discussion of QR codes, which led to Ondřej’s experiment, which led to Golay codes, which were used on a Voyager probe to Jupiter.

 

An uncrossed knight’s tour

I’ve written several times about knight’s tours of a chessboard. The paths in these tours cross each other many times. What if you wanted to look for tours that do not cross themselves? You can’t reach every square this way. You can reach half of them, but no more than half.

The following tour is part of the article Uncrossed Knight’s Tours in Donald Knuth’s book Selected Papers on Fun & Games.


Dithered QR codes

I saw a post by Dave Richeson on Mastodon about making QR codes that look like images. Turns out you can shrink a black square in a QR code by up to a factor of three while keeping the code usable. This gives you the wiggle room to create dithered images.

I tried a few examples using this software. The code works with photos, but it works really well with simpler images like the ones below.

Here’s one for my data privacy company, Kingwood Data Privacy LLC.

kingwooddataprivacy.com

And here’s one for my account @CompSciFact on X.

https://x.com/CompSciFact

Probability of typing a wrong Bitcoin address

I heard someone say that Bitcoin is dangerous because you could easily make a typo when entering an address, send money to the wrong person, and have no recourse. There are dangers associated with Bitcoin, such as losing a private key, but address typos are not a major concern.

Checksums

There are several kinds of Bitcoin addresses. Each is at least 20 bytes (160 bits) long, with at least 4 bytes (32 bits) of checksum. The chances of a typo resulting in a valid checksum are about 1 in 2^32.
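
For the legacy Base58Check address format, the checksum is the first four bytes of a double SHA-256 of the rest of the decoded address. Here is a minimal sketch of the check; it assumes a legacy address, since newer Bech32 addresses use a different (BCH-based) checksum.

    import hashlib

    B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

    def base58check_ok(addr: str) -> bool:
        # decode base58 to an integer, then to bytes
        n = 0
        for ch in addr:
            n = n * 58 + B58.index(ch)
        raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
        # each leading '1' character encodes a leading zero byte
        raw = b"\x00" * (len(addr) - len(addr.lstrip("1"))) + raw
        payload, checksum = raw[:-4], raw[-4:]
        return hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4] == checksum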

Used addresses

Let’s ignore the checksum for this section.

Because addresses are formed by cryptographic hash functions, we can assume the values are essentially randomly distributed in the space of possible addresses. The addresses are deterministic, but for modeling purposes, random is as random does.

This means a typo of an actual address is no more or less likely to be another actual address than an address typed at random. This is unlike, say, English words: a mistyped English word is more likely to be another English word than random keystrokes would be.

There have been on the order of a billion Bitcoin addresses used, in a space of 2^160 possibilities. (Actually more since some addresses have more than 160 bits.) There’s about a 1 in 10^39 chance that a random 160-bit sequence corresponds to an address somewhere on the Bitcoin blockchain.
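
As a quick sanity check on that figure, with round numbers:

    used = 1e9           # rough number of used addresses
    space = 2.0 ** 160   # size of the 160-bit address space
    print(used / space)  # ≈ 7e-40, i.e. about 1 in 10^39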

If you sent money to a non-existent address, you still lose your money, but it’s extremely unlikely that you would accidentally send money to a real address that you didn’t intend.

Addresses close in edit distance

Someone with the Caesarean handle Veni Vidi Vici on X asked

What about the odds that out of those 1B addresses, two of them are one character swap away from each other?

That’s an interesting question. Let’s assume the addresses are Base58-encoded strings of length 26. Addresses could be longer, but assuming the minimum length increases the probability of addresses being close.

How many addresses are within one or two character swaps of another? I addressed a similar question here a couple weeks ago. If all the characters were unique, the number of strings within k swaps of each other would be

\left| S_1(26, 26-k) \right|

where S_1 denotes Stirling numbers of the first kind. For k = 1 this would be 325 and for k = 2 this would be 50,050. This assumes all the characters are unique; I haven’t thought through the case where characters are repeated.
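
Here is a quick check of those two numbers, using the standard recurrence for unsigned Stirling numbers of the first kind.

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def stirling1(n, k):
        """Unsigned Stirling numbers of the first kind."""
        if n == 0:
            return 1 if k == 0 else 0
        if k == 0:
            return 0
        return (n - 1) * stirling1(n - 1, k) + stirling1(n - 1, k - 1)

    print(stirling1(26, 25))  # 325
    print(stirling1(26, 24))  # 50050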

For round numbers, let’s say there are a billion addresses, and for each address there are a million other addresses that are close in some sense, plausible typos of the address. That would be 10^12 addresses and typos, spread out in a space of ≈ 10^45 (i.e. 58^26) possible addresses.

Now there’s an implicit Birthday Problem here. No particular address is likely to collide with another, even when you allow typos, but what about the likelihood that some address collides?

Say we partition our space of 10^45 addresses into N = 10^39 groups, each consisting of an address and its million possible typos. Then as a rule of thumb, you’d need around √N random draws before you have a 50-50 chance of seeing a collision. Since 10^9 is a lot less than √N ≈ 10^19.5, it’s unlikely that any two addresses collide, even when you consider each address along with a million associated typos.
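
To put a number on it, the standard birthday approximation says the probability of at least one collision among n draws into N equally likely bins is about 1 − exp(−n(n − 1)/(2N)). A rough check with the round numbers above:

    from math import expm1

    n = 1e9                             # used addresses
    N = 1e45 / 1e6                      # bins: an address together with ~a million typos
    p = -expm1(-n * (n - 1) / (2 * N))  # birthday approximation
    print(p)                            # on the order of 5e-22: vanishingly unlikely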

Why are CUDA kernels hard to optimize?

Explosive datacenter demand has caused developers to leave no stone unturned in search of higher efficiency. The DeepSeek team, not satisfied with Nvidia’s CUDA libraries, used a virtualized form of assembly language (PTX) to write kernels to accelerate their AI computations. Others have attempted to generate optimized kernels using AI, though some results have been questioned (for various attempts, see also here, here, here, here and here).

Why is it hard to write peak-speed GPU code? Writing really fast code has always been arduous, but it seems especially so for modern GPUs.

To understand the issues, my colleagues and I performed a detailed study of GPU kernel performance, across eight different GPU models from three GPU vendors [1]. The test case we considered was low precision matrix multiply, a resource-intensive operation for LLM training. We ran many, many experiments to understand what causes performance variability and why kernels sometimes run slower than you’d think they should.

For the cases we studied, we found about half a dozen different factors, but the upshot is this: modern processors like GPUs have become so complex—notably their multi-layered hierarchical memory subsystems—that it is difficult to get consistently high performance across all problem sizes a user might want to run in practice. As a result, the performance for the target problem might be surprisingly and mysteriously less than the advertised peak performance for the operation in question. The reasons might be obvious—like cache line misalignment—or more opaque. For the matrix multiply case, various issues, like the need for prefetching, caching, tiling, and block size selection, make it difficult for the kernel developer to optimize for every input size a user might specify.

Below is an example graphic from our paper. The color indicates the floating point operation rate (FLOPs) for a reduced precision matrix multiply on a representative GPU using a library call. The horizontal and vertical axes refer to the matrix dimensions for the problem (see paper for details). Though some regions show performance near the theoretical peak (red), other immediately adjacent regions show problem sizes that run dramatically slower—in fact, at only about half of peak performance, or less. Presumably this is because either individual kernel performance or the selection of kernels used by the library is suboptimal. The net outcome is, if your problem lands in a “bad” region, you’re in for a big surprise: your performance will be much less than expected, and you may not understand why. All high-performing GPUs we tested showed irregular behaviors such as this [2] [3].
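
As a very rough illustration of the kind of measurement involved (a sketch, not the benchmark harness used in the study; it assumes PyTorch and a CUDA-capable GPU), you can time a half-precision matrix multiply across a range of sizes and watch the achieved throughput jump around:

    import time
    import torch

    def matmul_tflops(n, reps=20):
        a = torch.randn(n, n, device="cuda", dtype=torch.float16)
        b = torch.randn(n, n, device="cuda", dtype=torch.float16)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(reps):
            _ = a @ b
        torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / reps
        return 2 * n**3 / elapsed / 1e12  # a square matmul takes about 2n^3 flops

    for n in range(1024, 8193, 512):
        print(n, round(matmul_tflops(n), 1))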

In the past this was not always a problem. Older architectures like the Sun SPARC or Cray vector processors, complex as they were, were simple enough that a reasonably well-tuned computational kernel might run well across most if not all inputs [4]. Today, performance is much harder to predict and can vary substantially based on the requested problem sizes.

This is a tough challenge for library developers. Whenever a new GPU model family comes out, new kernel optimization and tuning are required to give (hopefully) more consistently high performance, and some cases get more developer attention than others due to customer needs and limited developer resources. As a result, infrequently used operations do not get as much attention, but they may be the exact ones you need for your particular case [5].

Tools are available to help optimize for specific cases. The excellent Nvidia CUTLASS library exposes many more fine-grained options than the standard cuBLAS library. Those not faint of heart can try programming Nvidia GPUs at the level of PTX, or (shudder) SASS. Superoptimization might help, but only for very small code fragments, and even then there may be too many external factors influencing performance to make it effective.

Autotuning is a promising approach, though it doesn’t seem to have reached its full potential in production. AI might really help here [6]; in our own paper we had some success using machine learning methods like decision trees and random forests to model performance as a function of problem size, though our work was exploratory and not production-ready. A well-crafted general solution, it would seem, would require a lot of effort to do right. Code sustainability and maintenance are also critical; a sustainable workflow would be needed to retrain on new GPUs, new CUDA releases, and even site-specific and system-specific settings like GPU power and frequency cap policies.

Most recent AI-driven work focuses on optimizing performance for only one or a few problem sizes. A truly production-quality general purpose tool would give both 100% accurate results and top achievable performance for any input problem size (even corner cases) or data type. This would require both optimized GPU kernels and an optimal dispatcher for kernel selection. The method would also need to be robust to issues like power and frequency variability in production runs. This currently seems to be an unsolved problem, and solving it would be of huge benefit to the hyperscaler community.

Notes

[1] For related work from a slightly different angle, see this excellent work from Matt Sinclair’s lab.

[2] It turned out this study was helpful for our production runs, helping us triage an odd performance conundrum we encountered when attempting an exascale run (see here, here).

[3] Incidentally, this example shows the hazards of using simplistic benchmark suites to measure GPU code performance. Unless the benchmark captures a truly large and varied set of input cases, any new optimization method proposed can artificially “overfit” performance on the tests and still underperform miserably on many user cases of interest.

[4] I once wrote a 1-D wavelet convolution kernel for a Sparc processor, using a circular register buffer and loop unrolling to minimize loads and stores, thus achieving near-peak performance. The code was correctly compiled from C to assembly, and performance for a given problem was almost precisely predictable. That was before the days of complex memory hierarchies.

[5] One vendor I know of used to take customer requests for hand-tuning expensive library calls and make them run fast at the specific problem sizes the customer cared about.

[6] LLM kernel generation seems like a natural fit, particularly since LLM-generated code quality has much improved in recent months. Kernel selection and parameter selection for block size, tiling etc. might be better solved by direct training of machine learning models, or methods like this. Comparative studies on this would be informative.

 

The biggest math symbol

The biggest math symbol that I can think of is the Riemann P-symbol

w(z)=P \left\{ \begin{matrix} a & b & c & \; \\ \alpha & \beta & \gamma & z \\ \alpha^\prime & \beta^\prime & \gamma^\prime & \; \end{matrix} \right\}

The symbol is also known as the Papperitz symbol because Erwin Papperitz invented it for expressing solutions to Bernhard Riemann’s differential equation.

Before writing out Riemann’s differential equation, we note that the equation has regular singular points at a, b, and c. In fact, that is its defining feature: it is the most general linear second order differential equation with three regular singular points. The parameters a, b, and c enter the equation as roots of expressions in the denominators; that’s as it has to be if these are the singular points.

The way the Greek letter parameters enter Riemann’s equation is more complicated, but there is a good reason for the complication: the notation makes solutions transform as simply as possible under a bilinear transformation. This is important because Möbius transformations are the conformal automorphisms of the Riemann sphere.

To be specific, let

m(z) = \frac{Az + B}{Cz + D}

be a Möbius transformation. Then

P \left\{ \begin{matrix} a & b & c & \; \\ \alpha & \beta & \gamma & z \\ \alpha^\prime & \beta^\prime & \gamma^\prime & \; \end{matrix} \right\} = P \left\{ \begin{matrix} m(a) & m(b) & m(c) & \; \\ \alpha & \beta & \gamma & m(z) \\ \alpha^\prime & \beta^\prime & \gamma^\prime & \; \end{matrix} \right\}

Since the parameters on the top row of the P-symbol are the locations of singularities, when you transform a solution, moving the singularities, the new parameters have to be the new locations of the singularities. And importantly the rest of the parameters do not change.

Now with the motivation aside, we’ll write out Riemann’s differential equation in all its glory.

\frac{d^2w}{dz^2} + p(z) \frac{dw}{dz} + q(z) \frac{w}{(z-a)(z-b)(z-c)} = 0

where

p(z) = \frac{1-\alpha-\alpha^\prime}{z-a} + \frac{1-\beta-\beta^\prime}{z-b} + \frac{1-\gamma-\gamma^\prime}{z-c}

and

q(z) = \frac{\alpha\alpha^\prime (a-b)(a-c)} {z-a} +\frac{\beta\beta^\prime (b-c)(b-a)} {z-b} +\frac{\gamma\gamma^\prime (c-a)(c-b)} {z-c}
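
As a familiar special case, Gauss’s hypergeometric equation has regular singular points at 0, 1, and ∞. Writing its parameters as p, q, and r to avoid clashing with the letters above, the standard exponent pairs are {0, 1 − r} at 0, {p, q} at ∞, and {0, r − p − q} at 1, so its solutions can be expressed as

w(z)=P \left\{ \begin{matrix} 0 & \infty & 1 & \; \\ 0 & p & 0 & z \\ 1-r & q & r-p-q & \; \end{matrix} \right\}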


You can’t have everything you want: beta edition

The beta distribution is a conjugate prior for a binomial likelihood function, so it makes posterior probability calculations trivial: you simply add your data to the distribution parameters. If you start with a beta(α, β) prior distribution on a proportion θ, then observe s successes and f failures, the posterior distribution on θ is

beta(α + s, β + f).

There’s an integration going on in the background, but you don’t have to think about it. If you used any other prior, you’d have to calculate an integral, and it wouldn’t have a simple closed form.
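
Here is what the update looks like in code, a minimal sketch assuming SciPy and made-up data (a uniform prior, 12 successes, and 3 failures):

    from scipy.stats import beta

    a, b = 1, 1       # prior parameters (uniform prior, for illustration)
    s, f = 12, 3      # observed successes and failures (made-up numbers)

    posterior = beta(a + s, b + f)  # conjugate update: just add the counts
    print(posterior.mean())         # equals (a + s)/(a + b + s + f)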

The sum α + β of the prior distribution parameters is interpreted as the number of observations contained in the prior. For example, a beta(0.5, 0.5) prior is non-informative, having only as much information as one observation.

I was thinking about all this yesterday because I was working on a project where the client believes that a proportion θ is around 0.9. A reasonable choice of prior would be beta(0.9, 0.1). Such a distribution has mean 0.9 and is non-informative, and it would be fine for the use I had in mind.

Non-informative beta priors work well in practice, but they’re a little unsettling if you plot them. You’ll notice that the beta(0.9, 0.1) density has singularities at both ends of the interval [0, 1]. The singularity on the right end isn’t so bad, but it’s curious that a distribution designed to have mean 0.9 has infinite density at 0.

If you went further and used a completely non-informative, improper beta(0, 0) prior, you’d have strong singularities at both ends of the interval. And yet this isn’t a problem in practice. Statistical operations that are not expressed in Bayesian terms are often equivalent to a Bayesian analysis with an improper prior like this.

This raises two questions. Why must a non-informative prior put infinite density somewhere, and why is this not a problem?

Beta densities with small parameters have singularities because that’s just the limitation of the beta family. If you were using a strictly subjective Bayesian framework, you would have to say that such densities don’t match anyone’s subjective priors. But the beta distribution is very convenient to use, as explained at the top of this post, and so everyone makes a little ideological compromise.

As for why singular priors are not a problem, it boils down to the difference between density and mass. A beta(0.9, 0.1) distribution has infinite density at 0, but it doesn’t put much mass at 0. If you integrate the density over a small interval, you get a small probability. The singularity on the left is hard to see in the plot above: it’s almost a vertical line that overlaps the vertical axis. So despite the curve going vertical, there isn’t much area under that part of the curve.
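
You can see the density/mass distinction numerically. In this sketch (assuming SciPy), the density grows without bound as you approach 0, but very slowly, since the exponent is only −0.1, and the total probability below 0.01 is only a fraction of a percent.

    from scipy.stats import beta

    dist = beta(0.9, 0.1)
    for x in [1e-5, 1e-10, 1e-20, 1e-40]:
        print(x, dist.pdf(x))  # density grows without bound, but slowly
    print(dist.cdf(0.01))      # mass below 0.01: a fraction of a percent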

You can’t always get what you want. But sometimes you get what you need.


More on seed phrase words

Last week I wrote about how the English seed phrase words for crypto wallets, proposed in BIP39, are not ideal for memorization. This post gives a few more brief thoughts based on these words.

Prefix uniqueness

The BIP39 words have a nice property that I didn’t mention: the words are uniquely determined by their first four letters. This means, for example, that you could enter a seed phrase on a phone by typing just the first four letters of each word, and the wallet interface could fill in the rest.

Incidentally, although the BIP39 words are unique in how they begin, they are not unique in how they end. For example, cross and across are both on the list.

Creating a list

I wondered how hard it would be to come up with a list of 2048 common words that are unique in their first four letters. So I started with Google’s list of 10,000 most common words.

I removed one- and two-letter words and NSFW words, and reduced what remained to words uniquely determined by their first four letters, to create a list similar to the BIP39 words. That resulted in a list of 4228 words. You could delete over half of these words and still have a list of 2048 words uniquely determined by their first four letters.
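
Here is a sketch of that kind of filtering, keeping the first word seen for each four-letter prefix. The filename is hypothetical; it assumes a local copy of the word list with one word per line, most frequent first.

    seen = set()
    keep = []
    with open("google-10000-english.txt") as f:  # hypothetical filename
        for word in (line.strip() for line in f):
            if len(word) < 3:
                continue           # drop one- and two-letter words
            prefix = word[:4]
            if prefix not in seen:
                seen.add(prefix)
                keep.append(word)

    print(len(keep))               # words uniquely determined by their prefix
    print(len(keep) >= 2048)       # enough for a BIP39-sized list?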

Comparing lists

I was curious how many of the BIP39 list of 2048 words were contained in Google’s list of 10,000 most common words. The answer is 1666, or about 81%. (Incidentally, I used comm to answer this.)

Venn diagram

Vocabulary estimation and overlap

Here’s something else I was curious about. The average adult has an active vocabulary of between 20,000 and 35,000 words. So it seems reasonable that a typical adult would know nearly all the words on Google’s top 10,000 list. (Not entirely all. For example, I noticed one word on Google’s list that I hadn’t seen before.)

Now suppose you had a list of the 20,000 most common words and a person with a vocabulary of 20,000 words. How many words on the list is this person likely to know? Surely not all. We don’t acquire our vocabulary by starting at the top of a list of words, ordered by frequency, and working our way down. We learn words somewhat at random according to our circumstances. We’re more likely to know the most common words, but it’s not certain. So how would you model that?
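
Here is one toy model, offered only as a sketch (every modeling choice in it is an assumption): suppose word frequencies follow Zipf’s law over a large lexicon, and that a person knows a word if they have encountered it at least once among T word tokens, with T calibrated so the expected vocabulary size is 20,000. Then you can ask how many of the top 20,000 words such a person would be expected to know.

    import numpy as np

    V = 100_000                                   # assumed size of the full lexicon
    ranks = np.arange(1, V + 1)
    freqs = (1.0 / ranks) / np.sum(1.0 / ranks)   # Zipf word frequencies

    def expected_known(T, top=None):
        p_known = 1.0 - (1.0 - freqs) ** T        # P(word seen at least once)
        return p_known[:top].sum() if top else p_known.sum()

    # calibrate T by bisection so the expected vocabulary is 20,000 words
    lo, hi = 1.0, 1e9
    for _ in range(100):
        mid = (lo + hi) / 2
        if expected_known(mid) < 20_000:
            lo = mid
        else:
            hi = mid

    print(expected_known(hi, top=20_000))  # expected overlap with the top 20,000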

It’s interesting to think about how you would estimate someone’s vocabulary in the first place. You can’t give someone a quiz with every English word and ask them to check off the words they know, and estimation is complicated by the fact that word frequencies are far from uniform. Maybe the literature on vocabulary estimation would answer the questions in the previous paragraph.
