Bech32 encoding

Bech32 is an algorithm for encoding binary data, specifically Bitcoin addresses, in a human-friendly way using a 32-character alphabet. The Bech32 alphabet includes lowercase letters and digits, removing the digit 1, and the letters b, i, and o.

The Bech32 alphabet design is similar to that for other coding schemes in that it seeks to remove letters that are visually similar. However, the order of the alphabet is unusual:

    qpzry9x8gf2tvdw0s3jn54khce6mua7l

I’ll explain the motivation for the ordering shortly. Here’s the same alphabet in conventional order, not the order used by Bech32:

    023456789acdefghjklmnpqrstuvwxyz

The Bech32 algorithm does more than represent 5-bit words as alphanumeric characters. The full algorithm is documented here. The algorithm is applied to the output of a RIPEMD-160 hash, and so the number of bits is a multiple of 5.

The Bech32 algorithm includes a BCH (Bose–Chaudhuri–Hocquenghem) checksum, and takes its name from this checksum.

Note that the BCH checksum is not a cryptographic hash; it’s an error correction code. It’s purpose is to catch mistakes, not to thwart malice. Contrast this with Base58check that uses part of a SHA256 hash. Base58check can detect errors, but it cannot correct errors.

I’ve had a surprisingly hard time finding the specifics of the BCH(nd) code that Bech32 uses. Every source I’ve found either does not give values of n and d or gives different values. In any case, the order of the Bech32 alphabet was chosen so that mistakes humans are more like to make are more likely mistakes that BCH can detect and correct.

Related posts

Vanity addresses

Bitcoin addresses are essentially hash values of public keys encoded in Base58. More details here.

The addresses are essentially random characters chosen from the Base58 alphabet: uppercase and lowercase Latin letters and digits, with 0 (zero), I (capital I), O (capital O), and l (lowercase l) removed to prevent errors.

You could create an address that begins with chosen characters by generating private/public key pairs and hashing the latter until you get the initial string you want [1]. People have done this. The vanitygen software gives the example of searching for an address that starts with “Love” (after the mandatory initial 1).

$ ./vanitygen 1Love
...                         
Pattern: 1Love
Address: 1LoveRg5t2NCDLUZh6Q8ixv74M5YGVxXaN
Privkey: 5JLUmjZiirgziDmWmNprPsNx8DYwfecUNk1FQXmDPaoKB36fX1o

Search time

You can’t specify a long prefix. The difficulty in finding an address with n specified initial characters grows exponentially, like 58n.

Say it takes a minute on average to find an address with the desired first four characters. Each additional character makes the search take 58 times longer, so specifying 5 characters would take about an hour. If you wanted to find an address that begins 1AcmeCorp it would take 584 minutes or over 21 years.

Disadvantages

There are several reasons you might not want to use a vanity address.

First, it’s now customary to generate new addresses for each transaction. This gives Bitcoin some degree of privacy, though not much. Presumably if you go to the effort of creating a custom address you’d like to hold on to it. Maybe you’d like to use it as an address for someone to send you donations, for example, and you’re OK reusing the address.

Second, if you use software someone else wrote to generate the custom address, the site may hold onto your private key or send it somewhere.

Third, you can’t use the address with an HD wallet. Related to the first point, HD wallets generate a sequence of key pairs, and thus addresses, according to some deterministic algorithm, and so it can’t hold an address it didn’t generate. But there are wallets that will hold such addresses.

Proof of work

Incidentally, the proof-of-work task for Bitcoin is very similar to finding vanity addresses. You take a block and twiddle what you can until it hashes to a value with a target number of leading zero bits. More on that here.

El Salvador’s Bitcoin and Quantum Computing

The treasury of El Salvador owns over 6,000 Bitcoins. Its total holdings are currently worth roughly $700,000,000. These coins had been associated with one private key. Yesterday El Salvador announced that it would split its funds into 14 wallets in order to protect the funds from quantum computing. You can confirm using a blockchain explorer that this has indeed happened.

The most oversimplified takes say El Salvador has protected its Bitcoin reserves from a quantum computing attack by moving from one wallet to 14 wallets. But moving money from one private key to 14 private keys does not it itself offer any protection against quantum computing. It only means that stealing the funds would take 14 times longer. Unsurprisingly the original announcement is more nuanced than the online commentary on the announcement.

For reasons explained in the previous post, Bitcoins are safe from quantum attack until they are spent. You don’t need to reveal your public key to receive Bitcoins, only a hash of your public key. But you do need to reveal your public key to spend Bitcoins.

As far as I know, the one key that until recently held all of El Salvador’s Bitcoin reserves had never spent coins, and so the public key was never made public. If so, the reserves were already safe from quantum computing attacks, at least to the extent that the reserves are safe now. So, strictly speaking, yesterday’s move did not increase the security of the reserves from quantum attack.

What yesterday’s move accomplished was making it easier to spend part of the reserves. Now one of the wallets could spend reserves, revealing the public key associated with part of the reserves rather than a single key associated with all the reserves.

Splitting one large wallet into multiple smaller wallets is a good move, for reasons that have nothing to do with quantum computing. But there is a quantum angle here, namely that the split reduces the risk of quantum computing attack when some of the funds are spent.

There are proposals to harden the Bitcoin protocol against quantum attacks, and perhaps El Salvador can avoid spending from some or all of its wallets until one of these proposals is implemented.

Related posts

How quantum computing would affect Bitcoin

Bitcoin relies on two kinds of cryptography: digital signatures and hash functions. Quantum computing would be devastating to the former, but not the latter.

To be more specific, the kind of digital signatures used in Bitcoin could in theory be broken by quantum computer using Shor’s algorithm. Digital signatures could use quantum-resistant algorithms [1], but these algorithms are not used in Bitcoin (or much of anything else) at this time.

A quantum computer could use Grover’s algorithm to reduce the time required to reverse hash functions, but this is a much smaller threat. For this post, we will simply assume a quantum computer could break digital signatures but not hash functions.

Digital signatures rely on public key cryptography, but Bitcoin uses something you might call not-so-public key cryptography. Public keys are not necessarily made public. You can receive funds sent to a hash of your public key, but to spend funds you have to reveal your unhashed public key.

This asymmetry means that presumably you could receive funds and be safe from quantum computers as long as you never spend the funds. But once you spend the funds, your public key really is public, published on the blockchain, and a quantum computer could derive your private key from your public key. This is because digital signatures are verified when money is spent, not when it is received.

It is widely believed that practical quantum computing is at least several years away. Maybe large-scale quantum computing is decades away, and maybe it will never happen. Nobody knows. But if quantum computing becomes practical before the world shifts to quantum-resistant cryptography, it would be a catastrophe. It’s hard to appreciate how thoroughly public key cryptography is woven into contemporary life.

Could there be a quantum computer sitting in the basement of the NSA right now? I doubt the NSA could be a decade ahead of public quantum computing efforts, but this is just speculation. Classified cryptography research has historically been ahead of open research. For example, differential cryptanalysis was discovered long before it was made public. I expect the gap between classified and open research is smaller now that there is much more public research in cryptography, but presumably there is still some gap.

Related posts

[1] NIST recommends ML-DSA (Dilithium) and SLH-DSA (SPHINCS+) for post-quantum digital signature algorithms. See FIPS 204 and FIPS 205.

Probability of typing a wrong Bitcoin address

I heard someone say that Bitcoin is dangerous because you could easily make a typo when entering an address, sending money to the wrong person, and have no recourse. There are dangers associated with Bitcoin, such as losing a private key, but address typos are not a major concern.

Checksums

There are several kinds of Bitcoin addresses. Each is at least 20 bytes (160 bits) long, with at least 4 bytes (32 bits) of checksum. The chances of a typo resulting in a valid checksum are about 1 in 232.

Used addresses

Let’s ignore the checksum for this section.

Because addresses are formed by cryptographic hash functions, we can assume the values are essentially randomly distributed in the space of possible addresses. The addresses are deterministic, but for modeling purposes, random is as random does.

This means a typo of an actual address is no more or less likely to be another actual address than an address typed at random. This is unlike, say, English words: a mistyped English word is more likely to be another English word than random keystrokes would be.

There have been on the order of a billion Bitcoin addresses used, in a space of 2160 possibilities. (Actually more since some addresses have more than 160 bits.) There’s about a 1 in 1039 chance that a random 160-bit sequence corresponds to an address somewhere on the Bitcoin blockchain.

If you sent money to a non-existent address, you still lose your money, but it’s extremely unlikely that you would accidentally send money to a real address that you didn’t intend.

Addresses close in edit distance

Someone with the Caesarean handle Veni Vidi Vici on X asked

What about the odds that out of those 1B addresses, two of them are one character swap away from each other?

That’s an interesting question. Let’s assume the addresses are Base58-encoded strings of length 26. Addresses could be longer, but assuming the minimum length increases the probability of addresses being close.

How many addresses are within one or two character swaps of another? I addressed a similar question here a couple weeks ago. If all the characters were unique, the number of strings within k swaps of each other would be

|S1(26, 26 − k)|

where S1 denotes Stirling numbers of the first kind. For k = 1 this would be 325 and for k = 2 this would be 50,050. This assumes all the characters are unique; I haven’t thought through the case where characters are repeated.

For round numbers, let’s say there are a billion addresses, and for each address there are a million other addresses that are close in some sense, plausible typos of the address. That would be 1012 addresses and typos, spread out in a space of ≈1045 (i.e. 5826) possible addresses.

Now there’s an implicit Birthday Problem here. No particular address is likely to collide with another, even when you allow typos, but what about the likelihood that some address collides?

Say we partition our space of 1045 addresses into N = 1029 addresses with a million possible typos for each address. Then as a rule of thumb, you’d need around √N random draws before you have a 50-50 chance of seeing a collision. Since 109 is a lot less than 1014.5, it’s unlikely that any two addresses collide, even when you consider each address along with a million associated typos.

More on seed phrase words

Last week I wrote about how the English seed phrase words for crypto wallets, proposed in BIP39, are not ideal for memorization. This post gives a few more brief thoughts based on these words.

Prefix uniqueness

The BIP39 words have a nice property that I didn’t mention: the words are uniquely determined by their first four letters. This means, for example, that you could type in a seed phrase on a phone by typing the first four letters of each word and a wallet interface could fill in the rest of each word.

Incidentally, although the BIP39 words are unique in how they begin, they are not unique in how they end. For example, cross and across are both on the list.

Creating a list

I wondered how hard it would be to come up with a list of 2048 common words that are unique in their first four letters. So I started with Google’s list of 10,000 most common words.

I removed one- and two-letter words, and NSFW words, to try to create a list similar to the BIP39 words. That resulted in a list of 4228 words. You could delete over half of these words and have a list of 2048 words uniquely determined by their first four letters.

Comparing lists

I was curious how many of the BIP39 list of 2048 words were contained in Google’s list of 10,000 most common words. The answer is 1666, or about 81%. (Incidentally, I used comm to answer this.)

Venn diagram

Vocabulary estimation and overlap

Here’s something else I was curious about. The average adult has an active vocabulary of between 20,000 and 35,000 words. So it seems reasonable that a typical adult would know nearly all the words on Google’s top 10,000 list. (Not entirely all. For example, I noticed one word on Google’s list that I hadn’t seen before.)

Now suppose you had a list of the 20,000 most common words and a person with a vocabulary of 20,000 words. How many words on the list is this person likely to know? Surely not all. We don’t acquire our vocabulary by starting at the top of a list of words, ordered by frequency, and working our way down. We learn words somewhat at random according to our circumstances. We’re more likely to know the most common words, but it’s not certain. So how would you model that?

It’s interesting to think how you would estimate someone’s vocabulary in the first place. You can’t give someone a quiz with every English word and ask them to check off the words they know, and estimation complicated by the fact that word frequencies are far from uniform. Maybe the literature on vocabulary estimation would answer the questions in the previous paragraph.

Related posts

What’s hierarchical about a hierarchical wallet?

A few days ago I wrote about what’s in a crypto wallet. In that post I said that most crypto wallets now are hierarchical deterministic (HD) wallets.  And I said that HD wallets are deterministic in the sense that they derive all their keys from a seed phrase. But in what sense are HD wallets hierarchical? That’s the topic of this post.

A warm-up story

In the game of 20 questions, one person thinks of something and another tries to guess what it is by asking up to 20 yes-no questions. I once heard the physicist John Wheeler tell of a variation of this game in which the first person did not have a definite object in mind, but decided after each question what the answer should be. For example, if someone asks “Is this person a man?” the person would commit to the person being a man or woman, but would not decide on a particular man or woman yet.

Wheeler’s point was that quantum mechanics is like this variation on 20 questions in that the answers to questions don’t exist until the question is asked. What does this have to do with hierarchical deterministic wallets? Your private keys do not exist until you ask for them. But once you have created and used a key, a wallet will behave consistently with that creation.

The hierarchy

The hierarchy referred to in a hierarchical deterministic wallet is a set of five variables, as described in BIP-44:

  1. Purpose
  2. Coin type
  3. Account
  4. Change
  5. Address index

The meaning of the variables is explained in BIP-44. The lowest level, address index, is a sequential counter. So you can have separate sequential counters for each value of the four-tuple (purpose, coin type, account, change).

Your master key and the five variables above are inputs to a key derivation function used to create new keys as needed. Once you use a private key, a hash of its corresponding public key is memorialized on the blockchain. If it’s a Bitcoin transaction, it’s on the Bitcoin blockchain. If it’s an Ethereum transaction, it’s on the Ethereum blockchain, etc. (You can find a list of supported coin types here.)

You wallet does not (or at least logically need not) store all your keys. It can reason as follows. “If the master key and these hierarchical values were used, this would be the private key. And given this private key, this would be the public key, and this would be the corresponding address. Let me consult the blockchain to see whether in fact it was used.”

How would a wallet know how many transactions you’ve made under a particular branch of the hierarchy? It searches the corresponding blockchain. It first looks whether there is a ledger entry corresponding to address index 0. Then address index 1, etc. The algorithm allows for the possibility of gaps. If it cannot find a ledger entry corresponding to index 2, it looks for index 3, etc. up to a gap of 20. After looking ahead 20 index values and finding nothing, it concludes there is nothing else to be found.

Because everything is derived deterministically from the seed phrase and the hierarchical variables, you can back up a wallet by simply backing up the seed phrase.

In theory, you could carry out transactions using one brand of wallet, back it up by writing down the seed phrase, then restore the information to a different brand of wallet. In practice you may run into difficulty doing this.

Related posts

A lot of seed phrase words are similar

A couple days ago I wrote about how you might go about trying to recover a seed phrase that you had remembered out of order. I said that the list of seed phrase words had been designed to be distinct. Just out of curiosity I computed how similar the words are using Levenshtein distance, also known as edit distance, the number of single character edits it takes to turn one word into another.

A lot of the words—484 out of 2048—on the BIP39 list differ from one or more other words by only a single character, such as angle & ankle, or loud & cloud. The word wine is one character away from each of wing, wink, wire, and wise.

Other kinds of similarity

Edit distance may not the best metric to use because it measures differences in text representation. It’s more important for words to be conceptually or phonetically distinct than to be distinct in their spelling. For example, the pair donkey & monkey differ by one letter but are phonetically and conceptually distinct, as are the words liveolive.

Some pairs of words are very similar phonetically. For example, I wouldn’t want to have to distinguish or cannon & canyon over a phone call. The list is not good for phonetic distinction, unlike say the NATO alphabet.

Memorization

For ease of memorization, you want words that are vivid and concrete, preferably nouns. That would rule out pairs like either & neither.

The BIP39 list of words is standard. But other approaches, such as Major system encoding, are more optimized for memorability.

Design

It’s hard to make a long list of words distinct by any criteria, and 2048 is a lot of words. And the words on the list are intended to be familiar to everyone. Adding more vivid or distinct words would risk including words that not everyone would know. But still, it seems like it might have been possible to create a better word list.

Recovery

The earlier post discussed how to recover a seed phrase assuming that all the words are correct but in the wrong order. It would make sense to explore sequences in order of permutation distance, assuming that small changes to the order are more likely than large changes.

But if it’s possible that the words are not correct, you might try looking at words in edit distance order. For example, “You said one of the words was race. Could it have been rice?”

Related posts

Recovering a permuted seed phrase

As mentioned in the previous post, crypto wallets often turn a list of words, known as a seed phrase, into private keys. These words come from a list of 211 = 2048 words chosen to be distinct and memorable. You can find the list here. Typically a seed phrase will contain 12 words. For example:

prize, differ, year, cloth, require, wild, clean, kit, cart, knock, biology, above

Each word carries 11 bits of information, so in total this represents 132 bits, which is 128 bits of content and 4 bits of checksum.

Now suppose you memorized your list of 12 words, but then later you don’t remember them in the correct order. Then you have about half a billion permutations to try because

12! = 479,001,600.

However, not all of these permutations are equally likely to be correct. You are far more likely to transpose a couple words, for example, than to completely scramble the list. So you would start with the list of words in the remembered order, then explore further and further afield from initial sequence. In mathematical terms, you’ll try the permutations of your remembered sequence in order of Cayley distance.

There are 66 permutations of a list of 12 things that are a transposition of two elements. So if the sequence you started with was

abcdefghijkl

then you’d try things like bacdefghijkl or ebcdafhghijkl. There are 66 possibilities because there are 66 ways to choose 2 items from a set of 12; you’re choosing the two items to swap.

If the above didn’t work, you could next explore permutations that are at a Cayley distance of 2 from what you remember. Now there are 1,925 possibilities. At a Cayley distance of 3 there are 32,670 possibilities.

In general, the number of arrangements at a Cayley distance k from the original sequence of n items equals

|S1(n, nk)|

where S1 denotes the Stirling number of the first kind.

There are other approaches to measuring how far apart two permutations are. Another possibility is Kendall distance which counts the number of adjacent transpositions needed to turn one sequence into another. Kendall distance is sometimes called bubble sort distance by analogy with the bubble sort algorithm. Kendall distance may be more appropriate than Cayley distance because people are more likely to accidentally transpose adjacent elements of a sequence.

Neither Cayley distance nor Kendall distance exactly correspond to the corruption of human memories, but both would guide you in correcting likely changes before exploring unlikely changes.

Related posts

What’s in your wallet?

What’s in your Bitcoin wallet? Very little. I don’t mean very little value, but very little data. If you’re a Bitcoin billionaire, your wallet still doesn’t contain very many bits.

You might reasonably expect that a crypto wallet is a container, given the analogy with an ordinary wallet, but it’s not much of a container. It’s more like a bank teller than a physical wallet. It’s primarily a piece of software that facilitates transactions.

It’s misleading to speak of a wallet as holding cryptocurrency. It holds private keys that allow you to access cryptocurrency, so in that sense it’s a password manager. Not only that, these days it’s effectively a password manager containing only one password.

When you back up a wallet, say the Exodus wallet, it asks you to store a seed phrase somewhere safe. Fine, but how do you back up your wallet, not just your seed phrase? How do you back up the data in your wallet? You don’t, because there isn’t any data in your wallet other than the seed phrase [1]. Everything is derived from your seed phrase, at least for the current style of wallet, technically known as a hierarchical deterministic (HD) wallet. The derivation algorithm is documented in BIP39.

The first crypto wallets held a set of private keys. But now wallets derive private keys from your seed phrase. So even though you may have used multiple private keys, the wallet doesn’t store multiple keys. It could store them, but it doesn’t need to. That’s why you can back up your wallet by only backing up your seed phrase.

Related posts

[1] A wallet may contain metadata, such as notes on transactions or software configuration preferences. This data isn’t recovered if you restore you wallet from just your seed phrase. But all your private keys can be regenerated. Think of the private keys as traditional metal house keys and the metadata as a sticky note on top of the keys. If you lose your keys and the note, you’d be happy to get the keys back.