Text case changes the size of QR codes

Let’s make a QR code out of a sentence two ways: mixed case and upper case. We’ll use Python with the qrcode library.

>>> import qrcode
>>> s = "The quick brown fox jumps over the lazy dog."
>>> qrcode.make(s).save("mixed.png")
>>> qrcode.make(s.upper()).save("upper.png")

Here are the mixed case and upper case QR codes.

The QR code creation algorithm interprets the mixed case sentence as binary data but it interprets the upper case sentence as alphanumeric data.

Alphanumeric data, in the context of QR codes, comes from the following alphabet of 45 characters:

0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:

Since 45² = 2025 < 2048 = 211 two alphanumeric characters can be encoded in 11 bits. If text contains a single character outside this alphabet, such as a lower case letter, then the text is encoded as ISO/IEC 8859-1 using 8 bits per character.

Switching from mixed-case text to upper case text reduces the bits per character from 8 to 5.5, and so we should expect the resulting QR code to require about 30% fewer pixels. In the example above we go from a 33 × 33 grid down to a 29 × 29 grid, from 1089 pixels to 841.

Application to Bitcoin addressess

Bech32 encoding uses an alphabet of 32 characters while Base58 encoding uses an alphabet of 58 characters, and so the former needs about 17% more characters to represent the same data. But Bech32 uses a monocase alphabet, and base 58 does not, and so Bech32 encoding requires fewer QR code pixels to represent the same data as Base58 encoding.

(Bech32 encoding uses a lower case alphabet, but the letters are converted to upper case before creating QR codes.)

Related posts

Why and how Bitcoin uses Merkle trees

Yesterday’s post looked at a recently mined Bitcoin block, the 920,994th block in the blockchain, and verified that it contains the hash of the previous block. This post will look at the same block and verify its Merkle tree root.

Before getting down to the bytes, we’ll back up and say what a Merkle tree is and why Bitcoin uses one.

What is a Merkle tree?

A Merkle tree is a binary tree. The leaves of the tree are labeled with a hash of the data associated with each leaf. The label of each interior node is the hash of the concatenation of the two child hashes. The label of the root of the tree is the Merkle tree root.

The smallest change to any data in the tree will change the Merkle tree root [1]. If you weren’t concerned with efficiency, you wouldn’t need a Merkle tree. You could simply concatenate all the data in the leaves and compute a hash. But the Merkle tree makes it possible to verify a node without having to verify the whole tree.

To see this, imagine starting at the bottom of the tree and working your way up to compute the Merkle root. You need the hash value of the node you’re interested in and its sibling. Then you can compute the label of the parent node. If you know the label of the parent’s sibling node (you could call it the aunt or uncle) then you can compute the label of the grandparent node, and so forth until you get to the root. If you get the expected value for the root, the data is correct.

How Bitcoin uses Merkle trees

Bitcoin makes a Merkle tree with the data for each transaction being a leaf node. Simple Payment Verification (SPV) lets you verify a transaction knowing only the hash of the transaction, the hash of its sibling, and the hashes of the siblings moving up the tree: the aunts, great aunts, great great aunts etc.

If a block has 2n transactions, you only need n hash values to verify a single transaction. The binary tree has n levels, and you need two hashes at the bottom, and one hash for each of the levels between the bottom and the top. So in a block of 1024 transactions, you need 10 hash values. Note that this saves space two ways: you don’t need transaction data per se, only hashes of transaction data, and the number of hashes you need is log2 of the number of transactions.

What if the number of transactions is not a power of 2? If there is an odd number of nodes at a given level of the Merkle tree, the last node is paired with itself.

For example, the block that we’ll examine in detail contains 1279 transactions. At the bottom of our Merlke tree there are effectively 1280 nodes, with the 1280th node being a duplicate of the 1279th. Moving up the tree we have 640 nodes, 320 nodes, 160 nodes, … 10 nodes, and 5 nodes. The 5th node is paired with itself, and no the next level up has 3 nodes. The third node is paired with itself, making 2 nodes on the next level, and then the root above that.

Down to the bytes

The command

xxd -l 80 920994.dat

gives a hex dump of the block header:

0000 002e c2df b57e f582 75c2 7a64 33c5
edfd dfc1 af6a b9ae 708c 0000 0000 0000
0000 0000 bb15 accc 8940 3811 0b9b e1cf
b125 c0fa a65e fb1b 2fa4 61ed 989f 8d6f
e41e 04cf 5522 ff68 21eb 0117 f2bc ab1a

This time we’re looking at bytes 37 through 68, highlighted in orange. This is the Merkle tree root.

A miner verifying the block would recompute the Merkle tree root and confirm that it matches the value given in the header.

A Bitcoin block does not contain a Merkle tree per se, only its root. The hash of the data for each transaction, what Bitcoin calls a txid, is redundant since it can be computed. And if it were included in the block, a miner would need to recompute it to verify that it is correct.

We can fetch a list of txids for a block by calling

https://blockstream.info/api/block/{block_hash}/txids

with the hash of the block we want, which in our case is

00000000000000000001c2f89cee9f8c2fe88b4a93c2c2b75192918fc438ca32

This returns a JSON string with 1279 hash values:

fc1d784c62600565d4b4fbfe9cd45a7abc5f1c3273e294fd2b44450c5c9b9e9a
d4bce41744156a8f0e6588e42efef3519904a19169212c885930e908af1457ec
7bc6428c7d023108910a458c99359a2664de7c02343d96b133986825297e518d
…
c9132f178830d0c7a781e246bb5c2b3f4a9686c3d2b6f046b892a073d340eaff

Note that the API call did not pull the txids from the block; it computed them. Presumably the server computed the list of txids once and cached the results rather than computing them when we called the API.

We can compute the Merkle tree root by concatenating pairs of txids and hashing the results. Since the number of transactions is odd, the last txid is paired with itself. So moving one level up the tree, the first node is labeled with the hash of

fc1…e9ad4b…7ec

and the last node is labeled with the hash of

c91…affc91…aff

We continue this process until we compute the Merkle tree root. Then we can confirm that it matches the value given in the header. As with the block hash in the previous post, the Merkle tree root is stored in the header in little endian form.

Related posts

[1] There is a vanishingly small chance that a change in the data would not change the root. With a 256-bit hash, the chances of a change in the data not changing the root are 1 in 2256.

How blocks are chained in a blockchain

The high-level explanation of a blockchain says that each block contains a cryptographic hash of the previous block. That’s how the blocks are chained together.

That’s not exactly true, and it leaves out a lot of detail. This post will look in full detail at how Bitcoin blocks are chained together by inspecting the bits of two consecutive blocks. Looking at the low-level details reveals that some statements, like the paragraph above, are simplifications and half-truths.

For the purpose of this post, I downloaded two blocks that were added to the blockchain overnight, blocks 920993 and 920994, and saved the blocks in the binary files 920993.dat and 920994.dat.

Hashing headers

According to the simplistic description, the hash of block 920993 should be contained in block 920994. That’s not correct. We will see that the hash of the header of block 920993 is contained in block 920994 [1].

What exactly is the header?

You may hear that the header of a Bitcoin block is the first 80 bytes. That’s also not quite true. The first 4 bytes of a (production) Bitcoin block are the magic number 0xf9beb4d9. This is number chosen by Satoshi with no apparent significance, a random number unlikely to conflict with anything else. Blocks used in test versions of the blockchain begin with different magic numbers.

The next 4 bytes represent the size of the block as an unsigned integer in little endian layout. The magic number and the block size form a sort of pre-header 8 bytes long.

The API that I used to download the blocks does not include the pre-header, so the header is the first 80 bytes of the files I downloaded, though strictly speaking headers are bytes 9 through 89 of the full block.

We can see a hex dump of the header of block 920993 by running

xxd -l 80 920993.dat

which shows us the following.

00e0 ff3f e31d 6937 0e1c dba2 5321 5546
0ecc 00bf 678c a2e1 255e 0100 0000 0000
0000 0000 996f 2b91 fefc dc17 e530 6c70
9672 af27 4361 7608 1ded fde3 1157 10a5
200f 0f83 7022 ff68 21eb 0117 96f2 fbb6

How to hash?

OK, so we’re supposed to hash the header. What hash function should we apply? Bitcoin uses double SHA256,

SHA256²(header) = SHA256( (SHA256(header) )

We can compute this with openssl by running

head -c 80 920993.dat | openssl dgst -sha256 -binary | openssl dgst -sha256

Note that the first invocation of openssl dgst uses the option -binary, instructing the software to pass the raw bytes to the rest of the pipeline rather than display a text representation. The last part of the pipeline does not have that option because we want a human-readable representation at the end. The output is

c2dfb57ef58275c27a6433c5edfddfc1af6ab9ae708c00000000000000000000

Note that there are a lot of zeros on the end of the hash. That’s not a coincidence. More on that later.

Header of the next block

In the previous section we found the hash of block 920993, and we expect to see it inside the header of block 920994.

xxd -l 80 920994.dat

we can see that the header of block 920994 contains the following.

0000 002e c2df b57e f582 75c2 7a64 33c5
edfd dfc1 af6a b9ae 708c 0000 0000 0000
0000 0000 bb15 accc 8940 3811 0b9b e1cf
b125 c0fa a65e fb1b 2fa4 61ed 989f 8d6f
e41e 04cf 5522 ff68 21eb 0117 f2bc ab1a

The first 4 bytes (8 hex characters) are a version number, and following the version number we see the hash we were looking for.

Byte order

Why all the zeros in the hash of the block header? That’s a result of the proof of work problem that had to be solved in order to add block 920993 to the blockchain. Bitcoin miners tweak the details of the block [2] until they create a block whose hash begins with the required number of 0 bits [3].

But the hash value above doesn’t begin with zeros; it ends with zeros. What’s up with that?

The Bitcoin header and openssl dgst both display hashes in little endian order, i.e. the reverse of what you’d expect from a positional number system.

Related posts

[1] But if you only hash the header, couldn’t someone change the rest of the block? No, because the header contains the Merkle tree root. If someone changed a bit in the body of the block, that would change the Merkle tree root, which would change the header, which would change its hash.

[2] They don’t change the substance of the transactions, but they can change the order in which the transactions are included. And there’s a nonce value that they can change too. The only known way to produce a given number of zeros is by trial and error, and so this takes a lot of work. Having a block with the right leading zeros in its hash proves that you’ve put in the work, hence proof of work.

[3] This is another simplified half-truth. You don’t need to produce a certain number of leading zeros per se; you need to find a hash value less than a target value, and the target value is not in general an exact power of 2.

Ethereum’s consensus layer elliptic curve

I’ve written before about Bitcoin’s elliptic curve and Monero’s elliptic curve. In the Monero post I wrote “Bitcoin and Ethereum use the elliptic curve secp256k1.” That’s true, but it’s incomplete. Ethereum does use the elliptic curve secp256k1 for digital signatures, as does Bitcoin, but Ethereum also uses a different elliptic curve for its consensus layer.

Ethereum’s consensus layer uses the elliptic curve BLS12-381. This post will say a little bit about this curve, starting with unpacking the cryptic name. I won’t go into the advanced math behind the curve’s design but instead will focus on concrete calculations in Python to try to make the curve more tangible. The same curve is used in other cryptocurrencies, including Zcash.

First of all, BLS stands for Paulo Barreto, Ben Lynn, and Michael Scott, the developers of a family of elliptic curves known as the BLS curves, including BLS12-381. Incidentally, in the context of cryptography, BLS can refer to the BLS curves or to BLS signatures. The “L” is the same person in both instances, but the “B” and “S” in BLS signatures refer to Dan Boneh and Hovav Shacham.

The 381 in BLS12-381 refers to the fact that it is defined over a finite field whose order is a 381-bit number, namely

p = 0x1a0111ea397fe69a4b1ba7b6434bacd764774b84f38512bf6730d2a0f6b0f6241eabfffeb153ffffb9feffffffffaaab

We can verify that p is prime and that it has 381 bits.

>>> from sympy import isprime
>>> isprime(p)
True
>>> 2**380 < p < 2**381
True

The value of p comes from applying a certain polynomial to a parameter z with low Hamming weight, i.e. with lots of zeros in its binary representation. Why this matters is beyond the scope of this post, but we can show that it’s true.

>>> z = -0xd201000000010000
>>> p == (z-1)**2*(z**4 - z**2 + 1)//3 + z
True

The elliptic curve BLS12-381 is the set of points satisfying

y² = x³ + 4 mod p.

The 12 in BLS12-381 refers to an embedding degree k that we’ll get to shortly.

The elliptic curve BLS12-381 is pairing friendly, which is the reason for its use in the Ethereum consensus layer. This layer uses pairing-based cryptography to aggregate signatures. I may write more about that someday, but not today.

As I wrote a couple months ago,

An elliptic curve E/Fq is said to be pairing-friendly if r divides qk − 1 for some small k. Here r is the size of the largest prime-order subgroup of the curve.

In the case of BLS12-381, r = z4z2 + 1 and r is a 255-bit prime number.

>>> r = z**4 - z**2 + 1
>>> isprime(r)
True
>>> 2**254 < r < 2**255
True

And now we can verify that that the embedding degree is 12, showing the BLS12-381 is a pairing-friendly curve.

>>> (p**12 - 1) % r
0

So what is being paired with what? And what is being embedded into what? The group G1 of order r, is a subgroup of BLS12-381. It is paired with another group G2 also of order r, and there is a bilinear mapping from (G1, G2) to the multiplicative group of the finite field with p12 elements. For more details, see the section on BLS12-381 in Ben Edginton’s book Upgrading Ethereum: A technical handbook on Ethereum’s move to proof of stake and beyond.

Related posts

Zcash price doubled

My interest in cryptocurrency is primarily technical, and privacy coins are the most technically interesting. The math behind coins like Monero and Zcash is far more sophisticated than the math behind Bitcoin.

I’m also interested in the privacy angle since I work in data privacy. The zero-knowledge proof (ZKP) technology developed for and tested in privacy coins is applicable outside of cryptocurrency.

I don’t follow the price of privacy coins, but I heard about the price of Zcash doubling recently.

I’d like to understand why that happened, especially why it happened so suddenly, though I’m very skeptical of post hoc explanations of price changes. I roll my eyes when I hear reporters say the stock market moved up or down by some percent because of this or that event. Any explanation for the aggregate behavior of a complex system that fits into sound byte is unlikely to contain more than a kernel of truth.

Update: The price has gone up another 65% in the two days since I wrote this post.

Naval Ravikant said on October 1

Bitcoin is insurance against fiat.
ZCash is insurance against Bitcoin.

This was after Zcash had started rising, so Naval’s statement wasn’t the cause of the rise. But the idea behind his statement was in the air before he crystalized it. Maybe the recent increase in the price of Bitcoin led to the rise of Zcash as a hedge against Bitcoin falling. It’s impossible to know. Traders don’t fill out questionnaires explaining their motivations.

It has been suggested that more people are interested in using Zcash as a privacy-preserving currency, not just buying it for financial speculation. It’s plausible that this could be a contributing factor in the rise of Zcash, though interest in financial privacy is unlikely to spike over the course of one week, barring something dramatic like FDR’s confiscation of private gold in 1933.

The price of Monero increased along with Zcash, though not as by nearly as much.

This suggests there may be increased interest in privacy coins more generally and not just in Zcash.

It would be interesting to see how people are using Zcash and Monero—how much is being spent on goods and services and how much is just being traded as an investment—but this would be difficult since privacy coins obscure the details of transactions by design.

Related posts

Memorizing a list of seed words

If you need to memorize a list of up to eight words, particularly for a short period of time, the most efficient method may be brute force: rehearse the words in sequence until you can remember them. But if you need to remember more words, or remember the list for a longer time, some sort of peg system would be better.

Pegs

A peg system begins with a list of images firmly associated with integers. You drill these associations into your brain like driving pegs into wood, then “hang” things on the pegs.

The choice of pegs could be arbitrary, but it’s easier to acquire a set of pegs if there’s some pattern to them, such as the Major system. An example set of pegs based on the Major system follows.

  1. tea
  2. Noah
  3. ammo
  4. arrow
  5. law
  6. show
  7. key
  8. ivy
  9. pie
  10. toes
  11. toad
  12. dune

and so on.

Learning a set of pegs is harder than hanging things on the pegs. But once you have a peg system, you can reuse it for many different things.

Numbered lists

Here’s something I resisted but eventually came to accept: it’s easier to memorize a numbered list than an unnumbered list. For example, it’s easier to learned a numbered list of the US presidents—#1 Washington, … #16 Lincoln, … etc.—than to simply learn to recite the list.

This is counterintuitive because it requires memorizing more information. But the key is this: with a numbered list, you only have to remember pairs of pegs and items. To memorize a list of 100 items, you don’t have to memorize a sequence of 100 things; you have to remember a set of 100 pairs. If you forget an item, you forget one item: it doesn’t throw off anything else. It’s also easier to review and test yourself on a numbered list.

I’ve noticed when journalists are reporting on some memory stunt, they’ll say something like “Not only can he recite the Fortune 500, he could pull up each one by number.” Reading between the lines, this person used a peg system. He memorized each company with its number, not to make things harder, but to make things easier.

Seed phrases

A seed phrase for a crypto wallet is a list of 12 to 24 words. You could memorize a list of 12 words by rote without too much effort, but a list of 24 would be more than twice as hard. Since you want to remember wallet seed words for a long time, and there are potentially huge consequences to forgetting part of the list, I’d recommend a peg system.

To create an example, I chose 12 words by sampling the BIP39 word list with replacement using the command

    shuf -n 12 -r english.txt

This returned the following words:

  1. entry
  2. lava
  3. foot
  4. vibrant
  5. diet
  6. pulp
  7. canal
  8. name
  9. noble
  10. dream
  11. breeze
  12. bar

This probably isn’t a valid wallet address, but it’s like one [1].

To memorize the list using the pegs above, you would create mental images associating tea with entryNoah with lavaammo with foot, etc. You might imagine Noah’s ark on a sea of lava, or shooting yourself in the foot.

Noah's ark floating on lava

As I’ve written before, some of the BIP39 seed words are less than ideal for memorization. Lava creates a vivid mental image; entry does not. Maybe you could change entry to entryway, imagining a giant teapot in the entryway to your house.

The BIP39 words are uniquely determined by their first four letters. You could change the words if you like, as long as you keep the initial letters the same. For example, if you tried to type entryway into a wallet, autocomplete would kick in after you typed entr and so the rest of the letters don’t matter.

Recall

To recall the list, you go down your list of pegs and see what is associated with each. This is why pegs have to be unique. If you want to encode a number using the Major system, you can encode the same digits different ways at different times. For instance, maybe you encode 12 as Dune one time and Athena the next. But when recalling a list, you need to recall the peg for a number and recall what it’s associated with.

Related posts

[1] The seed words come from a list of 2048 words, so each word carries 11 bits, so 12 words encode a 128-bit address with 4 checksum bits left over. So there’s a 15/16 probability that a list of 12 words does not encode a valid address.

Silent Payments

Bitcoin transactions appear to be private because names are not attached to accounts. But that is not sufficient to ensure privacy; if it were, much of my work in data privacy would be unnecessary. It’s quite possible to identify people in data that does not contain any direct identifiers.

I hesitate to use the term pseudonymization because people define it multiple ways, but I’d say Bitcoin addresses provide pseudonymization but not necessarily deidentification.

Privacy and public ledgers are in tension. The Bitcoin ledger is superficially private because it does not contain user information, only addresses [1]. However, transaction details are publicly recorded on the blockchain.

To add some privacy to Bitcoin, addresses are typically used only once. Wallet software generates new addresses for each transaction, and so it’s not trivial to track all the money flowing between two people. But it’s not impossible either. Forensic tools make it possible to identify parties behind blockchain transactions via metadata and correlating information on the blockchain with information available off-chain.

Silent Payments

Suppose you want to take donations via Bitcoin. If you put a Bitcoin address on your site, that address has to be permanent, and it’s publicly associated with you. It would be trivial to identify (the addresses of) everyone who sends you a donation.

Silent payments provide a way to work around this problem. Alice could send donations to Bob repeatedly without it being obvious who either party is, and Bob would not have to change his site. To be clear, there would be no way to tell from the addresses that funds are moving between Alice and Bob. The metadata vulnerabilities remain.

Silent payments are not implemented in Bitcoin Core, but there are partial implementations out there. See BIP 352.

The silent payment proposal depends on ECDH (elliptic curve Diffie-Hellman) key exchange, so I’ll do a digression on ECDH before returning to silent payments per se.

ECDH

The first thing to know about elliptic curves, as far as cryptography is concerned, is that there is a way to add two points on an elliptic curve and obtain a third point. This turns the curve into an Abelian group.

You can bootstrap this addition to create a multiplication operation: given a point g on an elliptic curve and an integer nng means add g to itself n times. Multiplication can be implemented efficiently even when n is an enormous number. However, and this is key, multiplication cannot be undone efficiently. Given g and n, you can compute ng quickly, but it’s practically impossible to start with knowledge of ng and g and recover n. To put it another way, multiplication is efficient but division is practically impossible [2].

Suppose Alice and Bob both agree on an elliptic curve and a point g on the curve. This information can be public. ECDH lets Alice and Bob share a secret as follows. Alice generates a large random integer a, her private key, and computes a public key A = ag. Similarly, Bob generates a large random integer b and computes a public key Bbg. They’re shared secret is

aBbA.

Alice can compute aB because she (alone) knows a and B is public information. Similarly Bob can compute bA. The two points on the curve aB and bA are the same because

aBabgbagbA.

Back to silent payments

ECDH lets Alice and Bob share a secret, a point on a (very large) elliptic curve. This is the heart of silent payments.

In its simplest form, Alice can send Bob funds using the address

PB + hash(aBg.

A slightly more complicated form lets Alice send Bob funds multiple times. The kth time she sends money to Bob she uses the address

PB + hash(aB || kg

where || denotes concatenation.

But how do you know k? You have to search the blockchain to determine k, much like the way hierarchical wallets search the blockchain. Is the address corresponding to k = 0 on the blockchain? Then try again with k = 1. Keep doing this until you find the first unused k. Making this process more efficient is one of the things that is being worked on to fully implement silent payments.

Note that Bob needs to make B public, but B is not a Bitcoin address per se; it’s information needed to generate addresses via the process described above. Actual addresses are never reused.

Related posts

[1] Though if you obtain your Bitcoin through an exchange, KYC laws require them to save a lot of private information.

[2] Recovering n from ng is known as the discrete logarithm problem. It would be more logical to call it the discrete division problem, but if you write the group operation on an elliptic curve as multiplication rather than addition, then it’s a discrete logarithm, i.e. trying to solve for an unknown exponent. If and when a large-scale quantum computer exists, the discrete logarithm problem will be practical to solve, but presumably not until then.

Monero’s seed phrase words

I wrote a couple posts last month about the seed phrase words used by Bitcoin and other cryptocurrencies. There are 2048 words on the BIP39 list. Monero uses a different word list, one with 1626 words [1]. You can find Monero’s list here.

Why 1626 words?

It’s not hard to guess why the BIP 39 list has 2048 words: each one encodes 11 bits of a key because 211 = 2048. It’s not as obvious where the number 1626 comes from. It is the smallest value of n such that

n24 > 2256

Monero uses a seed phrase of 25 words, but the last word is a checksum, so there are 24 words which are used to create a 256-bit private key.

Distinctiveness

I criticized the BIP 39 list for being less than ideal for memorization because some words are similar phonetically, such as angle and ankle. Other words are harder to remember because they are not vivid nouns or verbs, such as  either or neither.

The Monero list has 200 words that differ by one character, compared to 484 for BIP 39.

But as with the BIP 39 list, some of Monero’s words are similar and not easy to visualize, such as adapt, adept, and adopt.

Prefix uniqueness

One nice feature of the BIP 39 list is that the first four letters of each word are distinct. So if you’re typing the words, autocomplete can fill in the rest of the word by the time you’ve entered four letters.

Monero goes one step further: all words are uniquely determined by the first three letters.

Overlap with other lists

Only about 1/4 of the words on Monero’s list. And most of Monero’s words are not in Google’s list of 10,000 most common words.

venn diagram M: 1626, B: 2048, M∩B: 425, G: 10000, M∩G: 701, B∩G: 1666, M∩B∩G: 339

Related posts

[1] Monero has other ways to convert words to keys. The way described here is now known as the Legacy mnemonics. The new Polyseed algorithm uses the BIP 39 word list.

Bech32 encoding

Bech32 is an algorithm for encoding binary data, specifically Bitcoin addresses [1], in a human-friendly way using a 32-character alphabet. The Bech32 alphabet includes lowercase letters and digits, removing the digit 1, and the letters b, i, and o.

The Bech32 alphabet design is similar to that for other coding schemes in that it seeks to remove letters that are visually similar. However, the order of the alphabet is unusual:

    qpzry9x8gf2tvdw0s3jn54khce6mua7l

I’ll explain the motivation for the ordering shortly. Here’s the same alphabet in conventional order, not the order used by Bech32:

    023456789acdefghjklmnpqrstuvwxyz

The Bech32 algorithm does more than represent 5-bit words as alphanumeric characters. The full algorithm is documented here. The algorithm is applied to the output of a RIPEMD-160 hash, and so the number of bits is a multiple of 5.

The Bech32 algorithm includes a BCH (Bose–Chaudhuri–Hocquenghem) checksum, and takes its name from this checksum.

Note that the BCH checksum is not a cryptographic hash; it’s an error correction code. Its purpose is to catch mistakes, not to thwart malice. Contrast this with Base58check that uses part of a SHA256 hash. Base58check can detect errors, but it cannot correct errors.

I’ve had a surprisingly hard time finding the specifics of the BCH(nd) code that Bech32 uses. Every source I’ve found either does not give values of n and d or gives different values. In any case, the order of the Bech32 alphabet was chosen so that mistakes humans are more like to make are more likely mistakes that BCH can detect and correct.

Related posts

[1] Bitcoin used bech32 encoding briefly before switching to a minor modification bech32m.

Vanity addresses

Bitcoin addresses are essentially hash values of public keys encoded in Base58. More details here.

The addresses are essentially random characters chosen from the Base58 alphabet: uppercase and lowercase Latin letters and digits, with 0 (zero), I (capital I), O (capital O), and l (lowercase l) removed to prevent errors.

You could create an address that begins with chosen characters by generating private/public key pairs and hashing the latter until you get the initial string you want [1]. People have done this. The vanitygen software gives the example of searching for an address that starts with “Love” (after the mandatory initial 1).

$ ./vanitygen 1Love
...                         
Pattern: 1Love
Address: 1LoveRg5t2NCDLUZh6Q8ixv74M5YGVxXaN
Privkey: 5JLUmjZiirgziDmWmNprPsNx8DYwfecUNk1FQXmDPaoKB36fX1o

Search time

You can’t specify a long prefix. The difficulty in finding an address with n specified initial characters grows exponentially, like 58n.

Say it takes a minute on average to find an address with the desired first four characters. Each additional character makes the search take 58 times longer, so specifying 5 characters would take about an hour. If you wanted to find an address that begins 1AcmeCorp it would take 584 minutes or over 21 years.

Disadvantages

There are several reasons you might not want to use a vanity address.

First, it’s now customary to generate new addresses for each transaction. This gives Bitcoin some degree of privacy, though not much. Presumably if you go to the effort of creating a custom address you’d like to hold on to it. Maybe you’d like to use it as an address for someone to send you donations, for example, and you’re OK reusing the address.

Second, if you use software someone else wrote to generate the custom address, the site may hold onto your private key or send it somewhere.

Third, you can’t use the address with an HD wallet. Related to the first point, HD wallets generate a sequence of key pairs, and thus addresses, according to some deterministic algorithm, and so it can’t hold an address it didn’t generate. But there are wallets that will hold such addresses.

Proof of work

Incidentally, the proof-of-work task for Bitcoin is very similar to finding vanity addresses. You take a block and twiddle what you can until it hashes to a value with a target number of leading zero bits. More on that here.