Compressing a set of hash values

Suppose you have a set of k hash values, each n bits long. Can you compress the set into less than kn bits?

It’s not possible to compress a list of hashes into fewer than kn bits, but you can compress a set into fewer bits, because a set, unlike a list, carries no ordering information.

Suppose you have a set of 2^30, roughly a billion, 64-bit hashes. Sort the set and look at the size of gaps between elements. You might expect that consecutive items on the list are roughly 2^34 apart, and so you could reconstruct the list by reporting the first item and the gaps between the rest, which are 34-bit numbers, not 64-bit numbers, a savings of 30 bits per hash.

This doesn’t exactly work, but it’s the kernel of a good idea. We don’t know that the distance between hashes can be represented by a 34-bit number. The gap could be more or less than 2^34, but we don’t expect it to often be much more than 2^34. So we use a variable-length encoding such that when the distance between values is on the order of 2^34, or less, we save bits, but we allow for the distance to be any size.
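
Here is a minimal Python sketch of the idea, scaled down to 2^20 values so it runs quickly; the variable names are mine.

    import math, random

    # Scaled-down sketch: 2^20 random 64-bit values stand in for the 2^30 hashes above.
    k = 2**20
    hashes = sorted(random.getrandbits(64) for _ in range(k))
    gaps = [b - a for a, b in zip(hashes, hashes[1:])]

    # The gaps average about 2^64 / 2^20 = 2^44, so storing the first hash and then
    # the gaps costs roughly 44 bits per value instead of 64.
    print(f"average gap is about 2^{math.log2(sum(gaps) / len(gaps)):.1f}")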

Applications

What we’ve described is the essence of Golomb-Rice coding. Its implementation in the Bitcoin protocol is referred to as Golomb-Coded Sets (GCS), described in BIP 158. Golomb-Rice coding is also used in other applications where the values to be compressed are not hashes, such as lossless audio compression.

Encoding

Let’s go into some detail as to how the distances between sorted values are represented. Suppose you expect the differences to be on the order of M where M is a power of 2. For each difference d, let q and r be the quotient and remainder by M, i.e.

d = qM + r.

Encode q as a unary number, i.e. a string of q 1s, and encode r as an ordinary binary number. Then the Golomb-Rice coding of d is the concatenation of the representations of q and r, with a 0 in the middle as a separator. Using || to denote string concatenation we have

unary(q)  ||  0  ||  binary(r).

In general, unary encoding is extremely inefficient, but we’re betting that q will typically be quite small.

The reason we require M to be a power of 2 is so the representation of r will have a fixed length [1].

Let’s work out an example. Suppose M = 2^20 and

d = 2^22 + 123456 = 4 × M + 123456.

Then we write q as 1111 and r as 00011110001001000000 and encode d as the string

1111000011110001001000000
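
A few lines of Python make the encoding concrete. This is a sketch that reproduces the example above, with the fixed remainder width inferred from M.

    def golomb_rice_encode(d, M):
        """Encode a nonnegative integer d with Rice parameter M, a power of 2."""
        width = M.bit_length() - 1              # fixed number of bits for the remainder
        q, r = divmod(d, M)
        return "1" * q + "0" + format(r, f"0{width}b")

    print(golomb_rice_encode(2**22 + 123456, 2**20))
    # 1111000011110001001000000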

Decoding

You can concatenate the encodings of consecutive d values and still be able to unambiguously decode the result. Because the r representations have a constant length, you know when an r ends and the next q begins.

For example, suppose we have the following string of bits.

1110010011001011001011111001000000110010001110011101111000101111011

We decode the string from left to right. We count how many 1s we see, skip over a 0, then read the next 20 bits as a binary number.

111 0 01001100101100101111 1 0 01000000110010001110 0 11101111000101111011

We see three 1s before the first 0 and so we conclude q1 = 3. We skip over the 0 and read the value of r.

r1 = 01001100101100101111₂ = 314159.

Next we see a single 1 before the next 0, so q2 = 1. We read the next value of r as

r2 = 01000000110010001110₂ = 265358.

Next we see a 0, i.e. there are no 1s, and so the final q3 = 0. And we have

r3 = 11101111000101111011₂ = 979323.

So our reconstructed values of d are

d1 = q1 M + r1 = 3 × 2^20 + 314159 = 3459887

d2 = q2 M + r2 = 1 × 2^20 + 265358 = 1313934

d3 = q3 M + r3 = 0 × 2^20 + 979323 = 979323.

Now these are difference values. We need to know the smallest value m in order to construct the original set of values from the differences. Then the full values are m, m + d1, m + d1 + d2, and m + d1 + d2 + d3.
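
Here is a matching decoder sketch in Python, run against the example bit string; it recovers the three d values above.

    def golomb_rice_decode(bits, M):
        """Decode a concatenation of Golomb-Rice code words into a list of values."""
        width = M.bit_length() - 1
        values, i = [], 0
        while i < len(bits):
            q = 0
            while bits[i] == "1":               # count 1s to get the quotient
                q += 1
                i += 1
            i += 1                              # skip the 0 separator
            r = int(bits[i:i + width], 2)       # fixed-width remainder
            i += width
            values.append(q * M + r)
        return values

    bits = "1110010011001011001011111001000000110010001110011101111000101111011"
    print(golomb_rice_decode(bits, 2**20))      # [3459887, 1313934, 979323]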


[1] This is the Rice part. Robert Rice simplified Samuel Golomb’s encoding scheme in the special case that M is a power of 2.

Preprocessing text to make it more compressible

Repetitive text compresses efficiently. Text like the lyrics to Jingle Bells ought to compress more efficiently than ordinary prose, assuming the compression algorithm can exploit the repetition.

The idea of the Burrows-Wheeler transform is to permute text before compressing it. The hope is that the permutation will make the repetition in the text easier for a compression algorithm to take advantage of. The permutation must be reversible, so the compressed file can be uncompressed later.

It’s not immediately obvious that the Burrows-Wheeler transform is reversible; that may be the topic of another post some day. This post will explain what the transform is and show that it rearranges text in a way that makes run-length encoding more efficient.

Algorithm and example

The Burrows-Wheeler transform (BWT) lists all the rotations of a string in a table, sorts the table, then takes the last column as the transform value. (Software implementing the BWT does not need to actually construct the table, but it gives the same result as if it did.)

Before tabulating the rotations, the algorithm adds a character for beginning of string and one for end of string. We’ll use ^ and $ respectively, based on the meaning of these symbols in regular expressions.

Here is an example applied to wabisabi.

^wabisabi$          ^wabisabi$
$^wabisabi          $^wabisabi
i$^wabisab          abi$^wabis
bi$^wabisa          abisabi$^w
abi$^wabis          bi$^wabisa
sabi$^wabi          bisabi$^wa
isabi$^wab          i$^wabisab
bisabi$^wa          isabi$^wab
abisabi$^w          sabi$^wabi
wabisabi$^          wabisabi$^

The table on the left lists all rotations of ^wabisabi$, and the table on the right is the sorted form of the table on the left. The last column, $iswaabbi^, is the BWT of wabisabi. Note that the a's and the b's are grouped together in the transformed string.

When we sort the table we run into an annoying complication: what is the sort order of the markers for beginning of text and end of text? I used the Python code from the Wikipedia article on BWT for the examples below, so I followed the code’s sorting convention that the begin string symbol comes first, then the end string symbol, then ordinary text.
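
Here is a short Python sketch of the transform (not the Wikipedia code, but following the same sorting convention) that reproduces the example above.

    def bwt(s):
        """Burrows-Wheeler transform with explicit begin/end markers ^ and $."""
        s = "^" + s + "$"
        # Sort rotations with the convention ^ < $ < ordinary characters.
        rank = {"^": 0, "$": 1}
        key = lambda rotation: [rank.get(c, ord(c) + 2) for c in rotation]
        rotations = sorted((s[i:] + s[:i] for i in range(len(s))), key=key)
        return "".join(rotation[-1] for rotation in rotations)

    print(bwt("wabisabi"))                      # $iswaabbi^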

Run-length encoding

At the top of the post I mentioned the lyrics to Jingle Bells as an example of text with repetition. If we calculate the BWT of

jingle bells, jingle bells, jingle all the way.

we get

$.eee,,lessy w  lllhbbnnntjjj  ^lgggaeelliiill  a

Now we have a lot of repeated letters, which means we could compress the string with run-length encoding. For example, we could write j3 rather than jjj to indicate a string of three j’s.

$.e3,2les2y w 2l3hb2n3tj3 2^lg3ae2l2i3l2 2a

This isn’t much shorter, though in practice run-length encoding would be implemented in binary and would not use an ASCII character 2 to indicate that a character is repeated.
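
A sketch of this run-length encoding convention in Python, applied to the transformed lyric above:

    from itertools import groupby

    def rle(s):
        """Run-length encode, writing the count only when a character repeats."""
        out = []
        for c, group in groupby(s):
            n = len(list(group))
            out.append(c if n == 1 else c + str(n))
        return "".join(out)

    print(rle("$.eee,,lessy w  lllhbbnnntjjj  ^lgggaeelliiill  a"))
    # $.e3,2les2y w 2l3hb2n3tj3 2^lg3ae2l2i3l2 2a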

We could make a string of text even more compressible if we simply sorted it. But sorting is not reversible: any permutation of the text would lead to the same sorted text. This is what we’d get if we sorted the text in the example above:

       ,,.aabbeeeeeeggghiiijjjlllllllllnnnsstwy

Run-length encoding compresses this even better, though the result cannot be uncompressed.

 7,2.a2b2e6g3hi3j3l9n3s2twy

The way to think of BWT is that it partially sorts text in a reversible way. This is kinda paradoxical: sorting is not reversible. You either sort the characters, or you perform a reversible permutation. You can’t do both. But due to the repetition patterns in natural language text [1] you can in practice do both.

The BWT applied to arbitrary text won’t partially sort it. You can’t apply a reversible operation to random input and make it more ordered. But when applied to ordinary prose, BWT does tend to cluster letters together, especially when applied to longer inputs.

As an example of longer input, I applied the BWT to Lincoln’s Gettysburg Address (1454 characters). Here’s an excerpt from the middle of the output.

arrouiuuiga     sttcctsss tttttttttwtttwt     TtttttttttTsttttt
tttwtttww ttttgwww tsgggtLddddddhhddffhmvw    f tvtnttvafattttttttteb
h hh hrn  sflleelcllsrallllllaiap  ueretppptg     aiuaaaa  laobhooeeo
e  n  oioiieaoaaaaoooooieoeooi     aooiaaaaaauuei   uiiioioiieiir  o      

This shows that the output is indeed partially sorted.


[1] The BWT is applied to more than natural text. It is used, for example, in compressing genetic sequence data. The key property of the input is that some symbols tend to follow other symbols. This is the kind of repetition that the algorithm is intended to exploit.

Compression and interpolation

Data compression is everywhere. We’re unaware of it when it is done well. We only become aware of it when it is pushed too far, such as when a photo looks grainy or fuzzy because it was compressed too much.

The basic idea of data compression is to not transmit the raw data but to transmit some of the data along with instructions for how to approximately reconstruct the rest [1].

Fifty years ago scientists were concerned with a different application of compression: reducing the size of mathematical tables. Books of tabulated functions are obsolete now, but the principles used in producing these tables are still very much relevant. We use compression and interpolation far more often now, though it’s almost always invisibly executed by software.

Compressing tables

In this post I want to expand on some comments on compression by Forman Acton from his book Numerical Methods That Work.

Many persons are unaware of the considerable compression in a table that even the use of quadratic interpolation permits. A table of sin x covering the first quadrant, for example, requires 541 pages if it is to be linearly interpolable to eight decimal places. If quadratic interpolation is used, the same table takes only one page having entries at one-degree intervals with functions of the first and second differences being recorded together with the sine itself.

Acton goes on to mention the advantage of condensing shelf space by a factor of 500. We no longer care about saving shelf space, but we may care very much about saving memory in an embedded device.

Quadratic interpolation does allow more compression than linear interpolation, but not by a factor of 500. I admire Acton’s numerical methods book, but I’m afraid he got this one wrong.

Interpolation error bound

In order to test Acton’s claim we will need the following theorem on interpolation error [2].

Let f be a function so that f^(n+1) is continuous on [a, b] and satisfies |f^(n+1)(x)| ≤ M. Let p be the polynomial of degree ≤ n that interpolates f at n + 1 equally spaced nodes in [a, b], including the end points. Then on [a, b],

|f(x) - p(x)| \leq \frac{1}{4(n+1)} M \left(\frac{b-a}{n}\right)^{n+1}

Quadratic interpolation error

Acton claims that quadratic interpolation at intervals of one degree is adequate to produce eight decimal places of accuracy. Quadratic interpolation means n = 2.

We have our function tabulated at evenly spaced points a distance h = π/180 radians apart. Quadratic interpolation requires function values at three points, so b − a = 2h = π/90. The third derivative of sine is negative cosine, which is bounded by 1 in absolute value, so M = 1.

This gives an error bound of 4.43 × 10^−7, which corresponds to slightly better than six decimal place accuracy, not eight.

Linear interpolation error

Suppose we wanted to create a table of sine values so that linear interpolation would give results accurate to eight decimal places.
In the interpolation error formula we have M = 1 as before, and now n = 1. We would need to tabulate sine at enough points that h = b − a is small enough that the error is less than 5 × 10^−9. It follows that h = 0.0002 radians. Covering a range of π/2 radians in increments of 0.0002 radians would require 7854 function values. Acton implicitly assumes 90 values to a page, so this would take about 87 pages.

Abramowitz and Stegun devotes 32 pages to tabulating sine and cosine at increments of 0.001 radian. This does not always guarantee eight decimal place accuracy using linear interpolation, but it does guarantee at least seven places (more on that here), which is better than a table at one degree increments would deliver using quadratic interpolation. So it would have been more accurate for Acton to say quadratic interpolation reduces the number of pages by a factor of 30 rather than 500.

Cubic interpolation error

If we have a table of sine values at one degree increments, how much accuracy could we get using cubic interpolation? In that case we’d apply the interpolation error theorem with n = 3 and b − a = 3(π/180) = π/60. Then the error bound is 5.8 × 10^−9. This would usually give you eight decimal place accuracy, so perhaps Acton carried out the calculation for cubic interpolation rather than quadratic interpolation.
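
The three error bounds above are easy to check numerically. Here is a quick Python sketch of the bound from the theorem.

    from math import pi

    def error_bound(n, h, M=1):
        """Interpolation error bound for degree n at n+1 equally spaced nodes
        with spacing h, so b - a = n*h and (b - a)/n = h, and |f^(n+1)| <= M."""
        return M / (4 * (n + 1)) * h ** (n + 1)

    h = pi / 180                       # one-degree spacing
    print(error_bound(2, h))           # quadratic: about 4.43e-07
    print(error_bound(3, h))           # cubic:     about 5.8e-09
    print(error_bound(1, 0.0002))      # linear at 0.0002 radian spacing: 5e-09
    print((pi / 2) / 0.0002)           # about 7854 entries for the linear table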


[1] This is what’s known as lossy compression; some information is lost in the compression process. Lossless compression also replaces the original data with a description that can be used to reproduce the data, but in this case the reconstruction process is perfect.

[2] Ward Cheney and David Kincaid. Numerical Methods and Computation. Third edition.

 

Varicode

Varicode is a way of encoding text and control characters into binary using code words of variable length. It was developed as part of the PSK31 protocol for digital communication over amateur radio.

In the spirit of Morse code, it uses short code words for common characters and longer code words for less common characters in the expectation that this will result in shorter encodings.

If you use variable-length words, you’ve got to have some way of knowing when one word ends and the next begins. Varicode solves this problem by using only code words that begin and end with 1 and that do not contain two consecutive zeros. Then 00 is inserted between code words. Since 00 cannot appear inside a code word, these bits unambiguously mark the space between code words.
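
A tiny Python sketch of why the separator works; the code words here are made-up examples that satisfy the rules, not actual Varicode assignments.

    # Made-up code words obeying the rules: start and end with 1, no "00" inside.
    words = ["1", "11", "101", "1011", "10111"]

    stream = "00".join(words)           # 00 separates code words in the bit stream
    print(stream)                       # 10011001010010110010111
    print(stream.split("00") == words)  # True: the 00 markers are unambiguous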

Synchronization

Varicode is self-synchronizing in the sense that if you jump into a stream of bits produced by Varicode, as soon as you see two zeros, you know you’re at a code word boundary and can start reading from there. You’ve lost any bits that were transmitted before you jumped in, but you can parse everything going forward.

ASCII doesn’t have this problem or this robustness. It doesn’t have the problem of determining code word boundaries because every code word is eight bits long. But then to read a stream of ASCII bits you need to know your position mod 8. If you jump into a stream of bits, you don’t know where the next code word boundary will be, though you may be able to infer it by trying 8 possibilities and seeing which produces the most intelligible results.

Regular expression

Another way to state the rules for forming Varicode code words is to say that 1 is a valid code (the code for a space, ASCII 0x20) and that you can form new codes by prefixing 1 or 10 to a code. In terms of regular expressions, this says a Varicode code word matches

    (1|10)*1

Fibonacci numbers

How many code words are there of length n? Well, there are two ways to make a code word that long: you either put a 1 in front of a code word of length n-1, or you put a 10 in front of a code word of length n-2. So the number of code words of length n equals the number of code words of length n-1 plus the number of code words of length n-2. That is, the number of code words satisfies the same recurrence relation as the Fibonacci numbers.

It’s easy to see that there’s only one code word of length one, and only one code word of length two, so the number of code words of each length satisfies the initial conditions for the Fibonacci sequence as well, and so they are the Fibonacci numbers.
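
A brute-force check of this claim, using the regular expression above (a quick sketch):

    import re
    from itertools import product

    code_word = re.compile(r"(?:1|10)*1")

    def count(n):
        """Count the length-n bit strings that are valid code words."""
        return sum(1 for bits in product("01", repeat=n)
                   if code_word.fullmatch("".join(bits)))

    print([count(n) for n in range(1, 10)])    # [1, 1, 2, 3, 5, 8, 13, 21, 34]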

Efficiency

Varicode encodes a lot more than lower case letters—it encodes most ASCII characters—and so it would take some work to discover the relative frequencies of the characters, and these frequencies would depend on where Varicode is used. As far as I know Varicode is used primarily (only?) in PSK31, and so the relevant frequencies would be the frequencies of characters in messages sent over PSK31, not English text more generally.

You can find the code words of each letter here.

To make things easier, let’s suppose messages are limited to lower case letters and spaces, and that the letters follow the same distribution as in English in general.

We can use the letter frequencies here, except these don’t take spaces into account. If we assume words are about 5 letters long, then the probability of a character being a space is 1/6 and the probabilities of the other characters conditional on not being a space are given by the table. This means the letter probabilities need to be multiplied by 5/6.

This gives us an expected length of 3.89 bits per letter, which is effectively 5.89 bits when you consider the 00 pattern we have to add between letters.

You could represent the 26 letters and a few more characters using 5 bits, but the result would not be self-synchronizing. One way of looking at this is to say that the compression provided by variable-length encoding nearly pays for the overhead required to make the code self-synchronizing.


Naive compression of genetic data

There are special compression algorithms for genetic sequence data, but I was curious how well simply zipping a text file would work.

I downloaded a 14 MB text file containing DNA sequence data from a fruit fly and compressed it as a zip file and as a 7z file. The result was about 3.5 MB, which is basically no compression beyond the obvious.

The file contains millions of A’s, C’s, G’s, and T’s and nothing more [0]. (I prepared the file by removing comments and line breaks.)

    AAAATCAATATGTTGCCATT…

There are only four possible characters, so each carries two bits of information [1], but is encoded in an 8-bit ASCII character. The most obvious encoding would pack four letters into a byte. That would compress a 14 MB text file down to a 3.5 MB binary file.
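
The obvious packing is easy to sketch in Python. This assumes the file really does contain only the four letters, which, as the update below explains, isn’t quite true.

    def pack(seq):
        """Pack a string of A/C/G/T into bytes, four bases per byte."""
        code = {"A": 0, "C": 1, "G": 2, "T": 3}
        out = bytearray()
        for i in range(0, len(seq), 4):
            byte = 0
            for base in seq[i:i + 4]:
                byte = (byte << 2) | code[base]
            out.append(byte)            # a final partial group is left unpadded
        return bytes(out)

    print(len(pack("AAAATCAATATGTTGCCATT")))    # 20 bases pack into 5 bytes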

Here are the exact numbers.

|----------+----------+-------|
| file     |     size | ratio |
|----------+----------+-------|
| original | 14401122 | 1.000 |
| zip      |  3875361 | 3.716 |
| 7z       |  3419294 | 4.212 |
|----------+----------+-------|

So the 7z format does a little better than simply packing groups of four letters into a byte, but the zip format does not.

There is repetition in genome sequences, but apparently generic compression software is unable to exploit this repetition. You can do much better with compression algorithms designed for genetic data.

Update

The plot thickens. Imran Haque kindly let me know that I overlooked the N’s in the data. These are placeholders for unresolved bases. You can’t simply encode each base using two bits because you need five states.

The number of N’s is small—at least in this example, though I imagine they would be more common in lower quality data—and so the calculation in footnote [1] is still approximately correct [2]. There are about two bits of information in each base, but this is on average. Shannon would say you can expect to compress your text file by at least 75%. But you can’t simply represent each base with two bits as I suggested because you need to make room for the possibility of a null base.

Note that the total information of a file is not the number of symbols times the information per symbol, unless all the symbols are independent. In the case of genetic data, the symbols are not independent, and so more compression is possible.

***

[0] Or so I thought. See the Update section.

[1] The letter frequencies are not exactly equal, but close: 26.75% A, 23.67% C, 23.94% G, and 25.58% T. The Shannon entropy is 1.998, so essentially two bits.
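
A quick check of that entropy figure in Python:

    from math import log2
    freqs = [0.2675, 0.2367, 0.2394, 0.2558]
    print(-sum(p * log2(p) for p in freqs))     # about 1.998 bits per base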

[2] N’s account for 0.04% of the data. Accounting for the N’s increases the Shannon entropy to 2.002.