Assumed technologies

I just had a client ship me a laptop. We never discussed what OS the computer would run. I haven’t opened the box yet, but I imagine it’s running Windows 10.

I’ve had clients assume I run Windows, but also others who assume I run Linux or Mac. I don’t recall anyone asking me whether I used a particular operating system.

When clients say “I’ll send you some code” without specifying the language, it’s usually Python. Maybe they’ve seen Python code I’ve written, but my impression is that they assume everyone knows Python.

Other tools some will assume everyone uses:

  • Bash
  • Git and GitHub
  • Skype
  • MS Office
  • Signal
  • Google apps
  • Adobe Photoshop and Illustrator
  • Markdown
  • Jupyter notebooks

When I was in college, nearly every computer I saw ran Unix or Mac. I had no idea that the vast majority of the world’s computers ran Windows. I was in a bubble. And like most people in a bubble, I didn’t know I was in a bubble. I wish I had seen something like this blog post.

Digital signatures with oil and vinegar

“Unbalanced oil and vinegar” is a colorful name for a cryptographic signature method. This post will give a high-level description of the method and explain where the name comes from.

The RSA encryption algorithm depends on the fact that computers can easily multiply enormous numbers, but they cannot efficiently factor the product of two enormous primes. Whenever you have something that’s easy to do but hard to undo, you might be able to make an encryption algorithm out of it.

The unbalanced oil and vinegar (UOV) digital signature algorithm is analogous to RSA in that it also depends on the difficulty of factoring. But UOV is based on the difficulty of factoring the composition of a linear and a nonlinear operator, not the product of two primes. One advantage of UOV over RSA is that UOV is quantum-resistant. That is, if large quantum computers become practical, UOV signatures will remain hard to forge (or so it is currently believed) whereas RSA signatures would be easy to forge.

Solving large systems of multivariate polynomial equations over finite fields is hard, provably NP-hard, unless there’s some special structure that makes things easier. Several proposed post-quantum digital signature algorithms are based on this, such as the LUOV variant on UOV.

The idea behind UOV is to create systems of equations that have a special structure, with some “oil” variables and some “vinegar” variables, so named because they do not mix, or rather mix in a very simple, convenient way. This special structure is kept secret, and is obscured by composition with an invertible linear operator. This operator acts like a blender, thoroughly mixing the oil and vinegar. The term “unbalanced” refers to the fact that the scheme is more secure if you do not have equal numbers of “oil” and “vinegar” variables.

Polynomials over finite fields. Polynomials over finite fields everywhere!

Someone wanting to sign a file with the UOV algorithm knows the oil-and-vinegar structure and produces a vector that is mapped to a specified value, inverting the composition of the linear operator and the polynomials. They can do this because they know the factorization into this special structure. Someone wanting to verify a UOV signature only knows the (apparently unstructured) composition. They just see a large system of multivariate polynomial equations. They can stick a signature in and verify that the output is what it’s supposed to be, but they couldn’t produce a signature because they can’t invert the system. [1]
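
To make the convenient mixing concrete, here's a toy sketch in Python using sympy. The polynomial and its coefficients are made up for illustration; the point is that an oil-and-vinegar polynomial may contain vinegar×vinegar and vinegar×oil terms but no oil×oil terms, so once the vinegar variables are assigned values, what remains is linear in the oil variables and therefore easy to solve.

    from sympy import symbols, expand

    o1, o2 = symbols('o1 o2')         # oil variables
    v1, v2, v3 = symbols('v1 v2 v3')  # vinegar variables

    # vinegar*vinegar and vinegar*oil terms only; no oil*oil terms
    f = 3*v1*v2 + 5*v2*v3 + 2*v1*o1 + 7*v3*o2 + v2 + o1

    # Fix the vinegar variables and the result is linear in the oils.
    print(expand(f.subs({v1: 4, v2: 1, v3: 6})))  # 9*o1 + 42*o2 + 43

A signer chooses vinegar values at random, solves the resulting linear system for the oil variables, and tries again with fresh vinegar values if the system happens to be singular.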

How large do these systems of polynomials need to be? On the order of a hundred equations and variables, though with more variables than polynomials. Not that large compared to linear systems, where one can efficiently solve systems with millions of equations and variables. And the polynomials are only quadratic. So in one sense the systems are small. But it takes hundreds of kilobytes [2] to describe such systems, which makes the public keys for UOV large relative to currently popular digital signature algorithms such as ECDSA. The signatures produced by UOV are small, but the public keys are large.

[1] The system is not invertible in the sense of being one-to-one because it’s underdetermined. By inverting the system we mean producing any input that maps to the desired output. This solution is not generally unique.

[2] Representing m quadratic polynomials in n variables over a field of size b bits requires bmn²/2 bits. So 80 quadratic polynomials in 120 variables over GF(2⁸) would require 8 × 80 × 120²/2 = 4,608,000 bits = 576 kilobytes. The LUOV variation on UOV mentioned above reduces the key sizes quite a bit, but it still requires larger public keys than ECDSA.
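
A quick sanity check on that arithmetic:

    >>> b, m, n = 8, 80, 120
    >>> bits = b * m * n**2 // 2
    >>> bits, bits // 8 // 1000   # total bits, and kilobytes
    (4608000, 576)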

Why isn’t CPU time more valuable?

Here’s something I find puzzling: why isn’t CPU time more valuable?

I first thought about this when I was working for MD Anderson Cancer Center, maybe around 2002. Our research in adaptive clinical trial methods required bursts of CPU time. We might need hundreds of hours of CPU time for a simulation, then nothing while we figured out what to do next, then another hundred hours to run a modification.

We were always looking for CPU resources, and we installed Condor to take advantage of idle PCs, something like the SETI@home or GIMPS projects. Then we sometimes had CPU power to spare. What could we do between simulations that was worthwhile but not urgent? We didn’t come up with anything.

Fast forward to 2019. You can rent CPU time from Amazon for about 2.5 cents per hour. To put it another way, it’s about 300 times cheaper per hour to rent a CPU than to hire a minimum wage employee in the US. Surely it should be possible to think of something for a computer to do that produces more than 2.5 cents per CPU hour of value. But is it?

Well, there’s cryptocurrency mining. How profitable is that? The answer depends on many factors: which currency you’re mining and its value at the moment, what equipment you’re using, what you’re paying for electricity, etc. I did a quick search, and one person said he sees a 30 to 50% return on investment. I suspect that’s high, but we’ll suppose for the sake of argument there’s a 50% ROI [1]. That means you can make a profit of 30 cents per CPU day.
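
Spelling out that arithmetic:

    >>> rate = 2.5               # cents per CPU hour
    >>> cost_per_day = 24 * rate
    >>> 0.5 * cost_per_day       # profit per day at a 50% ROI
    30.0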

Can we not think of anything for a CPU to do for a day that returns more than 30 cents profit?! That’s mind-boggling for someone who can remember when access to CPU power was a bottleneck.

Sometimes computer time is very valuable. But the value of surplus computer time is negligible. I suppose it all has to do with bottlenecks. As soon as CPU time isn’t the bottleneck, its value plummets.

Update: According to the latest episode of the Security Now podcast, it has become unprofitable for hackers to steal CPU cycles in your browser for crypto mining, primarily because of a change in Monero. Even free cycles aren’t worth using for mining! Mining is only profitable on custom hardware.

***

[1] I imagine this person isn’t renting time from Amazon. He probably has his own hardware that he can run less expensively. But that means his profit margins are so thin that it would not be profitable to rent CPUs at 2.5 cents an hour.

Google Adiantum and the ChaCha RNG

The ChaCha cryptographic random number generator is in the news thanks to Google’s Adiantum project. I’ll discuss what’s going on, but first a little background.

Adiantum maidenhair fern

The name of the project comes from a genus of fern. More on that below as well.

One-time pads

The one-time pad is a provably unbreakable way to encrypt things. You create a sheet of random bits and give your counterpart an exact copy. Then when it comes time for you to send an encrypted message, you convert your message to a stream of bits, XOR your message with the random bits you exchanged previously, and send the result. The recipient then takes the XOR of the received message with the pad of random bits, and recovers the original message.

This is called a one-time pad because it’s a pad of bits that you can only use one time. If you reuse a pad, it’s no longer unbreakable.
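
Here's the whole scheme in a few lines of Python, a toy sketch with a made-up message; the pad comes from the operating system's entropy source.

    import secrets

    message = b"Attack at dawn"
    pad = secrets.token_bytes(len(message))  # the shared pad, exchanged in advance

    ciphertext = bytes(m ^ p for m, p in zip(message, pad))    # sender XORs
    recovered = bytes(c ^ p for c, p in zip(ciphertext, pad))  # recipient XORs
    assert recovered == message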

One-time pads are impractical for a couple reasons. First, it’s hard to generate truly random bits, especially in bulk. Second, exchanging the pads is almost as difficult as exchanging messages.

Stream ciphers

So here’s a bright idea: we’ll get around both of the problems with one-time pads by using pseudorandom bits rather than truly random bits! Then both parties can generate the same pad of bits from a short shared key.

Many people have had this idea, and it’s not necessarily a bad one. It’s called a stream cipher. The problem is that most pseudorandom number generators are not up to the task. You need a cryptographically secure RNG, and most RNGs are far from secure. The ChaCha RNG, however, appears to be good enough to use in a stream cipher, given enough rounds of scrambling [1], and Google is using it for full disk encryption in Android devices.
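
Here's what a ChaCha-based stream cipher looks like in practice, a minimal sketch assuming the third-party cryptography package (pip install cryptography); the key, nonce, and message are placeholders I made up.

    import os
    from cryptography.hazmat.backends import default_backend
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms

    key = os.urandom(32)    # 256-bit key shared by both parties
    nonce = os.urandom(16)  # 128-bit counter-and-nonce block

    enc = Cipher(algorithms.ChaCha20(key, nonce), mode=None,
                 backend=default_backend()).encryptor()
    ciphertext = enc.update(b"Attack at dawn")

    # Decryption applies the same keystream again.
    dec = Cipher(algorithms.ChaCha20(key, nonce), mode=None,
                 backend=default_backend()).decryptor()
    assert dec.update(ciphertext) == b"Attack at dawn"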

Full disk encryption

If you forget your password to your computer, you may not be able to access your data, but a thief still could by removing the hard drive and accessing it from another computer. That is, unless the disk is encrypted.

Full disk encryption on a laptop, such as BitLocker on Windows or FileVault on macOS, is usually implemented via AES encryption with hardware acceleration. If you don’t have special hardware for encryption, AES can be too slow.

Adiantum: ChaCha encryption on Android

On low-end devices, ChaCha encryption can be around 5x faster than AES. So Google is using ChaCha for encryption on Android devices, in a construction it calls Adiantum.

You can read the technical details in [2], and you can read more about the ChaCha random number generator in [3].

So where does the name Adiantum come from? Adiantum is a genus of maidenhair ferns, which in the Victorian language of flowers symbolize sincerity and discretion.

[1] Adiantum uses ChaCha with 12 rounds. TLS 1.3 uses ChaCha with 20 rounds.

[2] Adiantum: length-preserving encryption for entry-level processors by Google employees Paul Crowley and Eric Biggers.

[3] IRTF RFC 8439: ChaCha20 and Poly1305 for IETF Protocols

Hash function menagerie

Here’s an oversimplified survey of cryptographic hash functions: Everyone used to use MD5, now they use some variation on SHA.

There’s some truth to that. MD5 was very popular, and remains popular years after it was proven insecure. And now variations on SHA like SHA-1 and SHA-256 are commonly used. But there are a lot more cryptographic hash functions in common use.

Reversing an MD5 hash

The MD5 hashing algorithm was once considered a secure cryptographic hash, but those days are long gone [1]. For a given hash value, it doesn’t take much computing power to create a document with the same hash.

Hash functions are not reversible in general. MD5 is a 128-bit hash, and so it maps any string, no matter how long, into 128 bits. Obviously if you hash all strings of length, say, 129 bits, some of them have to hash to the same value. (Another win for the pigeonhole principle.)

And yet in practice it may be possible to reverse a hash, given some context. In the context of short, weak passwords, it may be possible to determine what text was hashed to create a particular value. All it may take is a quick web search [2]. For example, let’s take an insecure password like “password” and run it through a bit of Python to compute its MD5 hash.

>>> import hashlib
>>> def foo(x):
...     print(hashlib.md5(x.encode('utf-8')).hexdigest())
...
>>> foo("password")
5f4dcc3b5aa765d61d8327deb882cf99

A Google search returns over 21,000 hits on 5f4dcc3b5aa765d61d8327deb882cf99, and the second one shows that it’s the hash of “password”.

If I try a slightly less naive password like “p@$$word” I still get several hits, indicating that the hash is part of a list of compromised passwords.

Not every hash of a short string can be reversed this way. If I hash my business phone number, for example, I get something that does not yet appear in Google searches. The hash could still be reversed easily, but it would take more than just a Google search.

See the next post for how salting can thwart web search attacks.

[1] MD5 was invented in 1992 and the first flaw was discovered in 1996. Experts started moving away from MD5 at that time. More flaws were discovered over time. In 2010 the CMU Software Engineering Institute declared that MD5 was “cryptographically broken and unsuitable for further use.” It’s still being used, though not as much. MD5 still makes a useful checksum, though it’s not cryptographically secure.

[2] The same would be true of a secure hash of an insecure password. For example, SHA-256 is better than MD5, but you could look up the SHA-256 hash values of terrible passwords too. But MD5 hashes are easier to search on. In my experiments, I got far fewer hits searching on SHA-256 hash values.

If you’re trying to reverse hash values on your own computer without a web search, the MD5 value would require much less computation than the SHA-256 value.
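
For example, here’s the SHA-256 hash of “password”; it’s just as easy to compute, though searching on it turns up fewer hits than its MD5 counterpart did.

>>> import hashlib
>>> hashlib.sha256("password".encode('utf-8')).hexdigest()
'5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8'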

Projecting Unicode to ASCII

Sometimes you need to downgrade Unicode text to more restricted ASCII text. For example, while working on my previous post, I was surprised that there didn’t appear to be an asteroid named after Poincaré. There is one, but it was listed as Poincare in my list of asteroid names.

Python module

I used the Python module unidecode to convert names to ASCII before searching, and that fixed the problem. Here’s a small example showing how the code works.

    import unidecode

    for x in ["Poincaré", "Gödel"]:
        print(x, unidecode.unidecode(x))

This produces

    Poincaré Poincare
    Gödel Godel

Installing the unidecode module also installs a command line utility by the same name. So you could, for example, pipe text to that utility.
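
Something along these lines; I haven’t verified the utility’s exact options, so treat it as an assumption that the command reads standard input:

    $ echo "Poincaré" | unidecode
    Poincare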

As someone pointed out on Hacker News, this isn’t so impressive for Western languages, but if you need to project Arabic, Russian or Chinese, unidecode is close to black magic:

>>> from unidecode import unidecode
>>> unidecode("北亰")
'Bei Jing '

(Someone has said in the comments that 北亰 is a typo and should be 北京. I can’t say whether this is right, but I can say that unidecode transliterates both to “Bei Jing.”)

Projections

I titled this post “Projecting Unicode to ASCII” because this code is a projection in the mathematical sense. A projection is a function P such that for all inputs x,

P(P(x)) = P(x).

That is, applying the function twice does the same thing as applying the function once. The name comes from projection in the colloquial sense, such as projecting a three dimensional object onto a two dimensional plane. An equivalent term is to say P is idempotent. [1]

The unidecode function maps the full range of Unicode characters into the range 0x00 to 0x7F, and if you apply it to a character already in that range, the function leaves it unchanged. So the function is a projection, or you could say the function is idempotent.
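
You can check the projection property empirically: applying unidecode a second time leaves its own output unchanged.

    from unidecode import unidecode

    for x in ["Poincaré", "Gödel", "北亰"]:
        once = unidecode(x)
        assert unidecode(once) == once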

Projection is such a simple condition that it hardly seems worth giving it a name. And yet it is extremely useful. A general principle in user interface design is to make something a projection if the user expects it to be a projection. Users probably don’t have the vocabulary to say “I expected this to be a projection” but they’ll be frustrated if something is almost a projection but not quite.

For example, if software has a button to convert an image from color to grayscale, it would be surprising if (accidentally) clicking the button a second time had any effect. It would be unexpected if it returned the original color image, and it would be even more unexpected if it did something else, such as keeping the image in grayscale but lowering the resolution.

[1] The term “idempotent” may be used more generally than “projection,” the latter being more common in linear algebra. Some people may think of a projection as a linear idempotent function. We’re not exactly doing linear algebra here, but people do think of portions of Unicode geometrically, speaking of “planes.”

Sine of a googol

How do you evaluate the sine of a large number in floating point arithmetic? What does the result even mean?

Sine of a trillion

Let’s start by finding the sine of a trillion (10¹²) using floating point arithmetic. There are a couple ways to think about this. A floating point number carries only 15 or 16 significant decimal figures (to be precise, 53 bits [1]), and so you could think of t = 1.0e12 as a representative of the interval of numbers that all share its floating point representation. Any result that is the sine of a number in that interval should be considered correct.

Another way to look at this is to say that t can be represented exactly—its binary representation requires 40 bits and we have 53 bits of significand precision available—and so we expect sin(t) to return the sine of exactly one trillion, correct to full precision.

It turns out that IEEE arithmetic does the latter, computing sin(1e12) correctly to 16 digits.

Here’s the result in Python

    >>> sin(1.0e12)
    -0.6112387023768895

and verified by Mathematica by computing the value to 20 decimal places

    In:= N[Sin[10^12], 20]
    Out= -0.61123870237688949819

Range reduction

Note that the result above is not what you’d get if you were first to take the remainder when dividing by 2π and then taking the sine.

    >>> sin(1.0e12 % (2*pi))
    -0.6112078499756778

This makes sense. The result of dividing a trillion by the floating point representation of 2π, 159154943091.89536, is correct to full floating point precision. But most of that precision is in the integer part. The fractional part has only about five digits of precision, and so we should expect the result above to be correct to at most five digits. In fact it’s accurate to four digits.

When your processor computes sin(1e12) it does not naively take the remainder by 2π and then compute the sine. If it did, you’d get four significant figures rather than 16.

We started out by saying that there were two ways of looking at the problem, and according to the first one, returning only four significant figures would be quite reasonable. If we think of the value t as a measured quantity, measured to the precision of floating point arithmetic (though hardly anything can be measured that accurately), then four significant figures would be all we should expect. But the people who designed the sine function implementation on your processor did more than they might be expected to do, finding the sine of a trillion to full precision. They do this using a range reduction algorithm that retains far more precision than naively doing a division. (I worked on this problem a long time ago.)

Sine of a googol?

What if we try to take the sine of a ridiculously large number like a googol (10¹⁰⁰)? This won’t work because a googol cannot be represented exactly as a floating point number (i.e. as an IEEE 754 double). It’s not too big; floating point numbers can be as large as around 10³⁰⁸. The problem is that a number that big cannot be represented to full precision. But we can represent 2³³³ exactly, and a googol is between 2³³² and 2³³³. And amazingly, Python’s sine function (or rather the sine function built into the AMD processor on my computer) returns the correct result to full precision.

    >>> sin(2**333)
    0.9731320373846353

How do we know this is right? I verified it in Mathematica:

    In:= Sin[2.0^333]
    Out= 0.9731320373846353

How do we know Mathematica is right? We can do naive range reduction using extended precision, say 200 decimal places.

    In:= x = 2^333
    In:= p = N[Pi, 200]
    In:= theta = x - IntegerPart[x/(2 p)] 2 p
    Out= 1.8031286178334699559384196689…

and verify that the sine of the range reduced value is correct.

    In:= Sin[theta]
    Out= 0.97313203738463526…

Interval arithmetic interpretation

Because floating point numbers have 53 bits of precision, all real numbers between 2⁵⁶ − 2² and 2⁵⁶ + 2² are represented as the floating point number 2⁵⁶. This is a range of width 8, wider than 2π, and so the sines of the numbers in this range cover the possible values of sine, i.e. [-1, 1]. So if you think of a floating point number as a sample from the set of all real numbers with the same binary representation, every possible value of sine is a valid return value for numbers bigger than 2⁵⁶. But if you think of a floating point number as representing itself exactly, then it makes sense to ask for the sine of numbers like 2³³³ and larger, up to the limits of floating point representation [2].
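
You can watch nearby reals collapse onto the same double. Just above 2⁵⁶ the spacing between consecutive doubles is 16:

    >>> x = 2.0**56
    >>> x + 2 == x     # 2 is less than half the spacing, so it rounds away
    True
    >>> x + 16 == x    # 2**56 + 16 is the next representable double
    False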

[1] An IEEE 754 double has 52 significand bits, but these bits can represent 53 bits of precision because the leading bit of a normalized significand is always 1, and so it does not need to be stored.

[2] The largest exponent of an IEEE double is 1023, and the largest significand is 2 − 2⁻⁵² (i.e. all 1’s), so the largest possible value of a double is

(2⁵³ − 1) 2¹⁰²⁴⁻⁵³

and in fact the Python expression sin((2**53 - 1)*2**(1024-53)) returns the correct value to 16 significant figures.

Sequence alignment

In my previous post I illustrated the Levenshtein edit distance by comparing the opening paragraphs of Finnegans Wake by James Joyce and a parody by Adam Roberts.

In this post I’ll show how to align two sequences using the sequence alignment algorithms of Needleman-Wunsch and Hirschberg. These algorithms can be used to compare any sequences, though they are more often used to compare DNA sequences than impenetrable novels and parodies.

I’ll be using Gang Li’s implementation of these algorithms, available on GitHub. Hirschberg’s algorithm is supposed to be a more space-efficient implementation of the Needleman-Wunsch algorithm that produces the same results, though the two implementations below produced slightly different results on my examples.

Li’s alignment code uses lists of characters for input and output. I wrote a simple wrapper to take in strings and output strings.

    from alignment import Needleman, Hirschberg

    def compare(str1, str2):
        seq1 = list(str1)
        seq2 = list(str2)
        for algorithm in [Needleman(), Hirschberg()]:
            a, b = algorithm.align(seq1, seq2)
            print("".join(a))
            print("".join(b))
            print()
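
Before feeding it Joyce, you can sanity-check the wrapper on a short pair of strings (toy inputs of my choosing):

    compare("kitten", "sitting")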

The code inserts vertical bars to indicate spaces added for alignment. Here’s the result of using the Needleman-Wunsch algorithm on the opening paragraphs of Finnegans Wake and the parody Finnegans Ewok.

    |||riverrun, past Ev|e| and Adam'||||s,
    mov|i|er|un, past ||new and |||||hopes,

    from swe|rv||e of shore|||| to bend of
    from s||tr|ike of |||||back to bend of

    b|||ay, brings us by a commodius
    |jeday, brings us by a commodius

    vic|u||s of recirculation back to
    |||lucas of recirculation back to

    H|owth Ca||stle|||| and E|nvi||r|ons.
    |fo||||||rest||moon and |en||dor.||||

I mentioned in my previous post that I could compare the first four paragraphs easily, but I had some trouble aligning the fifth paragraphs. The fifth paragraphs of each version start out quite similar:

    Bygme||ster Fi|nnega||n, of the 
    Bygm|onster ||Ann||akin, of the 
    
    Stutte||r|||||||ing Hand, f|re|emen'|s
    ||||||Throatchokin| Hand, for|cemen|’s
    
    mau-rer, lived in the broadest way
    mau-rer, lived in the broadest way
    
    immarginable in his rushlit
    immarginable in his rushlit
    
    toofar-|||back for messuages before
    toofar| — back for messuages before 

but then Roberts takes the liberty of skipping over a large section of the original. This is what I suspected by looking at the two texts, but Hirschberg’s algorithm makes the edits obvious by showing two long sequences of vertical bars, one about 600 characters long and another about 90 characters long.

Levenshtein distance from Finnegans Wake to Return of the Jedi

I ran into a delightfully strange blog post today called Finnegans Ewok that edits the first few paragraphs of Finnegans Wake to make it into something like Return of the Jedi.

The author, Adam Roberts, said via Twitter “What I found interesting here was how little I had to change Joyce’s original text. Tweak a couple of names and basically leave it otherwise as was.”

So what I wanted to do is quantify just how much had to change using the Levenshtein distance, which is essentially the minimum number of one-character insertions, deletions, and substitutions necessary to transform one string into another.
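
Here’s a minimal dynamic-programming sketch of such a distance function (my code, for illustration; any standard implementation would do). It keeps one row of the distance table at a time.

    def levenshtein(a, b):
        # dist[j] holds the edit distance between the current prefix
        # of a and the first j characters of b.
        dist = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            prev, dist[0] = dist[0], i
            for j, cb in enumerate(b, start=1):
                prev, dist[j] = dist[j], min(
                    dist[j] + 1,         # delete ca
                    dist[j - 1] + 1,     # insert cb
                    prev + (ca != cb),   # substitute ca -> cb
                )
        return dist[-1]

    print(levenshtein("kitten", "sitting"))  # 3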

Here’s the first paragraph from James Joyce:

riverrun, past Eve and Adam’s, from swerve of shore to bend of bay, brings us by a commodius vicus of recirculation back to Howth Castle and Environs.

And here’s the first paragraph from Adam Roberts:

movierun, past new and hopes, from strike of back to bend of jeday, brings us by a commodius lucas of recirculation back to forestmoon and endor.

The original paragraph is 150 characters, the parody is 145 characters, and the Levenshtein distance is 44.

Here’s a summary of the results for the first four paragraphs.

    |-------+---------+----------|
    | Joyce | Roberts | Distance |
    |-------+---------+----------|
    |   150 |     145 |       44 |
    |   700 |     727 |      119 |
    |   594 |     615 |      145 |
    |  1053 |     986 |      333 |
    |-------+---------+----------|

The fifth paragraph seems to diverge more from Joyce. I may have gotten something misaligned, and reading enough of Finnegans Wake to debug the problem made my head hurt, so I stopped.

Update: See the next post for sequence alignment applied to the two sources. This lets you see not just the number of edits but what the edits are. It shows why I was having difficulty aligning the fifth paragraphs.
