Hash function menagerie

Here’s an oversimplified survey of cryptographic hash functions: Everyone used to use MD5, now they use some variation on SHA.

There’s some truth to that. MD5 was very popular, and remains popular years after it was proven insecure. And now variations on SHA like SHA1 and SHA256 are commonly used. But there are a lot more cryptographic hash functions in common use.

Reversing an MD5 hash

The MD5 hashing algorithm was once considered a secure cryptographic hash, but those days are long gone [1]. It doesn’t take much computing power to create two different documents that hash to the same value.

Hash functions are not reversible in general. MD5 is a 128-bit hash, and so it maps any string, no matter how long, into 128 bits. Obviously if you run all strings of length, say, 129 bits, some of them have to hash to the same value. (Another win for the pigeon hole principle.)

And yet in practice it may be possible to reverse a hash, given some context. In the context of short, weak passwords, it may be possible to determine what text was hashed to create a particular value. All it may take is a quick web search [2]. For example, let’s take an insecure password like “password” and run it through a bit of Python to compute its MD5 hash.

>>> import hashlib
>>> def foo(x): 
...     print(hashlib.md5(x.encode('utf-8')).hexdigest())
>>> foo("password")
5f4dcc3b5aa765d61d8327deb882cf99

A Google search returns over 21,000 hits on 5f4dcc3b5aa765d61d8327deb882cf99, and the second one shows that it’s the hash of “password”.

If I try a slightly less naive password like “p@$$word” I still get several hits, indicating that the hash is part of a list of compromised passwords.

Not every hash of a short string can be reversed this way. If I hash my business phone number, for example, I get something that does not yet appear in Google searches. The hash could still be reversed easily, but it would take more than just a Google search.
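To make that concrete, here is a minimal sketch of what such a brute force search might look like for a 10-digit number. The target hash below is a placeholder, and a loop like this is slow in pure Python; the point is only that 10^10 candidates is a small search space by modern standards.

    import hashlib
    from itertools import product

    def reverse_md5_phone(target_hex):
        # Try every 10-digit string until one hashes to the target value.
        for digits in product("0123456789", repeat=10):
            candidate = "".join(digits)
            if hashlib.md5(candidate.encode()).hexdigest() == target_hex:
                return candidate
        return None

    # reverse_md5_phone("<hex digest of the hashed phone number>")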

See the next post for how salting can thwart web search attacks.

[1] MD5 was invented in 1992 and the first flaw was discovered in 1996. Experts started moving away from MD5 at that time. More flaws were discovered over time. In 2010 the CMU Software Engineering Institute declared that MD5 was “cryptographically broken and unsuitable for further use.” It’s still being used, though not as much. MD5 still makes a useful checksum, though it’s not cryptographically secure.

[2] The same would be true of a secure hash of an insecure password. For example, SHA-256 is better than MD5, but you could look up the SHA-256 hash values of terrible passwords too. But MD5 hashes are easier to search on. In my experiments, I got far fewer hits searching on SHA-256 hash values.
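For example, here is the same experiment with both hash functions. Only the MD5 value is quoted, since it appears above; you could search on the SHA-256 hex string in exactly the same way.

    import hashlib

    pw = "password".encode("utf-8")
    print(hashlib.md5(pw).hexdigest())     # 5f4dcc3b5aa765d61d8327deb882cf99
    print(hashlib.sha256(pw).hexdigest())  # 64 hex characters; search on this string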

If you’re trying to reverse hash values on your own computer without a web search, the MD5 value would require much less computation than the SHA-256 value.

Projecting Unicode to ASCII

Sometimes you need to downgrade Unicode text to more restricted ASCII text. For example, while working on my previous post, I was surprised that there didn’t appear to be an asteroid named after Poincaré. There is one, but it was listed as Poincare in my list of asteroid names.

Python module

I used the Python module unidecode to convert names to ASCII before searching, and that fixed the problem. Here’s a small example showing how the code works.

    import unidecode

    for x in ["Poincaré", "Gödel"]:
        print(x, unidecode.unidecode(x))

This produces

    Poincaré Poincare
    Gödel Godel

Installing the unidecode module also installs a command line utility by the same name. So you could, for example, pipe text to that utility.
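For example, something like the following should work, assuming the utility is on your path. (Check unidecode --help for the exact options; this is just a sketch of the idea.)

    $ echo "Poincaré" | unidecode
    Poincare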

As someone pointed out on Hacker News, this isn’t so impressive for Western languages, but if you need to project Arabic, Russian, or Chinese, unidecode is close to black magic:

>>> from unidecode import unidecode
>>> unidecode("北亰")
'Bei Jing '

(Someone has said in the comments that 北亰 is a typo and should be 北京. I can’t say whether this is right, but I can say that unidecode transliterates both to “Bei Jing.”)

Projections

I titled this post “Projecting Unicode to ASCII” because this code is a projection in the mathematical sense. A projection is a function P such that for all inputs x,

P(P(x)) = P(x).

That is, applying the function twice does the same thing as applying the function once. The name comes from projection in the colloquial sense, such as projecting a three dimensional object onto a two dimensional plane. An equivalent term is to say P is idempotent. [1]

The unidecode function maps the full range of Unicode characters into the range 0x00 to 0x7F, and if you apply it to a character already in that range, the function leaves it unchanged. So the function is a projection, or you could say the function is idempotent.
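Here is a quick check of the projection property using the same module:

    from unidecode import unidecode

    for x in ["Poincaré", "Gödel", "北亰"]:
        once = unidecode(x)
        twice = unidecode(once)
        print(once == twice)   # True: applying unidecode a second time changes nothing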

Projection is such a simple condition that it hardly seems worth giving it a name. And yet it is extremely useful. A general principle in user interface design is to make something a projection if the user expects it to be a projection. Users probably don’t have the vocabulary to say “I expected this to be a projection” but they’ll be frustrated if something is almost a projection but not quite.

For example, if software has a button to convert an image from color to grayscale, it would be surprising if (accidentally) clicking the button a second time had any effect. It would be unexpected if it returned the original color image, and it would be even more unexpected if it did something else, such as keeping the image in grayscale but lowering the resolution.

[1] The term “idempotent” may be used more generally than “projection,” the latter being more common in linear algebra. Some people may think of a projection as a linear idempotent function. We’re not exactly doing linear algebra here, but people do think of portions of Unicode geometrically, speaking of “planes.”

Sine of a googol

How do you evaluate the sine of a large number in floating point arithmetic? What does the result even mean?

Sine of a trillion

Let’s start by finding the sine of a trillion (10^12) using floating point arithmetic. There are a couple ways to think about this. The floating point number t = 1.0e12 can only be represented to 15 or 16 significant decimal figures (to be precise, 53 bits [1]), and so you could think of it as a representative of the interval of numbers that all have the same floating point representation. Any result that is the sine of a number in that interval should be considered correct.

Another way to look at this is to say that t can be represented exactly—its binary representation requires 40 bits and we have 53 bits of significand precision available—and so we expect sin(t) to return the sine of exactly one trillion, correct to full precision.

It turns out that IEEE arithmetic does the latter, computing sin(1e12) correctly to 16 digits.

Here’s the result in Python

    >>> sin(1.0e12)
    -0.6112387023768895

and verified by Mathematica by computing the value to 20 decimal places

    In:= N[Sin[10^12], 20]
    Out= -0.61123870237688949819
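If you would rather check this in Python than in Mathematica, one option is the mpmath library (assuming it is installed), which computes with as many digits as you ask for:

    from mpmath import mp, mpf, sin

    mp.dps = 30                # work with 30 significant decimal digits
    print(sin(mpf(10)**12))    # -0.611238702376889498..., matching the value above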

Range reduction

Note that the result above is not what you’d get if you first took the remainder after dividing by 2π and then took the sine.

    >>> sin(1.0e12 % (2*pi))
    -0.6112078499756778

This makes sense. The result of dividing a trillion by the floating point representation of 2π, 159154943091.89536, is correct to full floating point precision. But most of that precision is in the integer part. The fractional part only has about five digits of precision, and so we should expect the result above to be correct to at most five digits. In fact it’s accurate to four digits.

When your processor computes sin(1e12) it does not naively take the remainder by 2π and then compute the sine. If it did, you’d get four significant figures rather than 16.

We started out by saying that there were two ways of looking at the problem, and according to the first one, returning only four significant figures would be quite reasonable. If we think of the value t as a measured quantity, measured to the precision of floating point arithmetic (though hardly anything can be measured that accurately), then four significant figures would be all we should expect. But the people who designed the sine function implementation on your processor did more than they might be expected to do, finding the sine of a trillion to full precision. They do this using a range reduction algorithm that retains far more precision than naively doing a division. (I worked on this problem a long time ago.)

Sine of a googol?

What if we try to take the sine of a ridiculously large number like a googol (10^100)? This won’t work because a googol cannot be represented exactly as a floating point number (i.e. as an IEEE 754 double). It’s not too big; floating point numbers can be as large as around 10^308. The problem is that a number that big cannot be represented to full precision. But we can represent 2^333 exactly, and a googol is between 2^332 and 2^333. And amazingly, Python’s sine function (or rather the sine function built into the AMD processor on my computer) returns the correct result to full precision.

    >>> sin(2**333)
    0.9731320373846353

How do we know this is right? I verified it in Mathematica:

    In:= Sin[2.0^333]
    Out= 0.9731320373846353

How do we know Mathematica is right? We can do naive range reduction using extended precision, say 200 decimal places.

    In:= x = 2^333
    In:= p = N[Pi, 200]
    In:= theta = x - IntegerPart[x/(2 p)] 2 p
    Out= 1.8031286178334699559384196689…

and verify that the sine of the range reduced value is correct.

    In:= Sin[theta]
    Out= 0.97313203738463526…

Interval arithmetic interpretation

Because floating point numbers have 53 bits of precision, all real numbers between 2^56 - 2^2 and 2^56 + 2^2 are represented as the floating point number 2^56. This is a range of width 8, wider than 2π, and so the sines of the numbers in this range cover the possible values of sine, i.e. [-1, 1]. So if you think of a floating point number as a sample from the set of all real numbers with the same binary representation, every possible value of sine is a valid return value for numbers bigger than 2^56. But if you think of a floating point number as representing itself exactly, then it makes sense to ask for the sine of numbers like 2^333 and larger, up to the limits of floating point representation [2].
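A quick way to see this is to take two adjacent floating point numbers above 2^56 and look at their sines. The numbers differ by hundreds of radians, so their sines have nothing to do with each other. (This sketch uses math.nextafter, available in Python 3.9 and later.)

    from math import sin, nextafter, inf

    x = 2.0**60
    y = nextafter(x, inf)    # the next representable double above x
    print(y - x)             # 256.0, far more than 2 pi
    print(sin(x), sin(y))    # two essentially unrelated values in [-1, 1]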

[1] An IEEE 754 double has 52 significand bits, but these bits can represent 53 bits of precision because the leading bit of a normalized significand is always 1, and so it does not need to be stored. See more here.

[2] The largest exponent of an IEEE double is 1023, and the largest significand is 2 - 2^-52 (i.e. all 1’s), so the largest possible value of a double is

(2^53 - 1) 2^(1024-53)

and in fact the Python expression sin((2**53 - 1)*2**(1024-53)) returns the correct value to 16 significant figures.

Sequence alignment

In my previous post I illustrated the Levenshtein edit distance by comparing the opening paragraphs of Finnegans Wake by James Joyce and a parody by Adam Roberts.

In this post I’ll show how to align two sequences using the sequence alignment algorithms of Needleman-Wunsch and Hirschberg. These algorithms can be used to compare any sequences, though they are more often used to compare DNA sequences than impenetrable novels and parodies.

I’ll be using Gang Li’s implementation of these algorithms, available on GitHub. I believe the two algorithms are supposed to produce the same results, with Hirschberg’s algorithm being a more space-efficient implementation of the Needleman-Wunsch algorithm, though the two implementations below produce slightly different results. I’ll give the output of Hirschberg’s algorithm.

Li’s alignment code uses lists of characters for input and output. I wrote a simple wrapper to take in strings and output strings.

    from alignment import Needleman, Hirschberg

    def compare(str1, str2):
        seq1 = list(str1)
        seq2 = list(str2)
        for algorithm in [Needleman(), Hirschberg()]:
            a, b = algorithm.align(seq1, seq2)
            print("".join(a))
            print("".join(b))
            print()

The code inserts vertical bars to indicate spaces added for alignment. Here’s the result of using the Needleman-Wunsch algorithm on the opening paragraphs of Finnegans Wake and the parody Finnegans Ewok.

    |||riverrun, past Ev|e| and Adam'||||s,
    mov|i|er|un, past ||new and |||||hopes,

    from swe|rv||e of shore|||| to bend of
    from s||tr|ike of |||||back to bend of

    b|||ay, brings us by a commodius
    |jeday, brings us by a commodius

    vic|u||s of recirculation back to
    |||lucas of recirculation back to

    H|owth Ca||stle|||| and E|nvi||r|ons.
    |fo||||||rest||moon and |en||dor.||||

I mentioned in my previous post that I could compare the first four paragraphs easily, but I had some trouble aligning the fifth paragraphs. The fifth paragraphs of each version start out quite similar:

    Bygme||ster Fi|nnega||n, of the 
    Bygm|onster ||Ann||akin, of the 
    
    Stutte||r|||||||ing Hand, f|re|emen'|s
    ||||||Throatchokin| Hand, for|cemen|’s
    
    mau-rer, lived in the broadest way
    mau-rer, lived in the broadest way
    
    immarginable in his rushlit
    immarginable in his rushlit
    
    toofar-|||back for messuages before
    toofar| — back for messuages before 

but then Roberts takes the liberty of skipping over a large section of the original. This is what I suspected by looking at the two texts, but Hirschberg’s algorithm makes the edits obvious by showing two long sequences of vertical bars, one about 600 characters long and another about 90 characters long.

Levenshtein distance from Finnegans Wake to Return of the Jedi

I ran into a delightfully strange blog post today called Finnegans Ewok that edits the first few paragraphs of Finnegans Wake to make it into something like Return of the Jedi.

The author, Adam Roberts, said via Twitter “What I found interesting here was how little I had to change Joyce’s original text. Tweak a couple of names and basically leave it otherwise as was.”

So what I wanted to do is quantify just how much had to change using the Levenshtein distance, which is essentially the number of one-character changes necessary to transform one string into another.

Here’s the first paragraph from James Joyce:

riverrun, past Eve and Adam’s, from swerve of shore to bend of bay, brings us by a commodius vicus of recirculation back to Howth Castle and Environs.

And here’s the first paragraph from Adam Roberts:

movierun, past new and hopes, from strike of back to bend of jeday, brings us by a commodius lucas of recirculation back to forestmoon and endor.

The original paragraph is 150 characters, the parody is 145 characters, and the Levenshtein distance is 44.
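For reference, here is a minimal sketch of the textbook dynamic programming algorithm for the Levenshtein distance; any standard edit distance library computes the same thing. (Details such as how the curly quotes are handled can shift the counts slightly.)

    def levenshtein(a, b):
        # Number of single-character insertions, deletions, and substitutions
        # needed to turn string a into string b.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # delete ca
                                curr[j - 1] + 1,             # insert cb
                                prev[j - 1] + (ca != cb)))   # substitute ca -> cb
            prev = curr
        return prev[-1]

    print(levenshtein("kitten", "sitting"))   # 3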

Here’s a summary of the results for the first four paragraphs.

    |-------+---------+----------|
    | Joyce | Roberts | Distance |
    |-------+---------+----------|
    |   150 |     145 |       44 |
    |   700 |     727 |      119 |
    |   594 |     615 |      145 |
    |  1053 |     986 |      333 |
    |-------+---------+----------|

The fifth paragraph seems to diverge more from Joyce. I may have gotten something misaligned, and reading enough of Finnegans Wake to debug the problem made my head hurt, so I stopped.

Update: See the next post for sequence alignment applied to the two sources. This lets you see not just the number of edits but what the edits are. This shows why I was having difficulty aligning the fifth paragraphs.

Comparing bfloat16 range and precision to other 16-bit numbers

Deep learning has spurred interest in novel floating point formats. Algorithms often don’t need as much precision as standard IEEE-754 doubles or even single precision floats. Lower precision makes it possible to hold more numbers in memory, reducing the time spent swapping numbers in and out of memory. Also, low-precision circuits are far less complex. Together these benefits can give a significant speedup.

Here I want to look at bfloat16, or BF16 for short, and compare it to 16-bit number formats I’ve written about previously, IEEE and posit. BF16 is becoming a de facto standard for deep learning. It is supported by several deep learning accelerators (such as Google’s TPU), and will be supported in Intel processors two generations from now.

Bit layout

The BF16 format is sort of a cross between FP16 and FP32, the 16- and 32-bit formats defined in the IEEE 754-2008 standard, also known as half precision and single precision.

BF16 has 16 bits like FP16, but has the same number of exponent bits as FP32. Each number has 1 sign bit. The rest of the bits in each of the formats are allocated as in the table below.

    |--------+------+----------+----------|
    | Format | Bits | Exponent | Fraction |
    |--------+------+----------+----------|
    | FP32   |   32 |        8 |       23 |
    | FP16   |   16 |        5 |       10 |
    | BF16   |   16 |        8 |        7 |
    |--------+------+----------+----------|

BF16 has as many bits as FP16, but as many exponent bits as FP32. The latter makes conversion between BF16 and FP32 simple, except for some edge cases regarding denormalized numbers.
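To illustrate why the conversion is simple, here is a sketch that truncates a float32 to BF16 by keeping only its top 16 bits, then pads with zeros to go back. (A careful implementation would round rather than truncate and would handle NaNs and denormals.)

    import struct

    def f32_to_bf16_bits(x):
        # Keep the sign bit, the 8 exponent bits, and the top 7 fraction bits.
        (u,) = struct.unpack("<I", struct.pack("<f", x))
        return u >> 16

    def bf16_bits_to_f32(bits):
        # Pad the 16 BF16 bits with 16 zero bits to get a float32 back.
        (x,) = struct.unpack("<f", struct.pack("<I", bits << 16))
        return x

    print(bf16_bits_to_f32(f32_to_bf16_bits(3.14159265)))   # 3.140625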

Precision

The epsilon value, the smallest number ε such that 1 + ε > 1 in machine representation, is 2^-e where e is the number of fraction bits. BF16 has much less precision near 1 than the other formats.

    |--------+------------|
    | Format |    Epsilon |
    |--------+------------|
    | FP32   | 0.00000012 |
    | FP16   | 0.00097656 |
    | BF16   | 0.00781250 |
    |--------+------------|
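Here is one way to check these values. NumPy knows the epsilon for FP32 and FP16; BF16 is not a built-in NumPy type, so its epsilon is computed directly from its 7 fraction bits.

    import numpy as np

    print(np.finfo(np.float32).eps)   # 1.1920929e-07
    print(np.finfo(np.float16).eps)   # 0.000977
    print(2.0**-7)                    # 0.0078125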

Dynamic range

The dynamic range of bfloat16 is similar to that of an IEEE single precision number. Relative to FP32, BF16 sacrifices precision to retain range. Range is mostly determined by the number of exponent bits, though not entirely.

Dynamic range in decades is the log base 10 of the ratio of the largest to smallest representable positive numbers. The dynamic ranges of the numeric formats are given below. (Python code to calculate dynamic range is given here.)

    |--------+-------|
    | Format | DR    |
    |--------+-------|
    | FP32   | 83.38 |
    | BF16   | 78.57 |
    | FP16   | 12.04 |
    |--------+-------|
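Here is a sketch of the calculation, using NumPy for FP32 and FP16 and computing the BF16 limits from its 8 exponent bits and 7 fraction bits. (The smallest_subnormal attribute assumes a reasonably recent version of NumPy.)

    import numpy as np

    def dynamic_range_decades(fmt):
        # log10 of the ratio of the largest to smallest positive representable number
        info = np.finfo(fmt)
        return float(np.log10(float(info.max) / float(info.smallest_subnormal)))

    print(dynamic_range_decades(np.float32))   # about 83.38
    print(dynamic_range_decades(np.float16))   # about 12.04

    bf16_max = (2 - 2.0**-7) * 2.0**127        # largest finite BF16 value
    bf16_min = 2.0**-133                       # smallest positive BF16 subnormal
    print(np.log10(bf16_max / bf16_min))       # about 78.57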

Comparison to posits

The precision and dynamic range of posit numbers depends on how many bits you allocate to the maximum exponent, denoted es by convention. (Note “maximum.” The number of exponent bits varies for different numbers.) This post explains the anatomy of a posit number.

Posit numbers can achieve more precision and more dynamic range than IEEE-like floating point numbers with the same number of bits. Of course there’s no free lunch. Posits represent large numbers with low precision and small numbers with high precision, but this trade-off is often what you’d want.

For an n-bit posit, the number of fraction bits near 1 is n – 2 – es and so epsilon is 2 to the exponent es – n – 2. The dynamic range is

\log_{10}\left( \left(2^{2^{es}}\right)^{2n-4} \right) = (2n-4)\, 2^{es}\log_{10}2

which is derived here. The dynamic range and epsilon values for 16-bit posits with es ranging from 1 to 4 are given in the table below.

    |----+--------+-----------|
    | es |     DR |   epsilon |
    |----+--------+-----------|
    |  1 |  16.86 | 0.0000076 |
    |  2 |  33.82 | 0.0000153 |
    |  3 |  37.43 | 0.0000305 |
    |  4 | 143.86 | 0.0000610 |
    |----+--------+-----------|

For all the values of es above, a 16-bit posit number has a smaller epsilon than either FP16 or BF16. The dynamic range of a 16-bit posit is larger than that of FP16 for all values of es, and greater than that of BF16 and FP32 when es = 4.

What is proof-of-work?

The idea of proof of work (PoW) was first explained in a paper by Cynthia Dwork and Moni Naor [1], though the term “proof of work” came later [2]. It was first proposed as a way to deter spam, but it’s better known these days through its association with cryptocurrency.

If it cost more to send email, even a fraction of a cent per message, that could be enough to deter spammers. So suppose you want to charge anyone $0.005 to send you an email message. You don’t actually want to collect their money, you just want proof that they’d be willing to spend something to email you. You’re not even trying to block robots, you just want to block cheap robots.

So instead of asking for a micropayment, you could ask the sender to solve a puzzle, something that would require around $0.005 worth of computing resources. If you’re still getting too much spam, you could increase your rate by giving senders a harder puzzle.

Dwork and Naor list several possible puzzles. The key is to find a puzzle that takes a fair amount of effort to solve but the solution is easy to verify.

Bitcoin uses hash problems for proof-of-work puzzles. Cryptographic hash functions are difficult to predict, and so you can’t do much better than brute force search if you want to come up with input whose hashed value has a specified pattern.

The goal is to add a fixed amount of additional text to a message such that when the hash function is applied, the resulting value is in some narrow range, such as requiring the first n bits to be zeros. The number n could be adjusted over time as needed to calibrate the problem difficulty. Verifying the solution requires computing only one hash, but finding the solution requires computing 2^n hashes on average.
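Here is a minimal sketch of this kind of puzzle using SHA-256. (Bitcoin’s actual scheme hashes block headers with double SHA-256 against a more finely grained difficulty target; this is only meant to show the idea.)

    import hashlib
    from itertools import count

    def leading_zero_bits(digest):
        # Count the leading zero bits of a hash digest.
        bits = 0
        for byte in digest:
            if byte == 0:
                bits += 8
            else:
                bits += 8 - byte.bit_length()
                break
        return bits

    def prove(message, n):
        # Find a nonce such that hash(message + nonce) starts with n zero bits.
        # This takes about 2**n hashes on average.
        for nonce in count():
            digest = hashlib.sha256(message + str(nonce).encode()).digest()
            if leading_zero_bits(digest) >= n:
                return nonce

    def verify(message, nonce, n):
        # Verification takes a single hash.
        digest = hashlib.sha256(message + str(nonce).encode()).digest()
        return leading_zero_bits(digest) >= n

    msg = b"an otherwise free email message"
    nonce = prove(msg, 20)
    print(nonce, verify(msg, nonce, 20))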

[1] Cynthia Dwork and Moni Naor (1993). “Pricing via Processing, or, Combatting Junk Mail”. Advances in Cryptology, CRYPTO ’92. Lecture Notes in Computer Science No. 740. Springer: 139–147.

[2] Markus Jakobsson and Ari Juels (1999). “Proofs of Work and Bread Pudding Protocols”. Communications and Multimedia Security. Kluwer Academic Publishers: 258–272.

Technological context

As you read this blog post, you probably have a hierarchy of contexts in the back of your mind. It comes so naturally to you that you don’t consciously think of it.

If you’re reading this in a web browser, you probably know what browser you’re using. If not, you’re at least aware that you are using a browser, even if you forget momentarily which one you have open. And you probably know what operating system is hosting your browser. You understand that you are reading a page on my site, that this page is not your browser, but content hosted by your browser. If you’ve subscribed via RSS or email, you know what email or RSS client you’re using and understand how this post is organized with respect to your other content.

Some people do not have this kind of context. Anything on their computer is simply “The Computer.” They don’t really understand what an operating system, web browser, or email client are. And they don’t need to know, most of the time. They can get their work done, but then occasionally they have inexplicable problems.

I’m not saying this to criticize or make fun of the people who don’t have the kind of technological context I’m talking about. It’s a remarkable achievement that software has gotten so easy to use that people can get along without knowing much technological context. But if this sort of thing is second nature to you, you might have a hard time understanding how a large number of people work.

You probably take it for granted that you can access the same web site from different computers. Some people do not. Their desktop at work is one computer, and their iPhone is a different computer. They don’t really understand what a web site is.

I know what a web browser is because I have been using computers since before there was a web. Old timers know what various technological components are because they’ve seen them develop. And “digital natives” know how to get things done because they’ve grown up with computers, though their gaps in context show occasionally. Seems like the people in the middle would have the hardest time, not having grown up with the technology but not having watched it develop either.

I’m writing this because I’m becoming increasingly aware of how difficult life can be for people who don’t have adequate mental models for technology. I imagine most of my readers are tech savvy, and may have a hard time seeing some of the same things that I’ve had a hard time seeing: that a lot of people don’t understand things we take for granted.

Source http://dilbert.com/strip/1995-06-24

It used to be that anybody who used a computer had to know some basic things. If you were a Unix user a generation ago, you might not know anything about the internals of Unix, but you at least knew that you were a Unix user. There were wizards and mere mortals, but the two groups shared more context than the most tech savvy and least tech savvy share today.

It’s good that people don’t need to know as much context, but occasionally it produces bewildering situations, both for the user and the person trying to help them.

Windows command line tips

I use Windows, Mac, and Linux, each for different reasons. When I run Windows, I like to have a shell that works sorta like bash, but doesn’t run in a subsystem. That is, I like to have the utility programs and command editing features that I’m used to from bash on Mac or Linux, but I want to run native Windows code and not a little version of Linux hosted by Windows. [1]

It took a while to find something I’m happy with. It’s easier to find Linux subsystems like Cygwin. But cmder does exactly what I want. It’s a bash-like shell running on Windows. I also use Windows ports of common Unix utilities. Since these are native Windows programs, I can run them and other Windows applications in the same environment. No error messages along the lines of “I know it looks like you’re running Windows, but you’re not really. So you can’t open that Word document.”

I’ve gotten Unix-like utilities for Windows from several sources. GOW (Gnu on Windows) is one source. I’ve also collected utilities from other miscellaneous sources.

Tab completion and file association

There’s one thing that was a little annoying about cmder: tab completion doesn’t work if you want to enter a file name. For example, if you want to open a Word document foo.docx from the basic Windows command prompt cmd.exe, you can type fo followed by the tab key and the file will open if foo.docx is the first file in your working directory that begins with “fo.”

In cmder, tab completion works for programs first, and then for file names. If you type in fo followed by the tab key, cmder will look for an application whose name starts with “fo.” If you wanted to move foo.docx somewhere, you could type mv fo and tab. In this context, cmder knows that “fo” is the beginning of a file name.

On Mac, you use the open command to open a file with its associated application. For example, on the Mac command line you’d type open foo.docx rather than just foo.docx to open the file with Word.

If there were something like open on Windows, then tab completion would work in cmder. And there is! It’s the start command. In fact, if you’re accustomed to using open on Mac, you could alias start to open on Windows [2]. So in cmder, you can type start fo and hit tab, and get tab completion for the file name.

Miscellaneous

The command assoc shows you which application is associated with a file extension. (Include the “.” when using this command. So you’d type assoc .docx rather than assoc docx.)

You can direct Windows shell output to the clip command to put the output onto the Windows clipboard.

The control command opens the Windows control panel.
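For example, a short session might look something like this. (The association shown for .docx is just what a machine with Word installed might report; yours may differ.)

    > assoc .docx
    .docx=Word.Document.12

    > dir /b | clip

    > start foo.docx

    > control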

This post shows how to have branching logic in an Emacs config file so you can use the same config file across operating systems.

[1] Usually on Windows I want to run Windows. But if I do want to run Linux without having to go to another machine, I use WSL (Windows Subsystem for Linux) and I can use it from cmder. Since cmder supports multiple tabs, I can have one tab running ordinary cmd.exe and another tab running bash on WSL.

[2] In the directory containing cmder.exe, edit the file config/user-aliases.cmd. Add a line open=start $1.