Identifying hash algorithms

Given a hash value, can you determine what algorithm produced it? Or what algorithm probably produced it?

Obviously if a hash value is 128 bits long, then a 128-bit algorithm produced it. Such a hash value might have been produced by MD5, but not by SHA-1, because the former produces 128-bit hashes and the latter a 160-bit hash.
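As a quick sanity check, Python's hashlib confirms these digest lengths:

```python
import hashlib

msg = b"hello world"
md5_bits = len(hashlib.md5(msg).digest()) * 8    # 128
sha1_bits = len(hashlib.sha1(msg).digest()) * 8  # 160
print(md5_bits, sha1_bits)
```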

The more interesting question is this: given an n-bit hash, can you tell which n-bit hash algorithm it probably came from? You will find contradictory answers online. Some people confidently say no, that a hash value is for practical purposes a random selection from its range. Others say yes, and point to software that reports which algorithm the hash probably came from. Which is it?

Some hashing software has a custom output format, such as sticking a colon at a fixed position inside the hash value. Software such as hashid uses regular expressions to spot these custom formats.

But assuming you have a hash value that is simply a number, you cannot know which hash algorithm produced it, other than narrowing it to a list of known algorithms that produce a number of that length. If you could know, this would indicate a flaw in the hashing algorithm.

So, for example, a 160-bit hash value could come from SHA-1, or it could come from RIPEMD-160, Haval-160, Tiger-160, or any other hash function that produces 160-bit output.

To say which algorithm probably produced the hash, you need context, some sort of modeling assumption. In general SHA-1 is the most popular 160-bit hash algorithm, so if you have nothing else to go on, your best guess for the origin of a 160-bit hash value would be SHA-1. But RIPEMD is part of the Bitcoin standard, and so if you find a 160-bit hash value in the context of Bitcoin, it’s more likely to be RIPEMD. There must be contexts in which Haval-160 and Tiger-160 are more likely, though I don’t know what those contexts would be.

Barring telltale formatting, software that tells you which hash functions most likely produced a hash value is simply reporting the most popular hash functions for the given length.

For example, I produced a 160-bit hash of “hello world” using RIPEMD-160

   echo -n "hello world" | openssl dgst -rmd160

then asked hashid where it came from.

    hashid '98c615784ccb5fe5936fbc0cbe9dfdb408d92f0f'
    Analyzing '98c615784ccb5fe5936fbc0cbe9dfdb408d92f0f'
    [+] SHA-1
    [+] Double SHA-1
    [+] RIPEMD-160
    [+] Haval-160
    [+] Tiger-160
    [+] HAS-160
    [+] LinkedIn
    [+] Skein-256(160)
    [+] Skein-512(160)

I got exactly the same output when I hashed “Gran Torino” and “Henry V” because the output doesn’t depend on the hashes per se, only their length.

Whether software can tell you where a hash probably came from depends on your notion of “probably.” If you find a 160-bit hash in the wild, it’s more likely to have come from SHA-1 than RIPEMD-160. But if you were to write a program that generates random text and hashes it with either SHA-1 or RIPEMD-160 chosen at random, software that always guesses the most popular algorithm would fail badly.


Testing random number generators

Random number generators are subtle. Unless the generator is a physical device, a random number generator (RNG) is technically a pseudorandom number generator (PRNG): a deterministic algorithm designed to mimic randomness.

Suppose you have a PRNG that produces the digits 0 through 9. How might you test the output to see whether it (acts like it) is random? An obvious test would be to see how often each digit is produced. As the number of samples n increases, you’d expect the frequency of each digit to approach n/10.

Starting with χ²

If your “random” number generator simply cycles through the digits 0 to 9 in order, the frequencies will match expectations. In fact, they will match too well. A two-sided χ² test will catch this problem: the χ² statistic will be too small, indicating a suspiciously even distribution.

Nick Lord [1] gives a construction with a much more subtle pattern. In his example, the frequencies also converge to the expected values, but the χ² statistic diverges to ∞ as n increases. So rather than producing too small a χ² value, his example produces too large a value. This shows that the χ² test is a stronger test than simply looking at frequencies, but it’s only a start.
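Here’s a minimal sketch of the frequency-counting χ² test in Python, applied both to a cycling “generator” and to Python’s own PRNG. (The function name and sample size are my own choices.)

```python
import random

def chi_square(digits):
    """Chi-square statistic for uniformity of digits 0-9."""
    n = len(digits)
    expected = n / 10
    counts = [0] * 10
    for d in digits:
        counts[d] += 1
    return sum((c - expected) ** 2 / expected for c in counts)

# Cycling through 0..9 matches the expected frequencies exactly,
# so the statistic is 0: suspiciously small, since under true
# randomness its mean is 9, the number of degrees of freedom.
cycle = [i % 10 for i in range(10_000)]
print(chi_square(cycle))  # 0.0

random.seed(42)
genuine = [random.randrange(10) for _ in range(10_000)]
print(chi_square(genuine))  # a moderate value, typically near 9
```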

RNG testing service

There are more sophisticated tests, and standard suites of tests: DIEHARDER, NIST STS, TestU01, etc. We’ve run these test suites for several clients. More on that here.

It’s curious how this work comes in spurts. We had a run of clients wanting RNG testing, then nothing for a long time, and now we have a new RNG testing project in the queue.


[1] A Chi-Square Nightmare. The Mathematical Gazette, Vol. 76, No. 476 (July 1992), p. 274.

Limitations on Venn diagrams

Why do Venn diagrams almost always show the intersections of three sets and not more? Can Venn diagrams be generalized to show all intersections of more sets?

That depends on the rules you give yourself for generalization. If you require that your diagram consist of circles, then three is the limit. As John Venn put it in the original paper on Venn diagrams [1]

Beyond three terms circles fail us, since we cannot draw a fourth circle which shall intersect three others in the way required.

But Mr. Venn noted that you could create what we now call a Venn diagram using four ellipses and included the following illustration.

Venn diagram with four ellipses by John Venn

(It’s confusing that there is an X inside the diagram. Venn meant that to be an asterisk and not the same symbol as the X outside. He says in the paper “Thus the one which is asterisked is instantly seen to be ‘X that is Y and Z, but is not W’.” Maybe someone else, like the publisher, drew the diagram for him.)

So the answer to whether, or how far, it is possible to generalize the classic Venn diagram depends on permissible generalizations of a circle. If you replace circles with arbitrary closed curves then Venn diagrams exist for all orders. If you demand the curves have some sort of symmetry, there are fewer options. It’s possible to make a Venn diagram from five ellipses, and that may be the limit for ellipses.

A Venn diagram is a visualization device, and so an important question is what is the limit for the use of Venn diagrams as an effective visualization technique. This is an aesthetic / pedagogical question, and not one with an objective answer, but in my mind the answer is four. Venn’s diagram made from four ellipses is practical for visualization, though it takes more effort to understand than the typical three-circle diagram.

Although my upper bound of four is admittedly subjective, it may be possible to make it objective post hoc. A Venn diagram made from n curves divides the plane into 2ⁿ regions [2]. In order to use more than four curves, you either have to gerrymander the curves or else tolerate some regions being much smaller than others. The former makes the diagram hard to understand, and the latter makes it hard to label the regions.

I suspect that if you make precise what it means for a curve to be simple [3], such as using ellipses or convex symmetric curves, and specify a maximum ratio between the largest and smallest bounded regions, then four curves will be the maximum.

***

[1] John Venn. On the Diagrammatic and Mechanical Representation of Propositions and Reasonings. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science. July 1880.

[2] This includes the outside of the diagram representing the empty set. The diagram shows the intersection of 0, 1, 2, …, n sets, and the intersection of no sets is the empty set. This last statement might seem like an arbitrary definition, but it can be justified using category theory.

[3] Simple in the colloquial sense, which is more restrictive than the technical mathematical sense of a simple curve.

Edit distance


I was just talking to a colleague about edit distance because it came up in a project we’re working on.

Technically, we were discussing Levenshtein distance. It sounds more impressive to say Levenshtein distance, but it’s basically how much editing effort it would take to turn one block of text into another.

Edit distance is a fairly simple idea, and very useful. For example, the Morse code practice site LCWO [1] reports the Levenshtein distance between what you transcribe and the correct response. This better matches your intuitive idea of how well you did than some more direct ways of measuring accuracy. For example, transposing two characters is less of an error than typing two unrelated characters.

The LCWO site is unusual in that it explicitly reports Levenshtein distance. It’s far more common for the algorithm to be used behind the scenes and never mentioned by name.
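For the curious, here is a minimal sketch of the textbook dynamic-programming computation of Levenshtein distance, keeping only two rows of the table. (The function name is my own choice.)

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```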


[1] Why “LCWO”? It stands for “learn CW online.” In the ham radio community, Morse code is usually called CW for historical reasons. CW stands for “continuous wave.” In principle you’re always transmitting at the same frequency, turning a carrier wave on and off, not modulating it. The details are a bit more interesting, as I write about here.

Birthday problem approximation

The birthday problem is a party trick with serious practical applications. It’s well known to people who have studied probability, but the general public is often amazed by it.

If you have a group of 23 people, there’s a 50-50 chance that at least two people have the same birthday. With a larger group, say 30, it’s quite likely two birthdays are the same. Not only is this a theoretical result, based on certain modeling assumptions, it actually works in practice, essentially as predicted.

Variations of the birthday problem come up routinely in applications. For example, in cryptography it’s important to know the probability of secure hash collisions. Hash functions are deterministic, but for practical purposes they act random. If you are hashing into a space of N possible hash values, you can expect to compute about √N items before two items have the same hash value.

The square root rule of thumb is very useful. For example, if you’re computing 128-bit hash values, there’s about a 50-50 chance of seeing two duplicate hash values after hashing about 2⁶⁴ items.

The square root heuristic works well for large N, but gives mediocre results for N as small as 365. When applied to the original birthday problem, it predicts even odds for seeing a pair of equal birthdays in a group of 19 people. That’s a little low, but not too far off.

As useful as the square root rule is, it is only good for finding when the probability of duplication is 1/2. What if you’d like to know when the probability of a collision is, say, 0.01?

Let N be the number of possible options and let r be the number of items chosen independently from the set of N options. Let P(N, r) be the probability that all r choices are distinct. Then

P(N, r) ≈ exp( −r²/2N).

This approximation [1] is valid when N is large and r is small relative to N. We could be more precise about the error bounds, but suffice it to say that bigger N is better.

When N = 365 and r = 23, the approximation above computes the probability that all 23 choices are distinct as 0.48, matching the canonical birthday problem and showing an improvement over the square root heuristic.
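Here is a quick numerical check of the approximation against the exact product formula, along with the inversion that answers the question above about a collision probability of 0.01. (The function names are my own.)

```python
import math

def p_distinct_exact(N, r):
    """Probability that r independent draws from N options are all distinct."""
    p = 1.0
    for i in range(r):
        p *= (N - i) / N
    return p

def p_distinct_approx(N, r):
    """The approximation exp(-r^2 / 2N)."""
    return math.exp(-r * r / (2 * N))

print(p_distinct_exact(365, 23))   # about 0.493
print(p_distinct_approx(365, 23))  # about 0.484

def r_for_collision(N, p):
    """Invert the approximation: draws needed for collision probability p."""
    return math.sqrt(2 * N * math.log(1 / (1 - p)))

print(r_for_collision(365, 0.01))  # about 2.7, so 3 people
```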


[1] Anthony C. Robin. The Birthday Distribution: Some More Approximations. The Mathematical Gazette, Vol. 68, No. 445 (October, 1984), pp. 204–206

(1 − z) / (1 + z)

“I keep running into the function f(z) = (1 − z)/(1 + z).” I wrote this three years ago and it’s still true.

This function came up implicitly in the previous post. Ramanujan’s excellent approximation for the perimeter of an ellipse with semi-axes a and b begins by introducing

λ = (a − b)/(a + b).

If the problem is scaled so that a = 1, then λ = f(b). Kummer’s series for the exact perimeter of an ellipse begins by introducing the same variable squared.

As this post points out, the function f(z) comes up in the Smith chart from electrical engineering, and is also useful in mental calculation of roots. It also comes up in mentally calculating logarithms.

The function f(z) is also useful for computing the tangent of angles near a right angle because

tan(π/4 − z) ≈ f(z)

with an error on the order of z³. So when z is small, the error is very, very small, much like the approximation sin(x) ≈ x for small angles.
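A quick numerical check of the tangent approximation, showing the error shrinking like z³:

```python
import math

def f(z):
    return (1 - z) / (1 + z)

for z in [0.1, 0.01, 0.001]:
    err = abs(math.tan(math.pi / 4 - z) - f(z))
    print(z, err)  # error shrinks roughly like z**3
```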

Error in Ramanujan’s approximation for ellipse perimeter

Ramanujan discovered an incredibly accurate approximation for the perimeter of an ellipse. This post will illustrate how accurate the approximation is and push its limits.

As with all computations involving ellipses, the error of Ramanujan’s approximation increases as eccentricity increases. But the error increases slowly, asymptotically approaching an upper bound that is remarkably small.

Let a and b be the semi-major and semi-minor axes of an ellipse. Then Ramanujan’s approximation for the perimeter of the ellipse is

\pi (a + b) \left(1 + \frac{3\lambda^2}{10 + \sqrt{4 - 3\lambda^2}}\right)

where λ = (a − b)/(a + b).

Example

To illustrate how accurate the approximation is, let’s apply it to a very large, highly eccentric ellipse: the orbit of Sedna, a dwarf planet discovered in 2003. Sedna has the most eccentric orbit of any dwarf planet, 0.8496, compared to 0.2488 for Pluto. Sedna is also about 12 times farther from the sun than Pluto.

The semi-major axis of Sedna’s orbit is 76 billion kilometers. Eccentricity e corresponds to aspect ratio √(1 − e²) (see this post), and so the semi-minor axis is about 40 billion kilometers. Let’s assume the orbit of Sedna is perfectly elliptical, which it is not, and that the semi-axes stated above are exact, which they are not. Then the length of Sedna’s orbit is on the order of 373 billion kilometers, and Ramanujan’s approximation is off by only about 70 kilometers.
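Here is a sketch of this calculation, comparing Ramanujan’s formula with the arc-length integral evaluated by Simpson’s rule, using the rounded semi-axes above. (The function names and step count are my own choices.)

```python
import math

def ramanujan_perimeter(a, b):
    """Ramanujan's approximation to the perimeter of an ellipse."""
    lam = (a - b) / (a + b)
    return math.pi * (a + b) * (1 + 3 * lam**2 / (10 + math.sqrt(4 - 3 * lam**2)))

def perimeter_by_integration(a, b, n=10_000):
    """Perimeter from the arc-length integral, via Simpson's rule."""
    h = (math.pi / 2) / n
    def g(t):
        return math.sqrt((a * math.sin(t))**2 + (b * math.cos(t))**2)
    s = g(0.0) + g(math.pi / 2)
    for k in range(1, n):
        s += (4 if k % 2 else 2) * g(k * h)
    return 4 * s * h / 3

a, b = 76e9, 40e9  # Sedna's semi-axes in kilometers, rounded
exact = perimeter_by_integration(a, b)
approx = ramanujan_perimeter(a, b)
print(exact, approx, exact - approx)  # agreement to within about 100 km
```

Out of a perimeter of hundreds of billions of kilometers, the two numbers differ by less than the length of a marathon course.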

Error

Here’s a plot of the relative error when b = 1 and a varies.

The error appears to be approaching an asymptote, and in fact it is. The error is bounded by 4/π − 14/11 = 0.00051227…, as proved here.

The Cauchy distribution’s counter-intuitive behavior

Someone with no exposure to probability or statistics likely has an intuitive sense that averaging random variables reduces variance, though they wouldn’t state it in those terms. They might, for example, agree that the average of several test grades gives a better assessment of a student than a single test grade. But data from a Cauchy distribution doesn’t behave this way.

Averages and scaling

If you have four independent random variables, each normally distributed with the same scale parameter σ, then their average is also normally distributed but with scale parameter σ/2.

If you have four independent random variables, each Cauchy distributed with the same scale parameter σ, then their average is also Cauchy distributed with the exact same scale parameter σ.

So the normal distribution matches common intuition, but the Cauchy distribution does not.
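A quick simulation illustrates the contrast. Since a Cauchy distribution has no standard deviation, the sketch below uses the interquartile range (IQR) to measure spread; for a standard Cauchy the IQR is 2. (The sample sizes and seed are arbitrary choices of mine.)

```python
import math
import random

random.seed(20250101)

def cauchy():
    # standard Cauchy sample via the inverse CDF
    return math.tan(math.pi * (random.random() - 0.5))

def iqr(xs):
    xs = sorted(xs)
    n = len(xs)
    return xs[3 * n // 4] - xs[n // 4]

# Averaging Cauchy samples in groups of four does not shrink the spread.
raw_c = [cauchy() for _ in range(40_000)]
avg_c = [sum(raw_c[i:i + 4]) / 4 for i in range(0, 40_000, 4)]
print(iqr(raw_c), iqr(avg_c))  # both near 2, the IQR of a standard Cauchy

# Averaging normal samples in groups of four halves the spread.
raw_n = [random.gauss(0, 1) for _ in range(40_000)]
avg_n = [sum(raw_n[i:i + 4]) / 4 for i in range(0, 40_000, 4)]
print(iqr(raw_n), iqr(avg_n))  # near 1.35 and 0.67
```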

In the case of random variables with a normal distribution, the scale parameter σ is also the standard deviation. In the case of random variables with a Cauchy distribution, the scale parameter σ is not the standard deviation because Cauchy random variables don’t have a variance, so they don’t have a standard deviation.

Modeling

Some people object that nothing really follows a Cauchy distribution because the Cauchy distribution has no mean or variance. But nothing really follows a normal distribution either. All probability distributions are idealizations. The question to ask of any probability distribution is whether it adequately captures the aspect of reality it is being used to model.

Mean

Suppose some phenomenon appears to behave like it has a Cauchy distribution, with no mean. Alternately, suppose the phenomenon has a mean, but this mean is so variable that it is impossible to estimate. There’s no practical difference between the two.

Variance

And in the alternate case, suppose there is a finite variance, but the variance is so large that it is impossible to estimate. If you take the average of four observations, the result is still so variable that its variance is impossible to estimate. You’ve cut the theoretical variance by a factor of four, but that makes no difference in practice. Again this is practically indistinguishable from a Cauchy distribution.

Truncating

Now suppose you want to tame the Cauchy distribution by throwing out samples with absolute value greater than M. Now you have a truncated Cauchy distribution, and it has finite mean and variance.

But how do you choose M? If you don’t have an objective reason to choose a particular value of M, you would hope that your choice doesn’t matter too much. And that would be the case for a thin-tailed probability distribution like the normal, but it’s not true of the Cauchy distribution.

The variance of the truncated distribution is approximately proportional to M (about 2M/π for large M), so by choosing M you choose the variance. If you double your cutoff for outliers that are to be discarded, you approximately double the variance of what’s left. Your choice of M matters a great deal.
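The truncated variance has a closed form. With density proportional to 1/(1 + x²) on [−M, M], the second moment integrates to M/arctan(M) − 1, which grows linearly in M, so doubling the cutoff roughly doubles the variance:

```python
import math

def truncated_cauchy_variance(M):
    """Variance of a standard Cauchy truncated to [-M, M].
    The mean is 0, and E[X^2] works out to M/arctan(M) - 1."""
    return M / math.atan(M) - 1

for M in [10, 100, 1000]:
    v1 = truncated_cauchy_variance(M)
    v2 = truncated_cauchy_variance(2 * M)
    print(M, v1, v2, v2 / v1)  # the ratio approaches 2 as M grows
```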


Arithmetic, Geometry, Harmony, and Gold

I recently ran across a theorem connecting the arithmetic mean, geometric mean, harmonic mean, and the golden ratio. Each of these comes up fairly often, and there are elegant connections between them, but I don’t recall seeing all four together in one theorem before.

Here’s the theorem [1]:

The arithmetic, geometric, and harmonic means of two positive real numbers are the lengths of the sides of a right triangle if, and only if, the ratio of the arithmetic to the harmonic mean is the Golden Ratio.

The proof given in [1] is a straightforward calculation, only slightly longer than the statement of the theorem.

The conclusion of the theorem stops short of saying how to construct the triangle, though this is a simple exercise, which we carry out here.

Given two positive numbers, a and b, the three means are defined as follows.

AM = (a + b)/2
GM = √ab
HM = 2ab/(a + b)

Denote the Golden Ratio by

φ = (1 + √5)/2.

Then the equation AM/HM = φ is equivalent to the quadratic equation

a² + (2 − 4φ)ab + b² = 0.

The means are all homogeneous functions of a and b, i.e. if we multiply a and b by a constant, we multiply the three means by the same constant. Therefore we can set one of the parameters to 1 without loss of generality. Setting b = 1 gives

a² + (2 − 4φ)a + 1 = 0

and so there are two solutions:

a = 2φ − 3

and

a = 2φ + 1.

However, there is in a sense only one solution: the two solutions are reciprocals of each other, reversing the roles of a and b. So while there are two solutions to the quadratic equation, there is only one triangle, up to similarity.
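A quick numerical check of the theorem, using the root a = 2φ + 1 with b = 1:

```python
import math

phi = (1 + math.sqrt(5)) / 2
a, b = 2 * phi + 1, 1  # one root of the quadratic above

am = (a + b) / 2
gm = math.sqrt(a * b)
hm = 2 * a * b / (a + b)

print(am / hm)               # phi
print(am**2, gm**2 + hm**2)  # equal: the means form a right triangle
```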

[1] Angelo Di Domenico. The Golden Ratio: The Right Triangle: And the Arithmetic, Geometric, and Harmonic Means. The Mathematical Gazette Vol. 89, No. 515 (July, 2005), p. 261

Ceva, cevians, and Routh’s theorem

I keep running into Edward John Routh (1831–1907). He is best known for the Routh-Hurwitz stability criterion but he pops up occasionally elsewhere. The previous post discussed Routh’s mnemonic for moments of inertia and his “stretch” theorem. This post will discuss his triangle theorem.

Before stating Routh’s theorem, we need to say what a cevian is. Giovanni Ceva (1647–1734) was an Italian geometer, best known for Ceva’s theorem, and for a construction in that theorem now known as a cevian.

A cevian is a line segment from a vertex of a triangle to the opposite side. Draw three cevians by connecting each vertex of a triangle to a point on its opposite side. If the cevians intersect at a point, Ceva’s theorem says something about how the lines divide the sides. If the cevians form a triangle, Routh’s theorem finds the area of that triangle.

Routh’s theorem is a generalization of Ceva’s theorem because if the cevians intersect at a common point, the area of the triangle formed is zero, and then Routh’s area equation implies Ceva’s theorem.

Let A, B, and C be the vertices of a triangle and let D, E, and F be the points where their cevians intersect the opposite sides.

Let x, y, and z be the ratios into which each side is divided by the cevians. Specifically let x = FB/AF, y = DC/BD, and z = EA/CE.

Then Routh’s theorem says the relative area of the green triangle formed by the cevians is

\frac{(xyz - 1)^2}{(xy + y + 1)(yz + z + 1)(zx + x + 1)}

If the cevians intersect at a point, the area of the triangle is 0, which implies xyz = 1, which is Ceva’s theorem.
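Routh’s formula is easy to check numerically. The sketch below evaluates it for the classic case x = y = z = 2, which gives the well-known one-seventh area triangle, and for medians (x = y = z = 1), where the cevians are concurrent:

```python
def routh_area(x, y, z):
    """Area of the cevian triangle relative to the original triangle,
    by Routh's theorem."""
    return (x * y * z - 1)**2 / ((x*y + y + 1) * (y*z + z + 1) * (z*x + x + 1))

print(routh_area(2, 2, 2))  # 1/7, the one-seventh area triangle
print(routh_area(1, 1, 1))  # 0: the medians meet at a point (Ceva)
```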
