MD5 hash collision example

Marc Stevens gave an example of two alphanumeric strings that differ in only one byte that have the same MD5 hash value. It may seem like beating a dead horse to demonstrate weaknesses in MD5, but it’s instructive to study the flaws of broken methods. And despite the fact that MD5 has been broken for years, lawyers still use it.

The example claims that

TEXTCOLLBYfGiJUETHQ4hAcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak

and

TEXTCOLLBYfGiJUETHQ4hEcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak

have the same hash value.

This raises several questions.

Are these two strings really different, and if so, how do they differ? If you stare at the strings long enough you can see that they do indeed differ by one character. But how could you compare long strings like this in a more automated way?

How could you compute the MD5 hash values of the strings to verify that they are the same?

The following Python code addresses the questions above.

from hashlib import md5
from difflib import ndiff

def showdiff(a, b):
    for i,s in enumerate(ndiff(a, b)):
        if s[0]==' ': continue
        elif s[0]=='-':
            print(u'Delete "{}" from position {}'.format(s[-1],i))
        elif s[0]=='+':
            print(u'Add "{}" to position {}'.format(s[-1],i))    

a = "TEXTCOLLBYfGiJUETHQ4hAcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak"
b = "TEXTCOLLBYfGiJUETHQ4hEcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak"

showdiff(a, b)

ahash = md5(a.encode('utf-8')).hexdigest()
bhash = md5(b.encode('utf-8')).hexdigest()
assert(ahash == bhash)

The basis of the showdiff function was from an answer to a question on Stack Overflow.

The output of the call to showdiff is as follows.

Delete "A" from position 21
Add "E" to position 22

This means we can form string b from a by changing the ‘A’ in position 21 to an ‘E’. There was only one difference between the two strings in this example, but showdiff could be useful for understanding more complex differences.

The assert statement passes because both strings hash to faad49866e9498fc1719f5289e7a0269.

Related posts

2 thoughts on “MD5 hash collision example

  1. > differ by only one byte

    Examining the binary values of the letter ‘A’ and the letter ‘E’, the difference is actually 1 bit.

  2. difflib is one of the hidden jewels of the Python standard library. I use it to implement “Perhaps you meant ____?” feedback in a few places in my API.

    Eg, if someone specifies “hamming”, I want the ValueError to hint “Perhaps you meant Hamming?”. If difflib doesn’t find a match to the main parameter list, I difflib a set of alternatives, so entering “manhattan” or “taxicab” will also suggest “Perhaps you meant Hamming?”.

Leave a Reply

Your email address will not be published. Required fields are marked *