Hash algorithms are widely used for a variety of tasks, including verifying data integrity, authenticating passwords, and signing certificates. To the layman, a hash algorithm seems like a magic sausage grinder: dump in whatever you have and out comes a unique number. The trouble is, the internal mechanics of a hash algorithm are less like a simple mechanical machine and more like a Rube Goldberg contraption. The bit shifting and arithmetic inside would be incomprehensible to most of the people who place blind faith in the algorithm’s ability to generate “unique” numbers.

The first illusion to dispel is uniqueness. Hash algorithms do not generate unique numbers, nor do they claim to. Looked at logically, the hash space is vastly smaller than the space of possible inputs. MD5, for example, generates 128-bit hashes. If your data is 129 bits long, there are 2^129 possible inputs but only 2^128 possible hash values, so by the pigeonhole principle collisions must exist. For any data significantly larger than the hash size, there will always be possible collisions differing by only a few bits or bytes.
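To see the pigeonhole argument in miniature, here is a toy Python sketch (my own illustration, not a real attack) that truncates MD5 down to a single byte, shrinking the hash space to 256 values so a collision can be brute-forced in moments:

```python
import hashlib

def truncated_md5(data: bytes, n_bytes: int = 1) -> bytes:
    """Keep only the first n_bytes of the MD5 digest, shrinking the
    hash space so collisions become easy to find by brute force."""
    return hashlib.md5(data).digest()[:n_bytes]

# Birthday search: with only 256 possible truncated hashes, feeding in
# more than 256 distinct messages guarantees a collision.
seen: dict[bytes, bytes] = {}
collision = None
for i in range(1000):
    msg = f"message-{i}".encode()
    h = truncated_md5(msg)
    if h in seen:
        collision = (seen[h], msg)
        break
    seen[h] = msg

print(collision)  # two distinct messages sharing a truncated hash
```

Real MD5 has 2^128 values rather than 256, so brute force is hopeless; the published attacks instead exploit weaknesses in the algorithm's internal structure.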

So what use are hash functions? Functions such as MD5 or the SHA family belong to a special subset of hash functions known as “cryptographic hash algorithms,” which have four properties that make them useful. From Wikipedia:

“*The ideal hash function has four main properties: it is easy to compute the hash for any given data, it is extremely difficult to construct a text that has a given hash, it is extremely difficult to modify a given text without changing its hash, and it is extremely unlikely that two different messages will have the same hash. These requirements call for the use of advanced cryptography techniques, hence the name.*”

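The first and third properties are easy to see in practice. Using Python’s hashlib and the classic “quick brown fox” test vectors, computing an MD5 is a one-liner, and changing a single character scrambles the entire digest:

```python
import hashlib

# "Easy to compute": hashing any input is trivial.
h1 = hashlib.md5(b"The quick brown fox jumps over the lazy dog").hexdigest()

# "Hard to modify without changing its hash": flipping one character
# ("dog" -> "cog") produces a completely different digest.
h2 = hashlib.md5(b"The quick brown fox jumps over the lazy cog").hexdigest()

print(h1)  # 9e107d9d372bb6826bd81d3542a419d6
print(h2)  # 1055d3e698d289f2af8663725127bd4b
```

The second property, that it is extremely difficult to construct a text with a given hash, is exactly the one the MD5 attacks below chip away at.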

For MD5 in particular, the property under attack is the difficulty of constructing a text with a given hash. This is a major danger: the ability to generate data matching a hash at will enables substitution attacks, where modified or malicious data can be swapped in for data signed or verified using an MD5 hash, and still pass hash verification.
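As a hypothetical sketch of why that matters, consider a naive verifier that trusts an MD5 digest alone (the function and messages here are my own illustration, not any real system). Any colliding payload an attacker constructs would sail through the exact same check:

```python
import hashlib

def passes_verification(data: bytes, expected_md5_hex: str) -> bool:
    """Accept any data whose MD5 matches the published digest.
    If an attacker can construct different data with the same MD5,
    this check is satisfied and the substitution goes unnoticed."""
    return hashlib.md5(data).hexdigest() == expected_md5_hex

original = b"pay $100 to Alice"
published = hashlib.md5(original).hexdigest()

print(passes_verification(original, published))              # True
print(passes_verification(b"pay $100 to Mallory", published))  # False, unless it collides
```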

Several publications over the last few years have dented MD5’s armor, and it is now showing signs of breakage. Bruce Schneier has some more in-depth reporting on all things cryptographic, but the most interesting real-world examples of collision research are the PostScript, X.509, and executable collisions.

The PostScript and executable collisions are nice demonstrations; both, to put it simply, involve changing small bits of data inside a kind of conditional. In the former it causes a different message to render when the PostScript is interpreted, and in the latter it causes a different segment of code to execute. The X.509 collision is a little more interesting in that it has a more realistic theoretical attack vector: enabling man-in-the-middle attacks on HTTPS.

A lot of this research has been spurred on by the Wang et al. paper, and it is important to note that SHA-1 has had some weaknesses exposed as well. So far there have been no demonstrations of simultaneous collisions in multiple hash spaces. Since it is logically harder to collide in more than one hash space at once, it is prudent to use multiple hash algorithms for verification purposes, and to use the most robust hash algorithms available.
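One way to follow that advice, sketched in Python with hashlib (the helper names are my own), is to publish and check digests from several algorithms at once, so an attacker would need a simultaneous collision in every hash space:

```python
import hashlib

def multi_digest(data: bytes) -> dict[str, str]:
    """Hash the same data with several algorithms at once."""
    return {name: hashlib.new(name, data).hexdigest()
            for name in ("md5", "sha1", "sha256")}

def verify(data: bytes, expected: dict[str, str]) -> bool:
    """Data passes only if every digest matches, so a collision in a
    single hash space is no longer enough for a substitution attack."""
    return multi_digest(data) == expected

expected = multi_digest(b"release-1.0.tar.gz contents")
print(verify(b"release-1.0.tar.gz contents", expected))  # True
print(verify(b"tampered contents", expected))            # False
```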

New research has been released today at the Chaos Computer Club’s conference regarding MD5 collision attacks. I have not finished reviewing the research but will be blogging about it after I have a chance to digest it (no pun intended).