Not strictly about cryptography, but closely related. This is about hash codes.
It was a long time ago. Maybe 52 years. I started as an engineering student at the university, and they wanted all of us to learn how to use the computer. Seriously, it was the computer. They had one. We all took the (non credit) course as part of freshman orientation.
You need to know right now that a large segment of the freshman engineering class had little to no interest in the computer. That meant that I and a few others (out of a class of hundreds) had the computer to ourselves. We spent a lot of our off time programming all sorts of stuff. And there was something about this computer. In fact, it was something about all computers in those days. There were no user accounts. And there were no passwords. You just sat down at the computer, and you were on.
Those were the good days. Later, working for a company, we obtained access to the university’s CDC-6600 through a phone link. It was shared access with user accounts. And passwords. I began to become acquainted with the concept of passwords. Later at another company we purchased shared access to another company’s computer. It was over a phone link using a printer terminal. You typed, and responses were printed. There was no screen. When you entered the password, the printer went back and x’ed out the password that had printed. Passwords had to be kept secret.
Subsequently there were VAX computers all around with multiple accounts on each, and I eventually ended up sharing a cube with a computer whiz who really knew what was going on inside. And he explained password implementation concepts. It made sense immediately, and I wondered that I had not thought of it myself. It goes like this:
The computer will not let you use its services unless you enter the correct password assigned to your account. So you enter the proper string of letters and digits. The computer must be comparing them with the expected string of letters and digits to determine you have entered the correct password. It must have a copy of the password stored somewhere.
Absolutely not. This would be a dangerous thing to do. If the password is stored somewhere, then some snoop, at least anybody with administrator privilege could copy the password and use it to spoof your login. Very bad. So your password is not stored.
What is done, instead, is to use the password string you have selected and use that string in a computation to produce a new string. That resulting string is all that is stored. Your password is never stored. When you enter your password, the computer keeps it in volatile memory only long enough to perform the computation. Then it discards it. The computer compares the result of the computation with the string it has stored and determines whether the results match.
But, you say, if I can discover the computation method, then I can just take the stored string and reverse the computation to obtain the user password. You could do that. If you could reverse the computation. What is used is a computation that is very difficult to reverse. Enter the concept of a hash function.
A hash function, in simplest terms, takes given value, a number or a string and scrambles it so badly the result is from all appearances completely random. In a useful implementation a hash code performs a “many to one” mapping. Here’s a short explanation of “many to one.” See the diagram:
The two ellipses represent mathematical “spaces.” The difference in size of the two ellipses represents that one space is larger than the other. The larger space contains more points than the smaller one. Now you want to establish a relationship between points in A to points in B. Since A has more points, multiple points in A need to map to the same point in B. This is a “many to one” mapping.
What can you do with this concept?
Suppose you want to capture the essence of a large document. You want this essence to be small and portable, but you still want it to be unique to this document. A hash code is one way to capture the essence of a large data set.
The MD5 message-digest algorithm is a widely used cryptographic hash function producing a 128-bit (16-byte) hash value, typically expressed in text format as a 32 digit hexadecimal number. MD5 has been utilized in a wide variety of cryptographic applications, and is also commonly used to verify data integrity.
MD5 was designed by Ron Rivest in 1991 to replace an earlier hash function, MD4. The source code in RFC 1321 contains a “by attribution” RSA license.
[Some links removed]
The reader is invited to visit the link to RFC 1321. This is where I obtained my original code for MD5 shortly after it came out. “RFC” stands for Request For Comment, and it is the means by which the Internet Engineering Task Force (IETF) reviews, accepts and publishes standards for the Internet.
If you want to use MD5 on your computer you can obtain the code from RFC 1321 and compile it on your computer. For this exercise I’m using a copy obtained with Cygwin, an application that allows users to run a UNIX-like environment on Windows computers.
Here’s a demonstration. I obtained an article published in the newsletter of the North Texas Skeptics in October 2001. The item is wordy, and I won’t reproduce all of it here. I will exhibit one paragraph. The article is about creationists’ use of phony academic credentials, and I discussed creationist Kent Hovind. Among other things I covered the misuse creationists often make of protein sequence comparisons between different biological species:
Furthermore, comparison of sequences for the different organisms show what should be expected from evolution. Although cytochrome c performs much the same function in the different organisms it shows these differences due to random DNA copying errors during reproduction. As long as the resulting protein performs a useful (and required) function in the descendent organism, the descendent will thrive and reproduce, and the error will be retained in the subsequent lineage. The further along the line of descent a particular organism is the more accumulated change there will be. If a lineage branches, as during the formation of a new species, the chain of differences will diverge, as well. The result is that the accumulated differences between two living organisms marks the amount of change since the two lineages diverged.
Ignore, for a moment, my misspelling of “descendant” in the article.To perform this demonstration, I captured the text of the article into a text file, and then I executed MD5 with the file as input. Here’s what I got:
$ md5sum message.txt
The first line is from my computer screen when I executed md5sum, the Cygwin MD5 implementation. The command line parameter is “message.txt,” the name of my text file. The second line is what md5sum printed out. It incorporated every character in the file, not just the text quoted above, and used them in the computation of a 32-digit hexadecimal value. It digested the entire file into 32 hexadecimal characters. And that’s how MD5 gets it’s name. “MD5” stands for “message digest 5.” This is the fifth version of Ron Rivest’s message digest utilities.
How is this useful?
Suppose somebody pays me to do some research and produce a lengthy report. I do all of this and produce a big write-up. I send my write-up to the person who paid me for the report. And I get a phone call. My client says he received the report, but some things do not look right. It’s possible somebody mucked with the report before passing it on to him. He may not have a legitimate copy.
Rather than having my client send me his copy of the report so I can go over it line by line, I ask instead that he just execute MD5 against his copy. This is assuming the original is an electronic file. My client does this and then reads back the 32-digit hash code to me over the phone. I compare what he tells me with my own copy of the hash code. It’s easy to do. There are only 32 digits to compare.
Let’s see what would happen if somebody did muck with the report. Here is another version of the previous quote. Only I have made a minor change. Can you find the change?
Furthermore, comparison of sequences for the different organisms show what should be expected from evolution. Although cytochrome c performs much the same function in the different organisms it shows these differences due to random DNA copying errors during reproduction. As long as the resulting protein performs a useful (and required) function in the descendent organism, the descendent will thrive and reproduce, and the error will be retained in the subsequent lineage. The further along the line of descent a particular organism is the more accumulated change there will be. If a lineage branches, as during the formation of a new species, the chain of differences will diverge as well. The result is that the accumulated differences between two living organisms marks the amount of change since the two lineages diverged.
I made the change in the message.txt file and ran md5sum against the changed file. Here is the result:
$ md5sum message.txt
For comparison, here are the two hash codes side by side:
A one-character change in the file made this difference in the hash code. Use of a hash function such as MD5 makes it very easy to detect minor differences between large documents. This works as well for binary images such as photographs.
I have a Canon digital camera, and it has a feature that embeds a hash code into each image. Canon keeps the details to itself, but the company provides a service, if needed, to validate images from their cameras.
Suppose I photograph a criminal act in progress, and later I bring my photo to court as evidence against the accused. The defense attorney challenges the validity of my photograph. I have Photoshopped his client’s license plate number onto the license plate of the getaway car in the photo. I say, here is the RAW image from the camera. Canon verifies this is indeed a RAW image from my particular camera, and furthermore it has not been altered. The criminal is off to the slammer, cursing hash codes all the way.
What makes it hard to defeat hash code verification? Two things. First the mapping is many to one. You could attempt to demonstrate the hash code in question actually represents a different data set. After all, this is a many to one mapping, so this hash code could have come from a different file. Hold that for a moment and go on to the next point. It is very difficult to back out a hash code computation. The emphasis is on the word “very.”
One way to back out the computation of a hash code is to find another data set that produces the same hash code. Good luck. A 32-digit hexadecimal number is a large value, and you really do not have time for the task of coming up with an alternative data set. The nature of hash code functions such as MD5 is that they offer no assistance to anybody attempting to come up with an alternative data set. There is no guidance in the search. The slightest change produces a complete scrambling of the hash code.
But back to the first point again. Suppose you did find (eureka!) an alternative data set. That data set would have to make sense. A nonsense strings of characters would not pass for the claim that the alternative text is valid.
My statement is a bit strong that backing out an MD5 calculation is problematic. But only a bit. That’s the conclusion of further research:
In 1996 a flaw was found in the design of MD5. While it was not deemed a fatal weakness at the time, cryptographers began recommending the use of other algorithms, such as SHA-1—which has since been found to be vulnerable as well. In 2004 it was shown that MD5 is not collision resistant. As such, MD5 is not suitable for applications like SSL certificates or digital signatures that rely on this property for digital security. Also in 2004 more serious flaws were discovered in MD5, making further use of the algorithm for security purposes questionable; specifically, a group of researchers described how to create a pair of files that share the same MD5checksum. Further advances were made in breaking MD5 in 2005, 2006, and 2007. In December 2008, a group of researchers used this technique to fake SSL certificate validity, and CMU Software Engineering Institute now says that MD5 “should be considered cryptographically broken and unsuitable for further use”, and most U.S. government applications now require the SHA-2 family of hash functions. In 2012, the Flame malware exploited the weaknesses in MD5 to fake a Microsoft digital signature.
[Some links removed]
All that aside, MD5 is useful for verification of data integrity internally. In a non-hostile environment, where people are not trying to crack your security, and you just want to verify you have the correct copy of your data set, then MD5 is careful, quick and kind. I have most recently used it to verify the software build I was submitting to automated test was the correct and uncorrupted version designated for the test. Try it. You’ll like it.