XOR'd play: Normalized Hamming Distance

I've been playing around with the matasano crypto challenges for my own edification. Let me say up front, I'm a noob when it comes to crypto. I've used gpg, pgp, OpenSSL, etc. as a consumer of crypto products for a long time, but I've never really peeled back the onion and honestly, I'm not deep enough into Matasano's challenges to really have increased my understanding of modern crypto much at all.

My point is, I'm learning and I may say some things here out of ignorance. Take that as a disclaimer.

Also, this post is going to be ridiculously long. TLDR; In my testing, normalized Hamming Distance alone accurately returned the correct repeating XOR key size in about 40% of cases, but there's a simple calculation you can add that can increase this to over 90%.

In the challenges, Matasano introduces Hamming Distance, something I'd read about previously, but only in passing. Hamming Distance is the measure of difference between two strings. For example, the Hamming Distance between "fuse" and "fuel" is two because two characters would have to be changed to make the two strings the same.

For the crypto challenges, Hamming Distance is calculated at the bit level rather than at the character level as above, so to get the Hamming Distance bit-by-bit, we convert the strings to a byte arrays, then convert each byte to bits, then count the number of differences:

PS> GetByteArray "fuse" | ForEach-Object { GetBits $_ }
PS> GetByteArray "fuel" | ForEach-Object { GetBits $_ }

Obviously there's no difference in the bits for the first two bytes. Let's stack the bit strings for the last two bytes one after the other, it'll make counting the differences easier. I've highlighted the columns with differences:

01110011 s  01100101 e
01100101 e  01101100 l

In total then, there are five different bits between the two strings. The Hamming Distance then is five. Perceptive readers may notice that if the bytes above are XOR'd against one another, the result is that five bits are set to 1:

    01110011 s      01100101 e
XOR 01100101 e  XOR 01101100 l
--------------  --------------
    00010110        00001001
So we can use XOR for each pair of bytes, count the 1 bits and the result is the Hamming Distance. Incidentally, Hamming Distance, when calculated at the bit level as above, is equivalent to calculating the Index of Coincidence, which is useful to cryptographers attempting to crack polyalphabetic substitution ciphers such as Vigenère cipher or repeating XOR key encryption, which is what the initial set of challenges are built around.

Repeating XOR key encryption is a simple encryption scheme and not one that is terribly secure. Here's an example of how it works. Let's take our "fuse fuel" and XOR it with the key "few."

Converting our strings to bytes:

102:117:115:101:032:102:117:101:108  f:u:s:e: :f:u:e:l
102:101:119  f:e:w

In repeating XOR, since our key is shorter than our plaintext, we repeat the key, so the bytes line up as follows:

102:117:115:101:032:102:117:101:108  f:u:s:e: :f:u:e:l
102:101:119:102:101:119:102:101:119  f:e:w:f:e:w:f:e:w

Now we XOR the values together and the result in bytes is:


Notice wherever the same bytes lined up with each other above, they cancel each other out, this is also true for the bits, to see this in action, consider the second byte pair:

117 in bits 01110101         01110101
101 in bits 01100101  XOR'd: 01100101
                             00010000 in decimal 016 as above.

The challenge says that Normalized Hamming Distance can be used to calculate probable XOR key size. An algorithm is given, but there's a little ambiguity in the directions, so you still have to figure out the details. Essentially, you take the first n bytes of the ciphertext and calculate the Hamming Distance of those bytes against the next n bytes and divide the result by the key size. Whichever size yields the smallest Normalized Hamming Distance is probably the key size.

Let's walk through this, but for the sake of demonstration, we'll modify our plaintext to the following:

"fuse fuel for falling flocks"

Yes, it's a silly string, but I think it will make explaining things easier due to it's length. XOR encrypted with our key, we get the following byte array:


If we apply Matasano's algorithm, it's not actually their algorithm, but I'm calling it that here because lazy, we read a couple bytes of our ciphertext, then the next two bytes and get the Hamming Distance. We can repeat this up to the end of the ciphertext and average all the values together, divide by the number of bytes (two initially), that's the "normalization" part. We repeat this for three bytes, then four, then five and so on, up to 14 bytes for this string, since it's length is 28 bytes and we need to difference two strings of bytes.

Let's look at a few examples, I'm going to refer the Hamming Distance as HD, Average Hamming Distance as AvgHD and Normalized Average Hamming Distance as NAvgHD. Let's do some math:

   000:016    004:003    069:017    019:000    027:070    003:024
HD 004:003 HD 069:017 HD 019:000 HD 027:070 HD 003:024 HD 020:069
---------- ---------- ---------- ---------- ---------- ----------
         4          4          6          4          7          9

If you doubt any of the above is correct, convert the bytes to bits, XOR them against each other and count the 1s:

    00000000:00010000 (000:016)
XOR 00000100:00000011 (004:003)
    00000100:00010011 (four ones) for an HD of 4

If we keep going like this across the entire 28 byte string, we'll calculate the HD for 14 pairs of bytes, sum up the 14 values and get an AvgHD of 5.61538461538462, divide that by two, because we were using two byte pairs and you get a NAvgHD of 2.80769230769231.

Now we repeat this, but we do it for byte pairs of three bytes each:

   000:016:004    003:069:017    019:000:027    003:024:020
HD 003:069:017 HD 019:000:027 HD 003:024:020 HD 069:017:007
-------------- -------------- -------------- --------------
             9              6              7              8

Again all we're doing is XORing together the corresponding pairs, then counting the 1s in the results. If we do this for the first 13 three byte pairs, add up the results and divide by the number of triplet pairs we'll get an AvgHD of 7.625, which is actually slightly higher than the AvgHD we had with our two byte pairs, but we're interested in he NAvgHD, so we divide that number by three, the length of our byte pairs and we get a NAvgHD of 2.54166666666667, which is lower than our NAvgHD for a two byte key.

We can repeat this process for byte pairs up to a length of 14 bytes, the first half of the ciphertext against the second half. The results look like this:

CalcKeySize            AvgHD           NAvgHD
-----------            -----           ------
          2 5.76923076923077 2.88461538461538
          3            7.625 2.54166666666667
          4 12.3333333333333 3.08333333333333
          5            13.25             2.65
          6 14.6666666666667 2.44444444444444
          7 19.6666666666667 2.80952380952381
          8             22.5           2.8125
          9               20 2.22222222222222
         10               26              2.6
         11               24 2.18181818181818
         12               26 2.16666666666667
         13               34 2.61538461538462
         14               38 2.71428571428571

Sorted by NAvgHD, because the byte pair size with the smallest NAvgHD is probably our key size shows:

CalcKeySize            AvgHD           NAvgHD
-----------            -----           ------
         12               26 2.16666666666667
         11               24 2.18181818181818
          9               20 2.22222222222222
          6 14.6666666666667 2.44444444444444
          3            7.625 2.54166666666667
         10               26              2.6
         13               34 2.61538461538462
          5            13.25             2.65
         14               38 2.71428571428571
          7 19.6666666666667 2.80952380952381
          8             22.5           2.8125
          2 5.76923076923077 2.88461538461538
          4 12.3333333333333 3.08333333333333

But our key was "few," that's three bytes, why isn't it at the top of the list? Matasano's instructions do state the value with the lowest NAvgHD is probably the key size, they don't say it will be the key size. In their example, the value with the lowest NAvgHD is the key size, but in my testing, the NAvgHD is not always the key size, as you can clearly see above.

Let's dive into this a bit more.

The HDs for byte pairs of nine bytes:
HD 070:003:024:020:069:017:007:009:027

HD 015:011:016:070:003:027:009:006:028

 AvgHD: (17 + 23) / 2 = 20
NAvgHD:        20 / 9 =  2.22222222222222

If you compare the AvgHD for smaller key sizes with the AvgHD for the larger key sizes, you'll see the smaller ones have lower AvgHDs, this makes sense if you think about the process of comparing sets of larger numbers with sets of smaller numbers, you're more likely to have larger differences as the upper bound on your numbers increases -- consider the possible differences between numbers 01 and 09 and between 001 and 999.

What really throws things off is the normalization, when we divide AvgHD by the key size, but you may have noticed that most of the values with lower NAvgHDs are multiples of the actual key size. In my testing across 20+ different plaintext samples, this trend appears again and again when calculating Hamming Distance to find the repeating XOR key size.

Matasano's instructions say, "the KEYSIZE with the smallest normalized edit distance is probably the key." Edit distance is another way of saying Hamming Distance, apparently. They go on to say, "You could proceed perhaps with the smallest 2-3 KEYSIZE values. Or take 4 KEYSIZE blocks instead of 2 and average the distances."

In their example, the key size with the smallest normalized HD is in fact the key, I thought maybe that was because the maximum key size they ask you to try is 40 bytes and the correct key size is not something that divides evenly into 40, but even increasing the maximum key size to a couple multiples over the correct key size does not cause a larger key size to rise to the top, but again, in my tests over 20+ different sample plaintexts encrypted thousands of times with random keys from 2 to 100 bytes, that's the exception.

For my testing, admittedly limited given the infinite number of possible plaintext messages, I took 20 different samples of text from various sources, blogs, books, both technical and fiction -- most ranging from a paragraph to a page in length, sorry I won't be more precise, again, lazy. I ran these texts through a script that would generate a random key from 2 to 100 bytes in length, random bytes in the range 0x00 through 0xFF, the encrypted text was then passed through the HD algorithm as above and since I knew the actual key size, I set an upper bound on the key sizes I would try to be three times the actual key size. I selected the top five values in terms of lowest NAvgHD for each ciphertext. I ran this test 7758 times.

In 38% of the cases, the key size with the lowest NAvgHD was the correct key size.
In 33% of the cases, the key size with the second lowest NAvgHD was the correct key size.
In 28% of the cases, the key size with the third lowest NAvgHD was the correct key size.

During my experimentation, the correct key size landed outside of the top three values a few times, but very rarely and I was still tweaking the algorithm to get it correct as I made a fence post error that took me some time to spot and resolve. I'm not counting those values in my final tally. The numbers above are from the last tests I ran after getting all bugs worked out of the algorithm, but your mileage may vary.

Still, I found the above numbers discouraging. If I'm going to write a tool for brute force decrypting XOR encrypted data and the first step in the process is to get the correct key size, I want the algorithm to be more reliable than ~40%. I could do what Matasano suggests and simply try the first three values, but I suspected work could be done to improve this.

I believe I was right, but again, my sample size is small, but the results across that sample size are promising and the solution was simple, so simple that you've probably already thought of it. I'm sure many others have, but I found no references to it, possibly because search engines fail, or I fail at using them, or because my solution is bogus and just happened to work for my test cases and methodology.

Here's what I did, I saved the top five results -- the five most probable XOR key sizes based on NAvgHD -- then calculated the greatest common denominator (GCD) for the first two key sizes, then the first and third key sizes, then the second and third key sizes. The results were assigned to variables, GCD12, GCD13 and GCD23.

 If the following conditions are met, GCD12 is returned as the most probable key size: 

  1. GCD12 is not one and
  2. GCD12 is a key size in the top n key sizes and
    1. GCD12 is equal to GCD13 and GCD23 or
    2. GCD12 is equal to either the first or second most probable key size
    3. If the above conditions are not met, the script returns the key size with the smallest NAvgHD as the most probable key size.

How well does this algorithm work? In the testing above, all 7758 cases, I applied this algorithm and the correct key size was returned as the most probable key size 100% of the time.

Recall that in my test script, I'm calculating HDs for all key sizes up to three times the size of the actual key. This does impact the outcome of the script. As a test of this, I changed the script to calculate HDs for all combinations up to four times the actual key size. The results came out as follows across 2574 test runs:

The value with the best NAvgHD was the correct key size 893 times or 34% of the time.
The value with the second best NAvgHD was the correct key size 714 times or 27% of the time.
The value with the third best NAvgHD was the correct key size 537 times or 20% of the time and the value with the fourth best NAvgHD was the correct key size 430 times or 16% of the time.

All percentages have been rounded down, stop nitpicking my numbers.

As you can see increasing the trial key sizes by one more multiple of the actual key size causes the HD algorithm to be come less accurate and unfortunately in real-world cases, we have no knowledge of the actual key size.

But how did my algorithm enhancements fair in this test run? Out of the 2574 test runs, my script selected the correct key size 2108 times or 81% of the time, not bad, but I believed I could do better. I also wanted to add more sample texts. So I did.

I downloaded a bunch of English language books from http://www.gutenberg.org/ and refactored my plaintext function to return random samples from these texts. I did go through each of the files and remove the Project Gutenberg licensing information. I also removed some books that were framed in boxes made up of ASCII characters like pipes, dashes and underlines. I initially ran with these books in, but my numbers were skewed, so I investigated and took them out. I also removed duplicate texts and I only left in a few of the books from the Bible as I thought they may have an undo influence on my results. In the end, I had 80 books total, comprised of 353422 non-blank lines, 3646553 words and 20415200 characters.

The average input to my XOR encryption was 196 words comprised of 1160 characters.

For expediency, I reduced the maximum encryption key size to 40 bytes on my initial run. The minimum key size was 2 bytes. The key sizes to be tried was set to 4 times the actual key size and I opted to save the top six most probable key sizes for further review.

I did make a significant change to the script. In the previous testing, I'd calculated the GCDs for the first three most probable key sizes. I modified the script to calculate the GCDs for every combination of the top n most probable keys. Since I was extracting the top six keys in this run, I was calculating the GCDs for those six key sizes in every possible combination. I also added a parameter for MaxNAvgHD -- the highest NAvgHD I would accept for the probable key size in cases where it could not be clearly determined, the default is 3.5.

For evaluating the top six (in this case) most probable keys, I put the data through these steps:

  1. Calculate the GCDs across every pair of values in the top n values and add them to a hashtable with a count of frequency of occurrence. Each time a given GCD appears, increment its counter in the hashtable.
  2. Get the most frequently occurring GCD (MFGCD) from the hashtable.
  3. If the MFGCD is in the list of the top n most probable key sizes and its NAvgHD is less than the MaxNAvgHD, then return MFGCD as the most Probable Key Size.
 If the conditions above are met, the likelihood that MFGCD is the correct key size is high. I haven't done the analysis to see what percentage of cases this is true for, but that analysis could be done. If those conditions are not met, do the following:

  1. Take the smaller of the first two most probable key sizes as a possible key size.
  2. Take the most probable key size with the smallest NAvgHD as a possible key size.
  3. If the values from 1 and 2 are the same, return that value as the most probable key size.
  4. If the values from 1 and 2 are not the same, if the first value has a NAvgHD below the MaxNAvgHD, return it as the most probable key size, else return value 2.
 This last set of conditions gives more weight to the smaller of the top two values in the list of probable key sizes, but there's no guarantee that the smaller value is more likely to be the correct key size. However, the next step after attempting to calculate the correct key size is to break the encryption via brute force and it's going to be less expensive to try smaller keys than larger ones, so favoring the smaller key size when you're less certain as to the correctness of either key seems like the right thing to do.

Given these changes in my script, what was the outcome? In this test run there were 3897 samples encrypted with random keys from 2 to 40 bytes.

The script returned the correct key size as the most probable key size 97% of the time.
The key size with the lowest NAvgHD was the correct key size 29% of the time.

Out of the 114 cases where the returned probable key size was not the actual key size, there were 79 cases where the top six most probable key sizes didn't even contain the correct key size. I found this to be the most troubling outcome of my testing, but diving into the data deeper, there was an explanation for 97% of these failures.

In 77 of these 79 cases, the key was equal to or longer than half the length of the plaintext, meaning the encryption key was not a repeating XOR key at all, some portion of the key may have repeated, but not the entire key. In fact, in 16 cases, the key was longer than the plaintext entirely, meaning the encryption key was effectively a one-time pad.

In the other two cases, I'm not sure what went wrong. In one, the key size was 14 bytes and the plaintext length was 35 bytes, the value with the lowest NAvgHD was seven and that was the most probable key size as returned by the script. It is a factor of 14, but that's little consolation. According to the output, 12, 11 and 17 all had lower NAvgHDs than 14. In the other failed case, the actual key size was 2 bytes and the plaintext was 2457 bytes in length. The script showed that a key size of three bytes had the lowest NAvgHD with 4, 7, 8, 6 and 5 rounding out the list. Clearly the most common GCD here would be two, but alas two was not in the list of most probable keys and so could not be returned as the most probable key size.

What of the other 35 cases where the correct key size was in the list of top six probable key sizes, but was not the most probable key size returned by the script? The results came out like this:

ActualKeySize ProbableKeySize Top6KeySizes
------------- --------------- ------------
18            4               18:4:17:22:14:32
34            31              34:31:35:30:29:6
37            35              37:35:39:17:24:22
8             6               8:6:4:3:7:5
24            14              24:14:22:10:12:18
40            28              40:28:32:35:14:56
18            6               18:6:14:22:17:27
25            10              50:25:40:11:10:29
29            25              29:25:33:8:30:12
22            19              22:19:32:36:33:12
18            9               36:18:9:27:3:45
37            18              37:18:33:39:35:4
19            15              19:15:23:32:4:28
28            25              28:25:35:19:33:17
35            25              35:25:33:15:11:31
12            6               24:12:6:30:18:25
28            25              28:25:9:32:34:17
15            6               15:6:12:18:13:16
32            23              32:23:9:54:27:41
24            21              24:21:27:35:26:29
28            18              28:18:10:19:12:21
6             3               6:3:9:8:4:5
25            11              25:11:36:16:14:9
18            15              18:15:28:25:24:33
26            15              26:15:11:8:19:16
28            22              28:22:12:7:25:13
21            10              21:10:23:31:27:16
23            16              23:16:29:31:24:15
24            22              24:22:14:11:10:30
21            6               42:21:48:55:6:15
15            6               15:6:18:9:12:17
29            28              29:28:12:14:19:23
15            12              15:12:10:4:11:2
21            3               21:3:18:9:24:6
12            4               12:24:4:8:16:20

Again, if the scripts most frequently occurring GCD calculations don't point out a clear winner, it puts more weight on smaller probable key sizes that having NAvgHD value below a user supplied threshold with 3.5 being the default value. This could probably use refining, but again, the script has a much higher probability of returning the correct key size (around 97% in my testing) than NAvgHD alone (around 40% or less in my testing).

It would be easy to have the script return multiple ProbableKeySizes for cases where it can't clearly pick a winner and have that list sorted by key size in ascending order before entering the brute-force decryption phase of its work.

As I said at the start of this post, I'm a crypto-noob and that's an insult to noobs everywhere. My methods may be fraught with errors and there are lots of smart people in the world who may skim this and laugh, but I wanted to share my analysis and the results because I found it interesting and even fun.

My code for this is written in PowerShell and is available at https://github.com/davehull/MCC/blob/master/sets/1/XOR-Test.ps1. For testing, I do something like:

1..100 | Foreach-Object { .\XOR-Test.ps1 -MinKeySize 2 -MaxKeySize 60 -top 5 -MaxNAvgHD 3.4 | Export-Csv -NoTypeInformation .\output_$_.csv }

The above will run the script 99 times. It will select some portion of text from text files in .\texts\, sold separately. That text will be encrypted with a random byte key of two bytes on the first pass, that ciphertext string will be passed to the code that attempts to calculate the key size, then the next iteration begins, selecting a new text value, picking a random three byte key and so on, saving the output to CSV files that you can analyze later using Excel or PowerShell.

Update: Running the same analysis again for 60 byte max key size, discarding the cases where the actual key size could not repeat because it was more than half the size of the plaintext, the script returned the correct key size as the most probable key size 99.5% of the time. By comparison, the key size with lowest NAvgHD was the correct key size 31% of the time.



Popular posts from this blog

Meterpreter's Migrate: Detection and Investigation with memtriage and memdumppe

Cracking repeating XOR key crypto