Microsoft Office has several available cryptographic options for encrypting office documents. In the 97-2003 doc format the RC4 stream cipher with 40bit key is used by default. The 40 bit key was chosen at a time when export restrictions on encryption technology existed. There have been publications exposing the weaknesses in the implementation beyond the 40bit key size, but the format is still the default in all office installation prior to 2007.
I recently had to read up on the implementation in Microsoft Office for word documents to solve part of the 2009 DC3 forensics challenge. I found the scattered nature of the documentation concerning it somewhat frustrating, and not as clear as it could be in demonstrating the process. I decided to document it here in a single location for posterity. This post will detail the verification process step by step with illustrations showing the process in pseudo code/math/memory. For the most part the images represent byte arrays manipulated with various functions, but should be fairly clear. They are SVG format images; if your browser does not support them you may view them in inkscape, but most modern browsers should handle them fine. This document does not cover the word document format itself or how to extract the encryption information – I’ll save that for another post later when I add some code for automatically extracting it.
The password verification process consists of five pieces of information.
- P = password [little endian unicode]
- S = Password Salt [16 bytes]
- V = Encrypted Verification Data [16 bytes]
- Vh = Encrypted Verification Data Hash [16 bytes]
- B = Block [always 0x00000000 for verification]
The S, V, and Vh are stored in the table field of the encrypted document. B, for verification, is always 0x00000000, and P is supplied by the user. The verification process has multiple steps each of which manipulate combinations of bytes derived from the above sources and the previous step to eventually generate a 128bit initialization key for the RC4 stream cipher.
Step 1. Accept input P from the user. We use P as input into the MD5 algorithm to generate a 128 bit (16 byte) hash H0 such that H0 = MD5(P).
Step 2. Take the first 5 bytes of H0 denoted as H0[0,4] and concatenate them with S to form a 21 byte field T such that T = H0[0,4] + S.
Step 3. Create a 336 byte intermediate buffer I. Fill I by concatenating T 16 times such that I = T + T + T … etc.
Step 4. Generate a 16 byte key H1 such that H1 = MD5(I)
Step 5. Take the first five byte of H1 denoted as H1[0,4] and generate the 16 byte key Hfinal such that Hfinal = MD5(H1[0,4] + B) where (H1[0,4] + B) is a 9 byte data sequence formed by concatenating the first five bytes of H1 with B.
Step 6. Initialize RC4 using Hfinal as the key.
Step 7. Decrypt V by taking XOR of each byte in V and the sequential bytes generated by RC4.
Step 8. DO NOT REINITIALIZE RC4. Decrypt Vh in the same manner V was decrypted.
Step 9. Test. If MD5(V) == Vh then P is valid, otherwise it is not.
If, after step 9, P is valid then the encryption key is Hfinal. For decrypting the document itself, each of the internal streams initializes RC4 using MD5(Hfinal + B) and decrypts 512 bytes of data where B is initially 0 and is incremented by 1 on every subsequent 512 byte block of the stream. RC4 is reinitialized every 512 bytes.
Common wisdom holds complex passwords (at least three different members of the four groups) of >= 8 characters are effectively secure against brute force attacks. The number of possible passwords for a given length and complexity consists of M^N where M is the number of possible characters allowed, and N is the length of the password. Calculating the number of possible passwords for up to a given character length is a recursive problem, but can be simplified by expressing it as a geometric series.
Recall, the sum of the first N terms of the geometric series of the form ar^k for 0,n-1 is a (1-r^n/1-r).
If we eliminate a by setting it to 1, set r to the number of possible characters, and let n = length + 1 we can easily calculate the number of possible passwords. For a password with up to 8 characters in length, and a character set of 26 we have:
1 ( 1-26^9/1-26)
-5429503678975 / -25
217,180,147,159 or 217 billion passwords.
lower alpha = 26 possible
UPPER ALPHA = 26 possible
numbers = 10 possible
Keyboard symbols = 32
For a 9 character length password with only lower case characters the number jumps to 5.6 trillion possible passwords. A single increase in character length from 8 to 9 increases the possible passwords 26 times because password growth is exponential.
The weakness of the verification process occurs in step 5. By only taking the first 5 bytes of H1 and padding it with 4 bytes of 0x0 to generate the initialization key for RC4 the length and complexity of the password becomes irrelevant and the possible keys is reduced to 40 bits or 2^40 possible keys. 2^40 only yields 1 trillion (1,099,511,627,776) keys making any password defeatable. Effectively this forces the possible key space into a size in between 8 and 9 character length passwords with only lowercase alpha characters. Note this allows the document to be decrypted, but does not recover the password itself.