Last week, we had a heated discussion at work about encryption. We want to encrypt some data in our database, and I proposed that we go with a single private-key encryption mechanism (ignore which exact one for the moment), while my colleagues were pretty much unanimously suggesting a ‘key per row’ approach. In this post I am going to attempt to explain the rough background, and why I felt their mechanism might not be best.
So my thought process is this: we need to encrypt a column of data, and we need to be able to decrypt it later because we have to have access to the real data value. With a single private key (assuming that it is stored securely somewhere) we encrypt the current data to a new column, and then, after suitable tests, drop the unencrypted column. As long as every area that needs access to the unencrypted data can call our code-base, which will decrypt appropriately, we will be fine.
My colleagues, in contrast, felt that having a single private key was an issue, and exposed the column to the risk of frequency analysis*. They proposed that a mechanism be developed to derive a key from some other field, perhaps mixed with some hidden value, so that the key was unique to a row, or perhaps to a small set of rows.
My argument with this is based on one point, and only one. Does Kerckhoffs’ principle apply? That is, if everything about the system, except the key, is public knowledge, would it still be secure? They would still be allowed to keep elements of their algorithm secret (just as I am planning to keep my key secret)… but what would happen if they published ‘how we make our key’? This sounds crazy, yet my understanding is that this makes for a good test: if telling people how you make your key does not compromise your security, you are safe!
So in the interests of disclosure: I plan to use a key generator to give me a private key n bits in length. Let’s imagine that we have an encryption algorithm that needs a 32-bit key. That is 2^32, or 4,294,967,296, possible key values.**
There were varied suggestions on how a key could be generated ‘per row’, so, purely for the purposes of demonstration, I will give an example of what could be done (with a rough sketch in code after the list):
- 2x ASCII characters from an identifier on the row (for the purposes of example I will assume these are digits only)…
- …plus ‘secret sauce’ to pad to length of the required 32-bits.
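To make that concrete, here is a minimal Python sketch of what such a derivation might look like. To be clear, every name and constant here is my own hypothetical illustration; nobody actually proposed this exact packing:

```python
# Hypothetical derivation: two ASCII digit characters from a row
# identifier, padded out to 32 bits with a fixed 16-bit 'secret sauce'.
SECRET_SAUCE = 0xBEEF  # the hidden padding value (purely illustrative)

def derive_row_key(row_id: str) -> int:
    """Build a 32-bit key: [ASCII digit][ASCII digit][secret sauce]."""
    hi, lo = ord(row_id[0]), ord(row_id[1])  # e.g. '2' -> 0x32, '3' -> 0x33
    return (hi << 24) | (lo << 16) | SECRET_SAUCE

print(f"{derive_row_key('23'):032b}")  # the bit pattern discussed below
```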
Now, where’s the harm in telling people how we have come up with these keys? It surely has to be harder to look for multiple keys than for a single one, right? Well, in my opinion: Wrong!
Here’s my key (where each bit is represented by a question-mark; the bars are simply to help distinguish bytes):
????????|????????|????????|????????
My colleagues are going to take two digits from some other field on the row (in this notional example), so their key will be:
nnnnnnnn|nnnnnnnn|????????|????????
(where each nnnnnnnn byte is an ASCII digit).
If we knew this, what would we know about the key? Well, those digits lie in the binary range 00110000 – 00111001 (ASCII ‘0’ to ‘9’). Suddenly, we know that all the keys look like this:
0011aaaa|0011aaaa|????????|????????
(where aaaa is a number between 0 and 9 in binary, not the 0 to 15 that four bits could otherwise hold).
If my maths is right, the number of possible values for each individual key is 2^16 * 10 * 10 (2 to the power of the number of completely unknown bits, times 10 for each aaaa, which has only 10 possible values), or 6,553,600. That is a dramatically smaller number of key values to consider; it is approximately 655 times less work to try every conceivable key value, compared to the fully-secret key.
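A quick sanity check of that arithmetic, in Python:

```python
full_key_space = 2 ** 32         # a fully secret 32-bit key
reduced = (2 ** 16) * 10 * 10    # 16 unknown bits, times 10x10 digit values

print(full_key_space)            # 4294967296
print(reduced)                   # 6553600
print(full_key_space / reduced)  # 655.36 -> roughly 655x less work
```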
However, bear in mind that this is what we can ascertain about the keys when we consider them as a set: “all the keys in use conform to this pattern”. For an individual row, they will have ‘published’ the mechanism used to derive those first two characters. So if I look at an individual row, and I know (or guess) the rules by which those characters are created, I will know the following about that individual key (say I know the digits will be ‘2’ (00110010) and ‘3’ (00110011) for a particular row):
00110010|00110011|????????|????????
Wow, so now the number of combinations for this particular key is only 2^16, or 65,536… which of course is also 65,536 times less work to go through and check every possible value compared to my totally secret key. In some bizarre world where we can only test 1 key value per second, this now takes 18 hours to solve, where my totally secret key takes approximately 1,193,046 hours (or about 136 years).
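To see how small that residual search really is, here is a toy demonstration. The XOR ‘cipher’ below is deliberately trivial, just a stand-in so the brute-force loop has something to test against; the point is purely the size of the loop, not the cipher:

```python
import struct

def toy_encrypt(key: int, data: bytes) -> bytes:
    # XOR with the 4-byte key, repeated. Deliberately weak; only here
    # so the candidate-testing loop below can actually run.
    ks = struct.pack(">I", key)
    return bytes(b ^ ks[i % 4] for i, b in enumerate(data))

crib = b"ACCT"                   # a known plaintext for this row
secret_key = 0x3233BEEF          # unknown to the attacker
ciphertext = toy_encrypt(secret_key, crib)

# The attacker knows the first two key bytes: '2' (0x32) and '3' (0x33).
prefix = (0x32 << 24) | (0x33 << 16)
for suffix in range(2 ** 16):    # only 65,536 candidates to try
    candidate = prefix | suffix
    if toy_encrypt(candidate, crib) == ciphertext:  # XOR: enc == dec
        print(f"recovered key: {candidate:#010x}")
        break
```

On real hardware this loop finishes in milliseconds, not 18 hours.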
“But Wait!” you exclaim! After that 18 hours of key testing, you have only figured out how to decipher one row! That is correct inasmuch as I do indeed only have the specific key for that row, but now I know this about every key:
0011aaaa|0011aaaa|kkkkkkkk|kkkkkkkk
(where kkkkkkkk is the now-known secret key value).
So – I definitely know 75% of the key for every row, and I also know how to derive those aaaa‘s from other data on the row. It is now trivial to decrypt every other row.
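Continuing the toy example from above: once the 16-bit suffix has been recovered from that one row, an attacker who knows (or has guessed) the derivation rule can compute every other row’s key directly, with no further searching at all:

```python
RECOVERED_SUFFIX = 0xBEEF  # learned from brute-forcing a single row

def attacker_key(row_digits: str) -> int:
    # The same rule the defenders use, now fully known to the attacker.
    return (ord(row_digits[0]) << 24) | (ord(row_digits[1]) << 16) | RECOVERED_SUFFIX

for digits in ("23", "58", "90"):  # one cheap computation per row
    print(f"row {digits}: key {attacker_key(digits):#010x}")
```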
Bear in mind, too, that even if the precise method is not published but something about it is known, the cryptanalyst will look for other clues. So, imagine they know that the first two bytes of the key are taken from a customerId column; the nasty hacker is simply going to start on customers with many repeated digits in their identifier, e.g. 5555 or 6666 (it won’t matter whether you take the 1st and 2nd or the 1st and 3rd digits from the ID; the resulting key will be the same). Perhaps the digits are compressed into 4-bit chunks, or maybe they are XORed with a secret value that is the full 32 bits long. Whichever way: are you certain that your proposed mechanism does not open attack vectors for your encrypted data?
But You Don’t Have to Publish!
The natural, and almost unavoidable, response to this line of thinking is to say: “But we don’t have to publish how we make the key!”, and that is true. As I have stated, though, this whole post works on the premise that it would be ideal if we could publish details of how we make the key and still have a secure system, and I hope I have demonstrated that, in this case at least, we could not.
Nevertheless, if you use .NET then by default you are not far off publishing your code anyway. Have you never heard of Reflector? Of course, there are ways to mitigate your code being readable (notably code obfuscation), but they are extra steps that many will not take as a rule.
Summary
I know I am not a security expert, so I hope that nothing I have said here is taken too much at face value. What I do believe is that others have come up with ways to encrypt data that are practically unbreakable, but if you are not careful you can introduce loopholes and short-cuts to finding the value of that key… and every bit of your key that someone can discount, because they know it or can calculate it from some other value, halves the time needed to test the remaining combinations.
Finally, although I am relatively wary of writing anything here that might be seen as a recommendation, I believe that there may be ways to have a secure ‘key per row’. You would have to do something like encrypt the customerId or similar data (using a single secret key), and then use that result as the encryption key for another encryption cycle on the other data column (something like the sketch below)! That of course means you would have to do an encryption (to derive the row-key from the customerId) followed by a decryption (to extract the data you need) every time you needed it.
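For what it is worth, here is a minimal sketch of that shape of design, assuming Python and the third-party cryptography package. Note one substitution: rather than literally encrypting the customerId, I derive the per-row key with an HMAC of the id under the master key, which is a more conventional way to get the same effect (a per-row key that only the holder of the master secret can reproduce). All the names here are illustrative:

```python
import hmac, hashlib, os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

MASTER_KEY = os.urandom(32)  # in reality: loaded from secure key storage

def row_key(customer_id: str) -> bytes:
    # Per-row key = HMAC-SHA256(master_key, customer_id): 32 bytes,
    # suitable for AES-256, and unguessable without the master key.
    return hmac.new(MASTER_KEY, customer_id.encode(), hashlib.sha256).digest()

def encrypt_field(customer_id: str, plaintext: str) -> bytes:
    nonce = os.urandom(12)  # AES-GCM wants a fresh 96-bit nonce per message
    ct = AESGCM(row_key(customer_id)).encrypt(nonce, plaintext.encode(), None)
    return nonce + ct

def decrypt_field(customer_id: str, blob: bytes) -> str:
    nonce, ct = blob[:12], blob[12:]
    return AESGCM(row_key(customer_id)).decrypt(nonce, ct, None).decode()

token = encrypt_field("100023", "account data to protect")
print(decrypt_field("100023", token))
```

Note that this version passes the Kerckhoffs test from earlier: you could publish every line of it, and the security would still rest entirely in the master key.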
Notes
* Simple frequency analysis would certainly be possible if we used an ultra-basic encryption algorithm, but my understanding is that modern techniques slice-and-dice the data in such a way that there are essentially no ‘characters’ left to do analysis on; i.e. the slicing-and-dicing is done in different ways depending on the key. In truth, of course, I am sure there are still some attack vectors.
** Please bear in mind, for the rest of this example, that although modern cryptography uses substantially more bits for the key, it is also human nature to push the other way and want ‘more keys’. So, in this example, there are 100 possible keys in total, compared to my single key. However, once on the track of ‘more keys is better’, it is easy to see how someone might think it would surely be better to have 1,000 keys, or 10,000… and surely it wouldn’t hurt to have the whole customer identifier in the key?! No! Hopefully this example, although contrived, demonstrates that the best intentions do not necessarily result in a more secure system, and can indeed seriously compromise the system you do set up.
One algorithm, salt per row.
2 stones and a bird, or something like that.