
letter frequency java

https://www.devze.com 2023-04-03 13:18 (source: web)

I am reading the book "Cryptography and Network Security" and I have been trying to write a program that performs a letter frequency attack on a mono-alphabetic cipher. The program needs to produce, say, the top 10 possible plaintexts.

I am a little stuck on how this could work. Am I right in thinking that it's not always the case that the "possible" plaintexts produced will actually match the original plaintext?

It would be great if anyone could provide some guidance on how the program should flow.

So far I have code that:

- Reads a file of ciphertext.
- Scans the ciphertext and produces a HashMap of the letters matched to their frequency percentage.
- Stores the relative letter frequencies of the English language in a 2D array.

My next step was to try to sort the array in order of the nearest match to each character's percentage. Is this going in the right direction?
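The counting step described above can be sketched roughly as follows. Class and method names here are just illustrative, and the sample ciphertext in `main` is made up:

```java
import java.util.*;

public class FreqCount {
    // Count each letter A-Z in the ciphertext and return its percentage frequency.
    static Map<Character, Double> letterFrequencies(String ciphertext) {
        int[] counts = new int[26];
        int total = 0;
        for (char c : ciphertext.toUpperCase().toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                counts[c - 'A']++;
                total++;
            }
        }
        Map<Character, Double> freq = new HashMap<>();
        for (int i = 0; i < 26; i++) {
            freq.put((char) ('A' + i), total == 0 ? 0.0 : 100.0 * counts[i] / total);
        }
        return freq;
    }

    // Return the cipher letters sorted from most to least frequent,
    // ready to be lined up against E, T, A, O, ... for a first guess.
    static List<Character> byFrequency(Map<Character, Double> freq) {
        List<Character> letters = new ArrayList<>(freq.keySet());
        letters.sort((a, b) -> Double.compare(freq.get(b), freq.get(a)));
        return letters;
    }

    public static void main(String[] args) {
        Map<Character, Double> f = letterFrequencies("XLIWI EVI XLI ZSAIPW");
        System.out.println(byFrequency(f).get(0)); // most frequent cipher letter (here 'I')
    }
}
```

Sorting the letters rather than the raw percentages makes the "nearest match" step easier: position i in this list is your first-guess candidate for position i in the English frequency order.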

Any suggestions would be great!


I'm no expert in cryptography, but I think you are way oversimplifying. Yes, a useful tool for cryptographers is a table of relative frequencies of letters. But the probability that any given document will exactly match the overall frequencies is, I would think, pretty small. Like, in English the most frequent letters are, as I recall, E, T, A, O, N, R, I, S, H. Suppose in your enciphered text you find the 9 most frequent letters are A, B, C, D, E, F, G, H, and I. Does it automatically follow that A MUST map to E, B to T, C to A, etc? Surely not. Suppose this particular document is about installing a Xerox printer. Frequent occurrences of the word "Xerox" will likely make X come much higher than it would in an average document. Suppose this is the only unusual frequency, so now your most frequent letters are, say, E, T, X, A, O, N, R, I, and S. Assuming that A maps to E and B to T still works. But with X stuck in the middle of the sequence, from there on ALL the assumed mappings will be wrong.

I think the way you actually work to break a simple substitution cipher like this is to try one or two letters, then examine the results and see which are plausible. You also look for other clues, like letters that regularly occur together, or letters that usually come at the beginning or end of a word (assuming that the enciphered text preserves word breaks).

As a learning programming exercise this might be amusing. But as a serious cipher-breaking program ... it's not that simple.

Very Late Afterthought

It strikes me that this is an interesting example of the problems with AI. Sure, a computer could easily count letter frequencies and make a first guess about mapping. And a computer could easily compare the results to a dictionary and see how many real words show up.

But how would you distinguish good hits from false hits? To take a simple example, if I was doing this manually and after a first cut at the mapping I saw many occurrences of the word "toe", well it's possible that this document is talking about feet, but maybe the letter I mapped to "o" should really be "h" and this word is "the".

Or, I recall reading years ago of a cipher message intercepted during the American Civil War, where for some bizarre reason the person encrypting the message left the words "reaches you" unencrypted. The people who intercepted it thought the words before that were likely "before this" or "by the time this". It turned out it was "before this", and that vital clue helped break the cipher.

A human cryptographer can often make good guesses based on intuition and context. If I'm reading a coded message about financial transactions and come across "_ank", I'd guess "bank". If the message is about military maneuvers, my first guess would be "tank". But if in the financial message the preceding words are "this stock will", then "tank" is more likely. And in the military message, "cross the river" would make it more likely to be "bank". Etc etc.

Programming a computer to think of all the things that a human would think of is very difficult. I read a book on AI years ago in which the writer said that the technical problems have mostly been solved, modern AI developers know how to program a computer to reproduce human thinking, except for the small problem that "the human experts we consult are often unable to express how they work in terms that can be programmed into a computer". I just laughed. We solved the problem. The only catch is that our solution doesn't actually work. For which we blame other people.


Theoretically you might get multiple possible valid English (?) outputs, but if your input text is non-trivial, there is almost certainly only one output that consists mainly of English words.

You could start with the most probable mapping, then count the number of English words that the mapping produces by comparing the words in its output to a dictionary of English words. If the mapping yields few English words, try the next most probable mapping, and so on.

Using an English dictionary as a control gives your algorithm a means to know that it is done.

You can improve the efficiency of the algorithm by using explicit knowledge of the language. For example, in English there are only two 1-letter words (I, a) and a small set of two-letter words. If the input text contains one or more short words, you can use them to either include or exclude possible mappings.
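A rough sketch of the dictionary-scoring idea above, with a tiny hard-coded word set standing in for a real dictionary file (all names here are placeholders, not anything from the original question):

```java
import java.util.*;

public class DictScore {
    // Placeholder dictionary; a real attack would load a full English word list.
    static final Set<String> WORDS = new HashSet<>(
            Arrays.asList("the", "and", "attack", "at", "dawn", "we"));

    // Apply a 26-letter substitution key (key[0] is the plaintext for cipher 'a').
    static String apply(String ciphertext, char[] key) {
        StringBuilder sb = new StringBuilder();
        for (char c : ciphertext.toLowerCase().toCharArray()) {
            sb.append(c >= 'a' && c <= 'z' ? key[c - 'a'] : c);
        }
        return sb.toString();
    }

    // Score a candidate plaintext by the fraction of its words found in the dictionary.
    static double score(String candidate) {
        String[] tokens = candidate.toLowerCase().split("[^a-z]+");
        int hits = 0, total = 0;
        for (String t : tokens) {
            if (t.isEmpty()) continue;
            total++;
            if (WORDS.contains(t)) hits++;
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) {
        System.out.println(score("attack at dawn")); // every word is in the dictionary
        System.out.println(score("xqzf at dawn"));   // one word is not
    }
}
```

A score near 1.0 is the "done" signal the answer describes; a low score means the candidate mapping should be discarded and the next one tried.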


If it is mono-alphabetic, you could use brute force to rotate through the possible combinations. Since you are doing it as a learning exercise, I will try to help with an approach.

IIRC the two most frequently found letters in the English language are E and T (that could be wrong). So what you want to do is take, say, the top 5 most frequent characters in English (once again, an assumption here that the plaintext is English) and assign a weighted value to each of them. You can then record the frequency of each character A-Z in the ciphertext and compare those frequencies against the top 5 characters and their weighted values. Once you have that much information, it is fairly straightforward to break the remainder of the ciphertext.
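A first cut of that top-5 comparison might look like the sketch below. The English letter order and the sample text are assumptions, and a real attack would refine these guesses against context rather than trusting them outright:

```java
import java.util.*;

public class TopFive {
    // Top five English letters by frequency (approximate; exact order varies by corpus).
    static final char[] ENGLISH_TOP = {'E', 'T', 'A', 'O', 'N'};

    // Guess a partial mapping: the five most frequent cipher letters are
    // tentatively assumed to stand for E, T, A, O, N in that order.
    static Map<Character, Character> guessTopFive(String ciphertext) {
        int[] counts = new int[26];
        for (char c : ciphertext.toUpperCase().toCharArray()) {
            if (c >= 'A' && c <= 'Z') counts[c - 'A']++;
        }
        Integer[] idx = new Integer[26];
        for (int i = 0; i < 26; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> counts[b] - counts[a]); // most frequent first
        Map<Character, Character> mapping = new LinkedHashMap<>();
        for (int i = 0; i < ENGLISH_TOP.length; i++) {
            mapping.put((char) ('A' + idx[i]), ENGLISH_TOP[i]);
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Made-up ciphertext where H dominates, so H gets the first guess: E.
        System.out.println(guessTopFive("HHHHH WWWW DDD OO Q"));
    }
}
```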

Further reading: attacking ciphers


Assuming this is not simply a shift cipher (in which case a brute force method on, say, the first 10 characters could quickly reveal the key and allow you to crack the cipher), your best bet is to first use the frequency analysis to guess the three most common letters (E, T, A in standard English). You can then combine this with a further frequency analysis of the most common pairs or triplets of characters. In particular, if you have correctly identified 'T' and 'E', then a regular occurrence of TXE suggests X might be H.

Programming this all automatically would be quite tricky, and a manual approach might be best. Alternatively, a brute force approach on the first 6-10 characters to identify any sensible word from a dictionary would work. You could reduce the computations required by ignoring cipher alphabets whose probability, given the ciphertext, falls below a certain threshold.
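For the shift-cipher case mentioned at the start of this answer, the brute-force check is short enough to sketch in full: each of the 26 possible shifts is tried, and a dictionary check (or a human) picks the readable candidate. The sample ciphertext is made up for illustration:

```java
public class ShiftBrute {
    // Decrypt with a given shift (Caesar): cipher letter minus shift, mod 26.
    static String decrypt(String ciphertext, int shift) {
        StringBuilder sb = new StringBuilder();
        for (char c : ciphertext.toUpperCase().toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                sb.append((char) ('A' + (c - 'A' - shift + 26) % 26));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Try every possible shift; only one line will read as English.
        String ciphertext = "KHOOR ZRUOG";
        for (int shift = 0; shift < 26; shift++) {
            System.out.println(shift + ": " + decrypt(ciphertext, shift));
        }
    }
}
```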

