
I need to identify which text is in English from a list of possible plaintexts generated by a brute-force attack on a ciphertext. I am thinking of using frequency distributions of English letters, such as this one from Wikipedia. I can easily generate the frequency distributions of my plaintexts. Now I need to compare each of the frequency distributions from my possible plaintexts to the ideal frequency distribution. How do I determine which one is "closest" to the ideal?
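
For concreteness, here is a minimal sketch of how I generate a letter-frequency distribution (Python; the function name and representation are just my own choices):

```python
from collections import Counter
import string

def letter_frequencies(text):
    """Relative frequency of each letter 'a'..'z' in text (case-insensitive)."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    total = len(letters) or 1  # avoid division by zero on letter-free input
    counts = Counter(letters)
    return {c: counts[c] / total for c in string.ascii_lowercase}
```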

  • 0
    See http://en.wikipedia.org/wiki/Divergence_%28statistics%29, http://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence, http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence, http://en.wikipedia.org/wiki/R%C3%A9nyi_entropy#R.C3.A9nyi_divergence 2012-08-09
  • 0
    @Code-Guru are you satisfied with the given answers? 2012-08-09
  • 0
    @SeyhmusGüngören I haven't decided yet. I'm still attempting to implement them in code. I will certainly accept one when I decide ;-) 2012-08-10
  • 0
    Maybe you are right. Sorry in any case. 2012-08-10

4 Answers

4

You can define a metric and check which one is closest.

If $f(i)$ is the reference (plain-English) distribution for your letters, where $0 \leq i \leq 25$, and $a_n(i)$ is the distribution in the $n$th text, you can define

$$d_1(f, a_n)= \sum_{i=0}^{25} |f(i)-a_n(i)| \,.$$

For each letter, the difference between the "plain English" frequency of that letter and its frequency in the candidate text is then added to your distance. A small distance means that the two are close to each other.

Most of the time, for problems of measuring how far apart two things are, people prefer to work with the metric

$$d_2(f, a_n)= \sqrt{\sum_{i=0}^{25} |f(i)-a_n(i)|^2} \,.$$

You can also use

$$d_{max}(f, a_n)= \max_{0 \leq i \leq 25} |f(i)-a_n(i)| \,.$$

I think $d_1$ is the most appropriate though.
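
For reference, a minimal sketch of these three distances (Python is my own choice here; f and a are assumed to be length-26 lists of relative frequencies, indexed 0 for a through 25 for z):

```python
import math

def d1(f, a):
    """Sum of absolute differences (the metric recommended above)."""
    return sum(abs(f[i] - a[i]) for i in range(26))

def d2(f, a):
    """Euclidean distance."""
    return math.sqrt(sum((f[i] - a[i]) ** 2 for i in range(26)))

def dmax(f, a):
    """Maximum absolute difference over all letters."""
    return max(abs(f[i] - a[i]) for i in range(26))

# Usage: pick the candidate whose distribution is closest to English.
# best = min(candidates, key=lambda a: d1(english, a))
```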

  • 0
    These all look familiar from my Numerical Linear Algebra course. I think this is what I'm looking for. At least they are simple enough to code. If I don't get the results I'm looking for, then I can move on to more complex methods. 2012-08-09
  • 0
    @Code-Guru: It shouldn't matter at all which metric you use for this purpose. The incorrect texts will look *nothing like* the correct distribution. What is harder is to determine a threshold: if you haven't tried all the possible keys, how do you decide whether the best match so far is good or not? One approach would be to take a number of English texts from different sources and see what the distance distribution looks like. 2012-08-09
  • 0
    @RossMillikan Thanks for the additional information. 2012-08-10
3

Just a remark (posted as an answer, since I do not have the right to comment): languages are usually identified by frequencies of n-grams (letter sequences, usually trigrams or 4-grams), not of single letters. A sketch of n-gram counting follows below.
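
For illustration, a hedged sketch of n-gram counting (Python; the function name, the trigram default, and the normalization are my own choices):

```python
from collections import Counter

def ngram_frequencies(text, n=3):
    """Relative frequencies of letter n-grams, ignoring non-letters."""
    letters = ''.join(c for c in text.lower() if c.isalpha())
    grams = [letters[i:i + n] for i in range(len(letters) - n + 1)]
    total = len(grams) or 1  # guard against texts shorter than n letters
    return {g: c / total for g, c in Counter(grams).items()}
```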

  • 0
    Thanks for the input. I'm familiar with n-grams. This question was prompted by a problem from [Project Euler](http://www.projecteuler.net). I'm keeping it as simple as possible. If frequencies of single letters don't help me find a solution, then I'll move on to longer n-grams. The problem isn't really differentiating between languages; rather, I need to identify which of $26^3$ decrypted texts is closest to English. 2012-08-13
  • 0
    p.s. You're a few steps closer to commenting privileges ;-) 2012-08-13
1

First of all, such lists exist for (I think) all languages; I have already seen English and German ones.

Coming back to your question: there are many ways to do this, and quite a few metrics are defined for exactly this job. One of the best-known measures among information-theory people is mutual information, which can be expressed as a relative entropy between two densities; relative entropy is also known as the Kullback-Leibler divergence.

I suggest you have a look at:

http://de.wikipedia.org/wiki/Kullback-Leibler-Divergenz
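
For a concrete (hedged) sketch in Python: the additive smoothing constant below is my own choice, needed because a letter that never occurs in a finite sample would otherwise produce $\log 0$.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """D(p || q) for two length-26 frequency lists. Note: not symmetric."""
    p = [x + eps for x in p]  # smooth to keep every entry positive
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)   # renormalize after smoothing
    return sum((pi / sp) * math.log((pi / sp) / (qi / sq))
               for pi, qi in zip(p, q))
```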

  • 0
    Or for those of us who can't read German: http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence 2012-08-09
  • 0
    :) True! You're right. 2012-08-09
  • 0
    I was simply providing an English translation for your link. I definitely appreciate you pointing me in the right direction. I just had to find a version of the information that I can actually understand. 2012-08-10
  • 0
    OK, of course. I posted it quickly without even checking what language it was in. 2012-08-10
  • 0
    No problem! Thanks a lot for your help (and the link). 2012-08-10
1

I would like to elaborate on the finite-sample-size problem. You will have a finite number of samples, and your histogram will not model the true density well enough. Actually, even the distribution you take as the reference is only an estimate of the true density, but it can be accepted as the true one.

With a finite sample size there can be really big deviations from the true model. In this case, so-called "features" are used to be as robust as possible.

From the true density $f_X(x)$ you obtain $g(f_X(x))$, where $g(\cdot)$ is a feature-extracting function. Alternatively, you can use $g(\mathbf{x})$, i.e., extract features from the data directly. Each will have different properties, but they will definitely be robust, as they are known as non-parametric solutions.

Finally, you again use a metric, this time between the features extracted from the true density and those extracted from your estimated histogram. One example of a group of features extracted directly from the density is the moments of the characteristic function. From the data itself, the cepstrum could be a good feature. Good features depend on your own problem. Accordingly, extracting several such features and finding the minimum distance based on them will make your system more reliable against both finite sample size and outliers; a small sketch follows below.
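
As a hedged illustration of this idea (the choice of raw moments of the letter-index distribution as features, the number of moments, and the Euclidean comparison are all mine, not the only possibilities):

```python
import math

def moment_features(freq, k=4):
    """First k raw moments of a length-26 letter-frequency list."""
    return [sum((i ** m) * freq[i] for i in range(26)) for m in range(1, k + 1)]

def feature_distance(f, a):
    """Euclidean distance between the moment features of two distributions."""
    return math.sqrt(sum((x - y) ** 2
                         for x, y in zip(moment_features(f), moment_features(a))))
```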

I hope this helps...