
I need to identify which text is in English from a list of possible plaintexts generated by a brute-force attack on a ciphertext. I am thinking of using frequency distributions of English letters, such as this one from Wikipedia. I can easily generate the frequency distributions of my plaintexts. Now I need to compare each of these frequency distributions to the ideal distribution. How do I determine which one is "closest" to the ideal?


4 Answers

4

You can define a metric and check which one is closest.

If $f(i)$ is the actual (plain-English) distribution for your letters, where $0 \leq i \leq 25$, and $a_n(i)$ is the distribution in the $n$th text, you can define

$d_1(f, a_n)= \sum_{i=0}^{25} |f(i)-a_n(i)| \,.$

For each letter, the difference between the "plain English" frequency of that letter and its frequency in the candidate text is added to your distance. A small distance means that the two distributions are close to each other.

Most of the time, for problems of measuring how far apart two things are, people prefer to work with the metric

$d_2(f, a_n)= \sqrt{\sum_{i=0}^{25} |f(i)-a_n(i)|^2} \,.$

You can also use

$d_{max}(f, a_n)= \max_{0 \leq i \leq 25} |f(i)-a_n(i)| \,.$

I think $d_1$ is the most appropriate though.
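
Since the question does not name a programming language, here is a minimal sketch in Python of the three distances above. The approximate `ENGLISH_FREQ` values and the two candidate strings are illustrative placeholders, not part of the original answer.

```python
from collections import Counter
from math import sqrt

# Approximate English letter frequencies (relative, summing to ~1),
# as found in the usual published tables; treat as illustrative values.
ENGLISH_FREQ = {
    'a': 0.0817, 'b': 0.0150, 'c': 0.0278, 'd': 0.0425, 'e': 0.1270,
    'f': 0.0223, 'g': 0.0202, 'h': 0.0609, 'i': 0.0697, 'j': 0.0015,
    'k': 0.0077, 'l': 0.0403, 'm': 0.0241, 'n': 0.0675, 'o': 0.0751,
    'p': 0.0193, 'q': 0.0010, 'r': 0.0599, 's': 0.0633, 't': 0.0906,
    'u': 0.0276, 'v': 0.0098, 'w': 0.0236, 'x': 0.0015, 'y': 0.0197,
    'z': 0.0007,
}

def letter_freq(text):
    """Relative frequency of each of the 26 letters in `text`."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters) or 1
    return {ch: counts.get(ch, 0) / total for ch in ENGLISH_FREQ}

def d1(f, a):
    """Sum of absolute differences, the d_1 metric above."""
    return sum(abs(f[ch] - a[ch]) for ch in f)

def d2(f, a):
    """Euclidean distance, the d_2 metric above."""
    return sqrt(sum((f[ch] - a[ch]) ** 2 for ch in f))

def dmax(f, a):
    """Largest single-letter difference, the d_max metric above."""
    return max(abs(f[ch] - a[ch]) for ch in f)

# Pick the candidate whose letter distribution is closest to English under d_1.
# The candidates are made-up examples (the first is a Caesar shift of the second).
candidates = ["wkh txlfn eurzq ira", "the quick brown fox"]
best = min(candidates, key=lambda t: d1(ENGLISH_FREQ, letter_freq(t)))
print(best)
```

Note that with very short candidate texts the estimated frequencies are noisy, so the margin between candidates can be small.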

3

Just a remark (posted as an answer since I do not have the right to comment): languages are usually identified by frequencies of n-grams (letter sequences, usually trigrams or 4-grams), not of single letters.
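
To make the remark concrete, here is a small sketch (Python, continuing the assumption above) of extracting n-gram frequencies and comparing them with the same $d_1$-style distance; the reference table and the corpus file name in the final comment are hypothetical placeholders.

```python
from collections import Counter

def ngram_freq(text, n=3):
    """Relative frequencies of letter n-grams (trigrams by default)."""
    letters = ''.join(c for c in text.lower() if c.isalpha())
    grams = [letters[i:i + n] for i in range(len(letters) - n + 1)]
    total = len(grams) or 1
    return {g: c / total for g, c in Counter(grams).items()}

def trigram_distance(reference, candidate_text):
    """Sum of absolute differences over all trigrams seen in either table."""
    cand = ngram_freq(candidate_text, n=3)
    grams = set(reference) | set(cand)
    return sum(abs(reference.get(g, 0.0) - cand.get(g, 0.0)) for g in grams)

# The reference table would be built once from a large known-English corpus,
# e.g. (hypothetical file name):
# reference = ngram_freq(open("english_corpus.txt").read())
```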

1

First of all, such frequency lists exist for (I think) all languages; I have already seen English and German ones.

Coming back to your question, there are many ways to do this; quite a few distance measures have been defined for exactly this job. One of the best known in information theory is the relative entropy between two densities, also called the Kullback-Leibler divergence (mutual information is the KL divergence between a joint distribution and the product of its marginals). Strictly speaking it is a divergence rather than a metric, since it is not symmetric.

I suggest you have a look at:

http://de.wikipedia.org/wiki/Kullback-Leibler-Divergenz
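
Since the answer gives no code, here is a minimal Python sketch of the Kullback-Leibler divergence between two letter-frequency dictionaries; the epsilon smoothing is an assumption of mine to cope with zero counts in a finite-sample histogram, and other smoothing schemes are possible.

```python
from math import log

def kl_divergence(p, q, eps=1e-9):
    """Kullback-Leibler divergence D(p || q) = sum_i p(i) * log(p(i) / q(i)).

    `p` and `q` are dicts mapping letters to (possibly unnormalised) counts
    or frequencies.  A small epsilon is added to every bin so that zero
    counts in a finite-sample histogram do not blow up the logarithm
    (a simple smoothing assumption, not the only possible one).
    """
    keys = set(p) | set(q)
    total_p = sum(p.get(k, 0.0) + eps for k in keys)
    total_q = sum(q.get(k, 0.0) + eps for k in keys)
    div = 0.0
    for k in keys:
        pk = (p.get(k, 0.0) + eps) / total_p
        qk = (q.get(k, 0.0) + eps) / total_q
        div += pk * log(pk / qk)
    return div
```

Typically one would take `p` to be the histogram estimated from a candidate text and `q` the reference English distribution; note again that the divergence is not symmetric, so swapping the arguments gives a different value.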

1

I would like to elaborate on the finite-sample-size problem. You will have a finite number of samples, so your histogram will not model the true density well enough. In fact, even the density taken as the "true" distribution is itself only an estimate, but it can be accepted as the true density.

With a finite sample size there can be large deviations from the true model. In this case, so-called "features" are used to be as robust as possible.

From the true density $f_X(x)$ you compute $g(f_X(x))$, where $g(\cdot)$ is a feature-extracting function; alternatively, you can apply $g$ directly to the data, $g(\mathbf{x})$. Each choice has different properties, but both tend to be robust, as they are non-parametric solutions.

Finally, you again use a metric, this time between the features extracted from the true density and those extracted from your estimated histogram. One example of such a group of features, extracted directly from the density, is the moments of the characteristic function; from the data itself, the cepstrum could be a good feature. Good features depend on your particular problem. Extracting several such features and finding the minimum distance based on them will make your system more reliable against both the finite sample size and outliers; a simple sketch follows below.
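
As a deliberately simple illustration of the feature idea, here is a Python sketch that uses low-order moments of the letter-index distribution as features and compares them with a Euclidean distance; both the choice of moments and the distance are illustrative assumptions of mine, not the specific characteristic-function or cepstral features mentioned above.

```python
ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def moment_features(freqs, orders=(1, 2, 3, 4)):
    """Low-order moments of the letter-index distribution.

    `freqs` maps 'a'..'z' to relative frequencies.  Plain distribution
    moments are used here as a simple stand-in for the moment-type
    features discussed above.
    """
    total = sum(freqs.values()) or 1.0
    return [
        sum((i ** k) * freqs.get(ch, 0.0) / total for i, ch in enumerate(ALPHABET))
        for k in orders
    ]

def feature_distance(freqs_a, freqs_b):
    """Euclidean distance between the two feature vectors (an arbitrary choice)."""
    fa, fb = moment_features(freqs_a), moment_features(freqs_b)
    return sum((x - y) ** 2 for x, y in zip(fa, fb)) ** 0.5
```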

I hope this helps...