I am trying to implement working code for the method described in paragraph 4.1 of this paper.
The problem:
Suppose we have words with lengths, for instance, $l_1 = 1$, $l_2 = 2$, $l_3 = 3$, $l_4 = 8$ and $l_5 = 7$.
These words make up the white-list.
We calculate the sample mean and the variance of the lengths of these words.
$\mu = \frac{1}{N}\sum_{i = 1}^N X_i$
So, $\mu = 4.2$ in our case.
The next step is to calculate the variance.
$\sigma^2 = \frac{1}{N}\sum_{i = 1}^N (X_i - \mu)^2$
So, $\sigma^2 = 7.76$
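To double-check the two numbers above, here is a minimal Python sketch of the calculation. The function name `mean_and_variance` is my own and does not come from the paper; it uses the $\frac{1}{N}$ form of the variance, exactly as in the formula above.

```python
# Word lengths from the example above
lengths = [1, 2, 3, 8, 7]

def mean_and_variance(xs):
    """Return the sample mean and the 1/N variance of the values in xs."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

mu, var = mean_and_variance(lengths)
print(round(mu, 2), round(var, 2))  # prints: 4.2 7.76
```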
Once these values are computed, we receive another list of words, and the goal of the algorithm is to assess how anomalous a string of length $l$ is by calculating the "distance" of $l$ from the mean $\mu$ of the length distribution.
This distance is expressed with the help of the Chebyshev inequality.
$p(\mid x-\mu \mid > t) < \frac{\sigma^2}{t^2}$
When l is far away from $\mu$, considering the variance of the length distribution, then the probability of any (legitimate) string x having a greater length than l should be small. Thus, to obtain a quantitative measure of the distance between a string of length l and the mean $\mu$ of the length distribution, we substitute t with the difference between $\mu$ and l.
$p(\mid x-\mu \mid > \mid l-\mu \mid) < p(l)=\frac{\sigma^2}{(l-\mu)^2}$
Using the values above, if I evaluate $p(l)$ for the lengths 1, 5 and 10, I get these probabilities:
p(1) = 0.757
p(5) = 12.125
p(10) = 0.230
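This is roughly how I compute these values, as a small Python sketch. The function name `chebyshev_bound` is my own, and $\mu = 4.2$ and $\sigma^2 = 7.76$ are hard-coded from the example rather than recomputed.

```python
def chebyshev_bound(l, mu, var):
    """Evaluate var / (l - mu)^2, the right-hand side of the formula for p(l).

    Undefined for l == mu (division by zero).
    """
    return var / (l - mu) ** 2

mu, var = 4.2, 7.76  # mean and variance of the white-list lengths

results = {l: chebyshev_bound(l, mu, var) for l in (1, 5, 10)}
for l, p in results.items():
    print(f"p({l}) = {p:.3f}")
```

The expression is larger than 1 whenever $(l-\mu)^2 < \sigma^2$, which is the case for $l = 5$ here, since $(5 - 4.2)^2 = 0.64$ and $\sigma^2 = 7.76$.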
I don't understand why some of the values I get are bigger than 1, since probabilities are not supposed to exceed 1. I am trying to understand whether the formulas described above are correct or whether I am using them incorrectly.
Thank you.