I'm trying to write a program that computes an entropy-like metric for files, to estimate the probability that a file is compressed or encrypted. Compressed and encrypted files have much higher entropy than most other types of files.
I need a figure between 0 and 1. Zero being not random at all and 1 being totally random. I don't expect any input will ever give 0 or 1, though.
I have a paper here which describes a simple runs test. We have a sequence of $n$ values.
Starting at the first value, we write a "+" if the following value is greater, or a "-" if the following value is smaller.
Consecutive +'s and -'s are grouped into runs, and we can say we have $r$ runs.
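Here's a sketch of how I'm counting runs (my own code, not from the paper; treating the file as a sequence of byte values and skipping ties are my assumptions, since the paper doesn't say):

```python
def count_runs(values):
    """Count runs up and down in a sequence of values."""
    # Build the +/- sequence: compare each value with the one after it.
    signs = []
    for a, b in zip(values, values[1:]):
        if b > a:
            signs.append('+')
        elif b < a:
            signs.append('-')
        # Equal neighbours produce no symbol (my assumption).
    # Count maximal groups of consecutive identical symbols.
    runs = 0
    prev = None
    for s in signs:
        if s != prev:
            runs += 1
            prev = s
    return runs
```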
We say $E(r) = \frac{2n - 1}{3}$ and $Var(r) = \frac{16n - 29}{90}$.
(Don't ask me where these come from - I have no idea. Apparently for $n \gt 20$ we can say that $r$ is reasonably approximated by a normal distribution. The document doesn't explain. If you know, I'd love to know too!).
We can get a $Z$ value which is $\frac{r - E(r)}{\sqrt{Var(r)}}$
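In code I compute the statistic like this (assuming the variance formula uses $n$, as above):

```python
from math import sqrt

def runs_z(r, n):
    """Z statistic for the runs-up-and-down test, given r runs in n values."""
    expected = (2 * n - 1) / 3
    variance = (16 * n - 29) / 90
    return (r - expected) / sqrt(variance)
```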
If we do a significance test at, say, $\alpha = 0.05$, then we compare $Z$ against the two-sided critical value $z_{\alpha/2} \approx 1.96$: if $-z_{\alpha/2} \lt Z \lt z_{\alpha/2}$ we cannot reject the hypothesis that the sequence is random at that level of significance, otherwise we reject it.
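So the decision step, as I understand it, is just a comparison against the critical value (about 1.96 for $\alpha = 0.05$):

```python
def looks_random(z, critical=1.96):
    """Two-sided test: True means we cannot reject randomness at alpha = 0.05."""
    return abs(z) < critical
```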
First of all: is this runs test suitable for determining the entropy of a file? Secondly, how do I convert the $Z$ value into a figure between 0 and 1? My first guess was to put it directly into the CDF of a standard normal, but the results didn't seem to match my test data: things that should have been very random were giving very, very low entropy figures.
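For reference, this is roughly the mapping I tried (standard normal CDF via `math.erf`):

```python
from math import erf, sqrt

def phi(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# My first guess: use phi(z) directly as the 0-to-1 randomness figure.
```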