
We are generating random digits: 0 1 2 3 4 5 6 7 8 9

Question: how many samples do I need before I can assume that each digit has a 10% chance?

Is there a formula for this, given that the generator produces the digits 0 1 2 3 4 5 6 7 8 9?

E.g.: do I need to generate one trillion numbers (how many exactly?) before each digit appears 10% of the time?

If the chance is 10% for every digit, when will the observed frequency actually be 10%?

Related: *If we would have a perfect random decimal number generator, what would the chances be for the occurrence of the numbers?*

  • I suppose you also want to rule out something stupid like `next_number = (previous + 1) % 10`, which will actually hit the "exactly 10%" condition far more reliably and more often than a real random string of digits would. You might look at http://stackoverflow.com/questions/2130621/how-to-test-a-random-generator. This is a question that computer scientists have written whole chapters of books to answer. Maybe even entire books. (2017-01-13)

1 Answer


If the process generates a uniformly random digit, the proportion of each digit will tend to $10\%$ as the sample size grows, by the law of large numbers; however, the probability that a digit is generated *exactly* $10\%$ of the time actually tends to zero (it shrinks like $1/\sqrt{n}$, by the local central limit theorem).
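To see this concretely, here is a small illustration of my own (not from the answer): the exact binomial probability that a fixed digit appears exactly $n/10$ times in $n$ draws, computed in log space to avoid overflow, next to the local-CLT approximation $1/\sqrt{2\pi n p(1-p)}$.

```python
from math import exp, lgamma, log, pi, sqrt

def prob_exact(n, p=0.1):
    """P(Binomial(n, p) == n*p), via log-gamma to avoid huge binomials."""
    k = round(n * p)
    log_prob = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                + k * log(p) + (n - k) * log(1 - p))
    return exp(log_prob)

for n in (100, 10_000, 1_000_000):
    # Local CLT approximation: P ≈ 1 / sqrt(2*pi*n*p*(1-p))
    print(n, prob_exact(n), 1 / sqrt(2 * pi * n * 0.1 * 0.9))
```

Both columns shrink toward zero as $n$ grows, even though the *proportion* concentrates ever more tightly around $0.1$.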

There is no way to verify randomness conclusively, because there are cryptographic PRNGs that generate non-random bit streams that cannot be statistically distinguished from truly random bit streams. However, you can ask for confidence bounds. If the stream of digits were truly uniformly random, then for large sample sizes the central limit theorem gives an approximate confidence interval for the number of occurrences of a particular digit, at any confidence level. For example, about $95\%$ of the time the count of the digit '1' would lie within $2$ standard deviations of its mean. So if you observe a count outside that range, you can be $95\%$ confident that the stream is not uniformly random.

Note that this applies when you pick only one digit to test. If you test all ten digits, there is a significant probability that at least one count lies beyond $2$ standard deviations from its mean. Also, passing the test does not give you $95\%$ confidence that the stream is random! All it tells you is that you failed to distinguish it from $95\%$ of truly random digit streams.

It is also difficult to combine such weak tests, because the more tests you run, the more likely a truly uniformly random digit stream is to fail at least one of them. One way to make a single test depend on all the digits is to look at the digit whose frequency deviates the most from the mean. Exact confidence intervals are hard to compute because the ten frequencies are not independent, but I think the probability that the most deviant frequency stays within $3$ standard deviations of the mean would be about $0.997^9 \approx 0.97$ (any $9$ of the frequencies are nearly independent but completely determine the $10$th), and hence if you observe the opposite you can be roughly $97\%$ confident that the digit stream is not random.
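A sketch of that "most deviant digit" test, with the same heuristic $3$-standard-deviation threshold (function names are my own):

```python
from collections import Counter
from math import sqrt

def max_deviation_in_sd(digits, p=0.1):
    """Largest deviation of any digit's count from its mean, in SD units."""
    n = len(digits)
    mean, sd = n * p, sqrt(n * p * (1 - p))
    counts = Counter(digits)
    return max(abs(counts.get(d, 0) - mean) for d in range(10)) / sd

def looks_nonrandom(digits, threshold=3.0):
    """Flag the stream if some digit deviates by more than `threshold` SD.
    Per the heuristic above, a uniform stream passes ~97% of the time."""
    return max_deviation_in_sd(digits) > threshold

stream = [i % 10 for i in range(100_000)]
print(max_deviation_in_sd(stream), looks_nonrandom(stream))
```

Note the periodic stream in the usage example passes this frequency test perfectly, which echoes the comment under the question: equidistribution alone does not prove randomness.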

Another kind of test uses compression algorithms (in this case from a decimal stream to a decimal stream). A truly random stream cannot be compressed at all on average. A good rule of thumb is that compression by at least $k$ bits has probability at most $2^{-k}$. This is not literally true, since a compression algorithm could be built to compress one particular stream and bloat all others, but general-purpose compression algorithms behave much as the rule of thumb says. You can create multiple compression algorithms by preceding one with various transforms (in your case you can try iterates of the forward difference modulo $10$). Suppose you try $n$ compression algorithms, and one of them succeeds in compressing by $k$ bits. According to the rule of thumb, the probability of that happening for a truly random stream is at most $n2^{-k}$, so you can claim $(1-n2^{-k})$ confidence that the decimal stream is not random.
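A rough sketch of the compression test using `zlib` as the stand-in general-purpose compressor (the answer does not name a specific one). One subtlety, which the code accounts for: encoding decimal digits as ASCII bytes uses $8$ bits per digit, while a uniform digit only carries $\log_2 10 \approx 3.32$ bits, so the saving must be measured against the entropy baseline, not the raw byte length.

```python
import math
import zlib

def bits_saved(digits):
    """Bits saved by zlib relative to the entropy of uniform decimal digits.
    Negative means the stream did not compress below its entropy."""
    raw = bytes(48 + d for d in digits)          # digits as ASCII bytes
    compressed = zlib.compress(raw, 9)
    entropy_bits = len(digits) * math.log2(10)   # best possible for uniform digits
    return entropy_bits - 8 * len(compressed)

def confidence_nonrandom(k_bits, n_algorithms=1):
    """1 - n*2^-k (clipped at 0): rule-of-thumb confidence of non-randomness
    after trying n compressors and seeing a saving of k bits."""
    return max(0.0, 1.0 - n_algorithms * 2.0 ** -k_bits)

periodic = [i % 10 for i in range(10_000)]
k = bits_saved(periodic)
print(k, confidence_nonrandom(k))
```

The periodic stream compresses massively below its nominal entropy, so the rule of thumb assigns it essentially full confidence of non-randomness, even though it passed the frequency tests above.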