
For fun (relatively speaking), I'd like to do a little research on probabilities related to a pseudo-random number generator.

Specifically, how would I get started figuring out "what is the probability that a prng will generate the same random number twice?"

This is part of an investigation into a larger issue, but I'm not sure where to begin (or if my question is even relevant).

Edited to add:

After reading the comments and answers I see that my question isn't clear. As an example: the RNG outputs a 5. What is the probability that the next output will also be a 5? How many "rolls" of the RNG must I make before I see another 5?

If we assume a good RNG has a uniform distribution, then is it 1 out of $2^{32}$ (assuming 32-bit integers)?

For a larger example, let's say I use the RNG to generate numbers that are used as indices into an array of characters. The array is 25 characters long, so I take the RNG's output mod 25. I will generate a sequence of 5 characters.

How would I determine the probability of generating the same 5 character sequence?

If we assume that the RNG's output is uniformly distributed, then the particular RNG used shouldn't factor into the probability of generating the same character sequence.

However, is my assumption valid?

  • @J.M. I've updated the original question. What isn't clear to me is how the period relates to the uniform distribution of output. (2011-05-14)

3 Answers


All pseudorandom number generators are eventually periodic; you can thus expect any number you've seen to appear again if you run Monte Carlo long enough. What you really want to ask, I think, is how long their periods are. To list the ones I'm most familiar with: linear congruential generators (often used in calculators) typically have (very!) short periods (though two or more of these can be combined to produce a generator with a somewhat longer period), the Marsaglia-Zaman generator has a period of $\approx 5\times 10^{171}$, and the now-popular Mersenne twister by Matsumoto and Nishimura has a period of $2^{19937} - 1\approx4\times 10^{6001}$. One necessary condition for a quality PRNG is a long period (but this isn't a sufficient condition, as it should also pass a battery of stringent statistical tests). Have a look at these two articles for more information.
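To make the periodicity concrete, here is a toy linear congruential generator. The multiplier and increment are borrowed from common glibc-style constants, but the modulus is deliberately shrunk to $2^{16}$ (an illustrative choice, not a real library's parameters) so the full cycle can be found exhaustively:

```python
# Toy LCG: x_{k+1} = (a*x_k + c) mod m. With a small modulus the period
# is at most m, so a repeat is guaranteed within m draws.

def lcg(seed, a=1103515245, c=12345, m=2**16):
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

def period(seed, **kw):
    """Length of the cycle the generator falls into, found by
    recording each state and detecting the first repeat."""
    seen = {}
    for i, x in enumerate(lcg(seed, **kw)):
        if x in seen:
            return i - seen[x]
        seen[x] = i

print(period(42))  # full period: 2**16 = 65536
```

These parameters happen to satisfy the Hull-Dobell conditions, so the period is the full modulus; changing `a` or `c` carelessly can shrink it drastically.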

  • "In a good PRNG, the same number will occur twice at lots of irregular intervals, and this won't have anything to do with periodicity for a very long time." I think this is what I'm interested in. (2011-05-14)

This is an answer to the edited question, taking into account your comment that you're interested in "In a good PRNG, the same number will occur twice at lots of irregular intervals, and this won't have anything to do with periodicity for a very long time."

In a deterministic machine with a finite state space, the output of every computation is necessarily finite or eventually periodic. However, as J.M. has pointed out, there are good pseudo-random number generators with very long periods. For practical purposes, we can assume that a good PRNG has a uniform distribution, as you do in your new question. But then the question is no longer really about PRNGs; it's about random numbers. What you're asking for now is just the probability of drawing a certain number from a uniform distribution, and (presumably) the expected waiting time until it appears.

If the numbers are integers uniformly distributed between $0$ and $2^n-1$ (where $n$ is the number of bits), then the probability of a particular one of them occurring is $1$ in $2^n$. (You wrote $2^n-1$; there's no reason to subtract one; all of the $2^n$ numbers are equally probable.)

The expected waiting time $t$ has to satisfy the equation $t = p \cdot 1 + (1 - p) \cdot (1 + t)$, where $p$ is the probability $p=2^{-n}$ of getting the number you're waiting for. This is because with probability $p$ you have to wait just once, and with probability $1-p$ you wait once and then end up in the same situation, having to wait for the expected waiting time. Solving for $t$ yields $t=p^{-1}=2^n$. So for $32$-bit numbers you have to draw $2^{32}$ numbers on average to draw a specific number, and in particular to draw a specific number again that you'd already drawn.
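The waiting-time argument above is easy to check by simulation. The sketch below uses a small range ($n = 8$ bits rather than $32$) so it finishes quickly; the average number of draws until a specific value appears should come out close to $2^n = 256$:

```python
import random

def draws_until(target, n_bits, rng):
    """Count draws from a uniform n_bits-wide distribution until
    `target` appears."""
    count = 0
    while True:
        count += 1
        if rng.randrange(2**n_bits) == target:
            return count

rng = random.Random(0)  # fixed seed for reproducibility
n_bits = 8
trials = 20000
avg = sum(draws_until(5, n_bits, rng) for _ in range(trials)) / trials
print(avg)  # close to 2**8 = 256
```

The individual waiting times are geometrically distributed, so they scatter widely around the mean; a few thousand trials are needed before the average settles near $p^{-1}$.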

  • Okay, I think I understand. Given that good PRNGs, such as the ones mentioned by @J.M., have a sufficiently long period (larger than the output space of either 32-bit or 64-bit numbers) and pass statistical randomness tests, the output is uniform. Thus, the characteristics of the RNG (mainly its not being truly random) shouldn't have any impact on any probability calculations I might want to make. (2011-05-14)

Others have answered your question well, so this answer will focus on the cryptographic security of pseudorandom number generators.

The only totally secure encryption scheme is the one-time pad, which requires pre-sharing a random key as long as the message. In practice this is rarely done, since pre-sharing keys of that length is impractical or outright impossible. However, if both parties can run a cryptographically secure pseudorandom number generator (CSPRNG) that somehow starts at the same point, they can build a stream cipher that is a fairly good approximation to the one-time pad. Note that it is the deterministic nature of the PRNG that makes this possible: if it were nondeterministic, the receiver would be unable to recreate the keystream, and thus unable to decrypt and recover the message.

How do we make sure that a PRNG is "sufficiently random" to be secure? It is not enough for each outcome to be equiprobable. We also need forward secrecy, meaning that someone who reads an arbitrarily long stretch of the output gains no advantage in predicting the next outcome. The most common way of measuring this is Claude Shannon's definition of entropy. The basic (first-order) entropy is $J = - \sum_{i=1}^k p(x_i) \log_2 p(x_i),$ where the $x_i$ are the potential outcomes and $p(x_i)$ is the probability of outcome $x_i$. If the PRNG is uniformly distributed, then $J = \log_2 k$. But we want to test more than the randomness of each individual output, so we look at the $n$th-order entropy $J_n = -\frac{1}{n} \sum_{i=1}^{k^n} p(x_i) \log_2 p(x_i),$ where the $x_i$ now range over all sequences of $n$ outputs concatenated together. With a PRNG, many of these sequences will never occur, so many of the probabilities will be $0$. Finally, we can define the absolute minimum entropy $E = \min_{n \geq 1} J_n.$ We should note that for a PRNG, which is periodic, the absolute minimum entropy must be $0$ (why?). But by looking at approximations to it using large values of $n$, we can get a better idea of the probabilistic nature of the PRNG. This is something you could actually compute for an RNG of small period.
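The entropy deficit is easy to observe empirically for a generator of small period. The sketch below estimates $J_n$ from observed block frequencies, using a deliberately bad toy LCG (hypothetical parameters, full period only 16) so the drop in higher-order entropy is visible:

```python
import math
from collections import Counter

def block_entropy(seq, n):
    """Empirical per-symbol entropy of length-n blocks, i.e. an
    estimate of J_n from observed frequencies."""
    blocks = [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / n

def bad_lcg(seed, length, a=5, c=3, m=16):
    """Toy full-period LCG mod 16 -- every 4-bit value appears
    equally often, but the sequence itself is rigidly determined."""
    out, x = [], seed
    for _ in range(length):
        x = (a * x + c) % m
        out.append(x)
    return out

seq = bad_lcg(1, 10000)
print(block_entropy(seq, 1))  # = log2(16) = 4: single outputs look uniform
print(block_entropy(seq, 4))  # far lower: only 16 distinct 4-blocks occur
```

First-order entropy is maximal, yet $J_4$ collapses because only $16$ of the $16^4$ possible blocks ever occur; this is exactly the gap between "equiprobable outputs" and real unpredictability.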

In practice, a large battery of tests is run on a potential CSPRNG, including Knuth's algorithms from The Art of Computer Programming, Vol. 2, the DIEHARD tests (mentioned by J.M.), the Crypt-XS package, standards from the Handbook of Applied Cryptography, and the powerful NIST Statistical Test Suite. The latter contains several of the tests from each of the others, in addition to many of its own.

To elaborate on some of the subtler points above: getting a CSPRNG to "start at the same place" is done by seeding. A seed is a relatively short input to a PRNG that, in a very loose sense, gives it a starting position. The seed should ideally be truly random, which in practice is best approximated by hardware random number generators, and because of its small length it can be shared between the two parties via some encryption scheme.

Finally, randomness can be deterministically condensed from weak sources of entropy. While the built-in random functions of most programming languages are not sufficiently random, several outputs can be concatenated together and hashed with a cryptographic hash function such as one of the SHA family. This increases the first-order entropy, and if concatenations of different (random) lengths are used, then higher-order entropy increases as well. This is the basic idea behind randomness extractors.
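The concatenate-and-hash idea can be sketched in a few lines. This is an illustration only, with a hypothetical stand-in for a weak source; real applications should draw seeds from `os.urandom` or a vetted CSPRNG, not from anything like this:

```python
import hashlib
import random
import time

def weak_outputs(k):
    """Stand-in for a low-quality entropy source; in practice this
    might be event timings, a built-in rand(), etc."""
    r = random.Random(time.time_ns())
    return [r.getrandbits(32) for _ in range(k)]

def extract(k=8):
    """Concatenate k weak 32-bit outputs and condense them with a
    cryptographic hash, yielding a 32-byte digest."""
    data = b"".join(x.to_bytes(4, "big") for x in weak_outputs(k))
    return hashlib.sha256(data).digest()

seed = extract()
print(len(seed))  # 32 bytes, suitable as seed material
```

Note that hashing cannot create entropy: the digest has at most as much entropy as went into the input, which is why the weak source must still supply enough raw unpredictability.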

  • @J.M.: Yes, Blum Blum Shub is common in cryptographic applications. Yarrow is still used quite a bit, but should be replaced by its stronger successor Fortuna; both were designed in part by Bruce Schneier. (2011-05-14)