2
$\begingroup$

What is probability of having a String with random length with random characters ? For example I want to generate 100 Strings with random length k (between 1 to 100) and generate k numbers of characters (between 1-256 ASCII characters). What's the probability to get a duplicate (exact same string with same length and characters) ?

Here's the psuedocode

    for i=0 to 100           int size = random_int_range(1,100)           str = ""                          for j=0 to size               str += (char) random_int_range(1,256)     

random_int_range is uniformly distributed.

  • 0
    Can you explain step by step by in words ? I think i can get it without exact expression.2012-10-03

1 Answers 1

1

Here goes an approximation.

We have $N=100$ strings of variable lengths, uniformly distributed in $1\cdots N$. Let $k_m$ be the number of strings of length $m$. We have $C^m$ ($C=256$) distinct strings, all equiprobable. And the probability of no collisions inside this set (given $k_m$) is , from the Birthday problem:

$p_m = \frac{C^m !}{C^{m k_m} (C^m-k_m)!}$

for $k_m < C^m$.

But $k_m$ is a random variable, and further $k_i, k_j$ are not independent ($\sum k_i = N$). We know, however, that $E(k_m)=1$. Lets introduce the approximation ("poissonization") of considering the $k_m$ as iid Poisson variables, so

$P(k_m=k) = e^{-1} \frac{1}{k!}$

Then the probability of no collision in the set $m$ is

$ p_m \approx \sum_{k=0}^{C^m} e^{-1} \frac{1}{C^{m k}} {C^m \choose k} = e^{-1} \left(1 + \frac{1}{C^m}\right)^{C^m} $

and the global probability is the product of the above for $m=1\cdots N$. In practice, only the first few terms have influence. One can even approximate further taking the first term only so that

$p \approx e^{-1} (1 + 1/C)^{C} = 0.998052$

or take logarithms and approximate even further:

$\log(p) \approx - \frac{1}{2 C} = -\frac{1}{512} $