2
$\begingroup$

What is probability of having a String with random length with random characters ? For example I want to generate 100 Strings with random length k (between 1 to 100) and generate k numbers of characters (between 1-256 ASCII characters). What's the probability to get a duplicate (exact same string with same length and characters) ?

Here's the psuedocode

    for i=0 to 100
          int size = random_int_range(1,100)
          str = ""              

          for j=0 to size
              str += (char) random_int_range(1,256)    

random_int_range is uniformly distributed.

  • 0
    You need to tell us the distributions. Are they each uniform? Or are all strings equally likely? Are the $k$s the same for all 100 strings?2012-10-03
  • 0
    @Henry they are uniformly distributed, and different k for all 100 strings2012-10-03
  • 0
    @Henry I dont think he is thinking about pdfs, I would simply guess they are equally likely2012-10-03
  • 0
    Duplicate means equal length and equal content, right?2012-10-03
  • 0
    @Sasha that's correct2012-10-03
  • 0
    It's a little cumbersome to get an exact expression, but for practical purposes an approximation (via Poissonization) gives a probabily of $1- \approx \exp(-1/510) \approx 0.001958 \approx 2/1000$2012-10-03
  • 0
    Can you explain step by step by in words ? I think i can get it without exact expression.2012-10-03

1 Answers 1

1

Here goes an approximation.

We have $N=100$ strings of variable lengths, uniformly distributed in $1\cdots N$. Let $k_m$ be the number of strings of length $m$. We have $C^m$ ($C=256$) distinct strings, all equiprobable. And the probability of no collisions inside this set (given $k_m$) is , from the Birthday problem:

$$p_m = \frac{C^m !}{C^{m k_m} (C^m-k_m)!}$$

for $k_m < C^m$.

But $k_m$ is a random variable, and further $k_i, k_j$ are not independent ($\sum k_i = N$). We know, however, that $E(k_m)=1$. Lets introduce the approximation ("poissonization") of considering the $k_m$ as iid Poisson variables, so

$$P(k_m=k) = e^{-1} \frac{1}{k!}$$

Then the probability of no collision in the set $m$ is

$$ p_m \approx \sum_{k=0}^{C^m} e^{-1} \frac{1}{C^{m k}} {C^m \choose k} = e^{-1} \left(1 + \frac{1}{C^m}\right)^{C^m} $$

and the global probability is the product of the above for $m=1\cdots N$. In practice, only the first few terms have influence. One can even approximate further taking the first term only so that

$$p \approx e^{-1} (1 + 1/C)^{C} = 0.998052$$

or take logarithms and approximate even further:

$$\log(p) \approx - \frac{1}{2 C} = -\frac{1}{512} $$