0
$\begingroup$

We are using character strings (out of 64 possible characters) to identify unique elements in a 'bag' of data. each bag has on average 20 elements, but up to 300.

I am trying to determine what size should that string be so that the probability of a collision (if we pick the characters randomly) is less than 1 in a 1,000,000 for 20 elements, and then for 300 elements.

2 Answers 2

1

As an approximate answer, if you have a bag of $20$ elements, there are $\frac {20\cdot 19}2=190$ pairs. If the probability of a match given one pair is $p$, the probability of no match in one pair is $(1-p)$ and the probability of no match in $20$ elements is $(1-p)^{190}$. For very small collision probabilities (like you want) $(1-p)^{190} \approx 1-190p$, so you need $190p \lt 10^{-6}$ or (rounding) $p \lt 5 \cdot 10^{-9}$. Then if your string is $n$ characters long, you need $64^{-n} \lt 5 \cdot 10^{-9}$ Again rounding down and taking base two logs, $-6n \lt -29$, so $n=5$ will work. For $300$, you have about $15^2=225$ times as many pairs, so need two more characters for $7$. With all the rounding down, $6$ might suffice, but how much does one more cost?

1

From Wikipedia ( http://en.wikipedia.org/wiki/Birthday_problem#Probability_table), a one in a million chance to collide is reached very quickly; with only 32 bits (a 4-byte integer value in most languages), 93 elements have a one in a million chance to collide. You can Base-64 encode this integer using your alphabet. With 64 bits (a "long" in most languages), the number of hashes at which there's a one-in-a-million chance increases to 6.1 million.