2
$\begingroup$

I'm working on a small web service that will require users to upload files. The file type is P12. When the file is uploaded and I need to save it on my server, I need to generate a random name for it. I can have the server generate a long random string of letters and numbers called a UUID. Here's a UUID I generated:

5025e8c423cad (so the file would be named 5025e8c423cad.p12) 

I read online that a UUID would have to be generated 1 billion times per second for the next 100 years in order to have a 50% chance of being generated again. I'm assuming that Facebook uses something similar, if not UUIDs, to name photos when people upload them. If it stayed around for a very long time, Facebook could meet the maximum (and photos would be overwritten if they did not do something).

This might be a stupid question, but if I changed the name from uuid.p12 to uuid-uuid.p12, what would be the probability of generating the same one after one hundred years? I was quick to think 25%, but after spending a minute thinking about it, I would assume the percentage would be much, much lower.

5025e8c423cad.5025ea6c54cd1.p12 
  • 1
    It's relevant to the answer what the size of the alphabet you use is. Most of the answers below assume you can use any letters, but I reckon you're only using a to f. Please clarify.2012-08-11

2 Answers 2

2

Suppose that there are $n$ IDs available, and we generate $m$ IDs at random. If $m$ is much smaller than $n$, then the probability of no duplicate is approximately $e^{-\frac{m^2}{2n}}.\tag{$1$}$ For an informal description of this result, and some references, see the Wikipedia article on the Birthday Problem.

In your case, you can compute the $n$ for your current ID scheme. To determine how many times we can randomly generate IDs so that the probability of no duplicate is about $50\%$, we find the $m$ that satisfies the equation $1-e^{-\frac{m^2}{2n}}=0.5.$

Suppose now that we double the length of IDs. The number $N$ of IDs available is now $n^2$. Let $M$ be the new number of times we can generate IDs with the probability of a duplicate equal to $50\%$. Then we want $\frac{m^2}{2n}=\frac{M^2}{2N}.$ Using $N=n^2$, we find that $M= m\sqrt{n}.$ So the number of times we can "safely" sample (if $50\%$ is safe!) is multiplied by the extremely large number $\sqrt{n}$. If you change the definition of "safe" to allow only a probability $0.001$ of a duplicate, the same relative calculations holds, one multiplies by $\sqrt{n}$.

To calculate the probability of no duplicate if we double the length of IDs, leaving $m$ alone, simply replace $n$ by $n^2$ in formula $(1)$. Thus the probability of a duplicate is $1-e^{-\frac{m^2}{2n^2}}$. But when $x$ is close to $0$, $e^x$ is approximately $1+x$, so the probability of a duplicate with the double ID scheme is approximately $\frac{m^2}{2n^2}.$ Since $\frac{m}{2n}$ is approximately $0.5$, we get the estimate $\dfrac{0.5}{n}$ for the new probability of no duplicate.

Remark: The formula we have used is only approximate, but it is a pretty good approximation. You may want to check what probability it predicts for the classical Birthday Problem, where $n=365$ and the "break even" $m$ is about $23$.

1

It looks like your UUIDs are strings of 13 symbols from 36 different symbols (26 letters plus 10 digits), so there are $36^{13}$ different possible UUIDs. Very roughly speaking, you need to generate about the square root of that to have an appreciable chance of getting a repeat, so about $6^{13}$. Does that come to a billion ($10^9$) per second for 100 years? I haven't done the arithmetic, but eyeballing it I'd say 100 years is an overestimate.

If you up it to strings of 26 symbols, there are $36^{26}$ strings, so about $6^{26}$ to get a repeat, which is $6^{13}\times6^{13}$, which is much bigger than $6^{13}$. Thus it would take much, much longer to get to the 50% level for a repeat, or, in terms of your question, much, much less that 50% chance of a repeat after 100 years.

  • 0
    Yes, thanks, I'll edit.2012-08-11