
I have a large set of data and a copy of that data. The whole data set is $n$ bytes. I want to be 99.999% certain that the sets are identical. Assuming that copying errors occur randomly, how many bytes do I need to randomly select and compare against the reference to be 99.999% certain the two sets are completely identical?

This problem, it appears to me, relates to this one: Determining Sample Size for a Desired Margin of Error -- however, I'm confused that both the margin of error and the confidence interval occur in the formula, while the sample size does not depend at all on the input size (in that example, the total number of students).

  • There is no such thing as 99.999% certain being identical. The sets are identical or they are not. You can determine a sample size of $n$ with 99.999% certainty that $x$ does not deviate more than so and so from $y$. That's where the margin of error comes in. So you do not have enough information to determine how many bytes you need, if the maximum allowed error is not addressed...2017-01-27
  • I'm not sure I understand, to be honest. So say I want to be 99.999% certain that the amount of difference between both sets is less than 0.2%, how would I calculate $n$ then? Is this enough information?2017-01-27

1 Answer


If I understand your question correctly, you want to treat your data as a set of Bernoulli random variables ($X=1$ if a given byte is correct) and determine whether the "real" proportion of correct bytes is $p=1$. So you construct a one-sided confidence interval around $\widehat{p}=\frac{1}{N}\sum_{i=1}^N x_i=1$: if all $N$ sampled bytes match, the rule of 3 says that, with 95% confidence, the true proportion of incorrect bytes is below $3/N$. Generalized to confidence level $1-\delta$, the bound becomes $\ln(1/\delta)/N$, so with $\delta=10^{-5}$ (99.999% confidence) you choose $N$ such that $\ln(10^5)/N \approx 11.5/N$ is below your maximum tolerated error proportion. See this post and this wiki page.
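As a sketch of that calculation (assuming the generalized rule of 3, $N \ge \ln(1/\delta)/\varepsilon$, where $\varepsilon$ is the maximum tolerated error proportion, e.g. the 0.2% from the comments):

```python
import math

def sample_size(delta, epsilon):
    """Bytes to sample so that, if every sampled byte matches, the
    one-sided upper confidence bound on the error proportion is at
    most epsilon at confidence level 1 - delta (generalized rule of 3)."""
    return math.ceil(math.log(1 / delta) / epsilon)

# 99.999% confidence that fewer than 0.2% of bytes differ:
print(sample_size(1e-5, 0.002))  # -> 5757
```

Note that at the usual 95% level ($\delta = 0.05$) the bound $\ln(1/0.05) \approx 3$ recovers the familiar rule of 3.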

Note that your $n$ doesn't matter (unless $N$ becomes an appreciable fraction of $n$, in which case sampling without replacement only makes the bound more favorable).

I will note though that this doesn't make much sense from an actual computer engineering standpoint, for actually copying data (as the comment above notes). You should use checksums or something :)
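For instance, a full comparison can be done by hashing both copies with Python's standard `hashlib` (the file paths here are hypothetical):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large data never sits in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Equal digests mean the copies are, for all practical purposes, identical:
# sha256_of("original.bin") == sha256_of("copy.bin")
```

Unlike sampling, this catches any single-byte difference with near certainty, at the cost of reading both copies in full.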