A set of 1000 data items is stored redundantly in a database with 3 copies existing (therefore there are 3000 entries in the database). During a break-in, 100 random data entries are maliciously modified by inverting the letters and numbers. What is the likelihood that the retrieval attempts for the first 50 data items (each retrieval attempt retrieving all copies of a data item) result in three uncorrupted data copies (i.e. in the situation where all three copies have not been modified) ?

I have identified this as a hypergeometric distribution.
Total entries = 3000
Corrupted entries = 100
Uncorrupted entries = 2900
P(all 50 data items have 3 uncorrupted copies) = (2900 c 150) / (3000 c 150)

Is my approach correct? If not, please help me find the solution to the problem.

  • Thank you for your careful preliminary analysis before asking your Question. (2017-02-18)

1 Answer

Using your hypergeometric model you have 100 corrupt items, 2900 good ones, you sample 150, and want the probability of $X = 0$ corrupt items among the 150. As you say, that would be

$$P(X = 0) = \frac{{100 \choose 0}{2900 \choose 150}}{{3000 \choose 150}} = \frac{{2900 \choose 150}}{{3000 \choose 150}} \approx 0.0054.$$

Evaluating the binomial coefficients is a difficulty because factorials of large numbers are involved. In R statistical software care was taken in programming the hypergeometric PDF dhyper to avoid overflowing the capacity of the computer arithmetic, when possible. (The second method below finds the log probability and then exponentiates that, just for additional safety.)

dhyper(0, 100, 2900, 150)
## 0.005417104
exp(dhyper(0, 100, 2900, 150, log=T))
## 0.005417104
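As a rough sanity check on the hypergeometric model, the break-in can also be simulated. The sketch below is only illustrative: it assumes the three copies of the first 50 items sit in entries 1 to 150 (by symmetry, any fixed set of 150 entries would do), and the seed and number of replications are arbitrary choices, not part of the original problem.

```r
# Monte Carlo sketch: corrupt 100 of 3000 entries at random and check
# whether entries 1..150 (assumed to hold the first 50 items) are all clean.
set.seed(2017)                      # arbitrary seed, for reproducibility only
n_rep <- 1e5                        # number of simulated break-ins (arbitrary)
all_clean <- replicate(n_rep, {
  corrupted <- sample(3000, 100)    # modified entries, drawn without replacement
  all(corrupted > 150)              # TRUE if none of the 150 watched entries was hit
})
p_sim   <- mean(all_clean)          # simulated estimate of P(X = 0)
p_exact <- dhyper(0, 100, 2900, 150)
c(simulated = p_sim, exact = p_exact)
```

The simulated proportion should land close to the exact value, up to Monte Carlo error.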

But if you try to compute the binomial coefficients directly, you get overflows indicated as Inf for both numerator and denominator. It would be a rare hand calculator that could handle binomial coefficients with such large numbers.

choose(2900, 250)
## Inf
choose(3000, 250)
## Inf
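When binomial coefficients are too large to evaluate directly, one common workaround is to take the ratio on the log scale. The following is a minimal sketch using base R's lchoose, which returns the logarithm of a binomial coefficient; the variable names are mine.

```r
# Ratio of binomial coefficients computed on the log scale.
# lchoose(n, k) returns log(choose(n, k)), so very large values never appear.
log_p <- lchoose(2900, 150) - lchoose(3000, 150)
p <- exp(log_p)                           # back to the probability scale
p
all.equal(p, dhyper(0, 100, 2900, 150))   # compare with the built-in PDF
```

This should agree with the dhyper result to within floating-point tolerance.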

One might try an approximate Poisson model. The contamination rate for one record is $\lambda_1 = 100/3000$ and the contamination rate for 150 records is $\lambda_{150} = 150\lambda_1 = 5.$ Then, letting $Y \sim \mathsf{Pois}(\lambda_{150}=5),$ you have $P(Y = 0) = e^{-5} \approx 0.0067,$ which is not horribly far from the hypergeometric result.

The approximation works best when 'sampling with replacement' would rarely result in contaminating a record more than once.
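To see the two values side by side, here is a short sketch using dpois and dhyper from base R; the variable names are illustrative.

```r
# Poisson approximation versus the exact hypergeometric probability of
# zero corrupted entries among the 150 retrieved.
lambda <- 150 * (100 / 3000)          # expected corrupted entries among 150
p_pois <- dpois(0, lambda)            # equals exp(-lambda)
p_hyp  <- dhyper(0, 100, 2900, 150)   # exact value under the model
c(poisson = p_pois, hypergeometric = p_hyp, ratio = p_pois / p_hyp)
```

The ratio gives a sense of the relative error of the approximation at $X = 0$.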

Here is a plot of hypergeometric probabilities for contamination counts $0$ through $15,$ along with Poisson approximations. Out of 150 records it is unlikely to see more than a dozen corrupted ones.

[Figure: hypergeometric probabilities for contamination counts 0 through 15, with Poisson approximations]
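The code behind the figure is not shown, so the following is only a guess at how a similar comparison could be produced in R. The plot styling is my assumption; the tail probability uses phyper to quantify "more than a dozen."

```r
# Hypergeometric probabilities and Poisson approximations for counts 0..15.
k    <- 0:15
hyp  <- dhyper(k, 100, 2900, 150)     # exact probabilities
pois <- dpois(k, 5)                   # Poisson approximation, lambda = 5
round(cbind(k, hyp, pois), 4)         # side-by-side table

# P(X > 12): chance of more than a dozen corrupted entries among the 150
p_tail <- phyper(12, 100, 2900, 150, lower.tail = FALSE)
p_tail

# Assumed styling: vertical bars for exact values, red circles for Poisson
plot(k, hyp, type = "h", lwd = 2,
     xlab = "corrupted entries among 150", ylab = "probability")
points(k, pois, col = "red")
```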

Notes: (1) Suppose two copies of a single record agree and the third differs. Then you might be willing to let the majority rule. It seems unlikely that two copies of the same record would be randomly chosen for contamination, and even if so, that they would have been randomly contaminated in exactly the same way. This raises the possibility of a slightly more complicated model, which you might choose to explore.
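One possible way to begin exploring the majority-rule idea in Note (1) is sketched below. It rests on a simplifying assumption of my own: a record counts as recoverable exactly when at most one of its three copies was modified, ignoring the case of two copies being altered identically. The count sums over $j$, the number of the 50 records with exactly one modified copy.

```r
# Sketch: P(each of the 50 records has at most one modified copy).
# Choose j records with exactly one bad copy (3 choices of copy each);
# the other 100 - j modifications fall among the 2850 remaining entries.
j <- 0:50
log_terms <- lchoose(50, j) + j * log(3) +
             lchoose(2850, 100 - j) - lchoose(3000, 100)
p_majority_ok <- sum(exp(log_terms))  # recoverable by majority vote
p_all_clean   <- exp(log_terms[1])    # j = 0 term: all 150 copies untouched
c(all_three_clean = p_all_clean, majority_recoverable = p_majority_ok)
```

The $j = 0$ term reproduces the hypergeometric probability above, which serves as a consistency check on the counting.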

(2) The computation with dhyper in R is something like: $\frac{2850}{3000} \times \frac{2849}{2999} \times \cdots \times \frac{2751}{2901},$ a method called "zippering." In R, prod((2850:2751)/(3000:2901)) returns 0.005417104.
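For readers wondering why a product of 100 fractions matches a ratio involving 150-element samples, the factorials can be regrouped as follows (a worked step, not taken from the R source code):

$$\frac{{2900 \choose 150}}{{3000 \choose 150}} = \frac{2900!}{150!\,2750!} \cdot \frac{150!\,2850!}{3000!} = \frac{2850!/2750!}{3000!/2900!} = \prod_{i=0}^{99} \frac{2850 - i}{3000 - i}.$$

Each factor is a ratio of moderate size, so the running product never overflows.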

(3) For use in future posts on this site, in MathJax you can write ${a \choose b}$ by typing {a \choose b} between dollar signs.