2
$\begingroup$

This is probably a simple question to most of you but I wasn't seeing a clear solution. I was programming the other day and in one part there is a batch insert into a database of about 30 items at a time. If one member of the batch has an error then the whole batch will fail. I found that about 42% of the batches were indeed failing and wondered how I could use this to detect the actual error rate in the dataset.

I fixed the technical aspects of the problem I was facing but I was wondering how to solve problems like this in general. Abstracted a little it is: if you have sets of x elements (taken at random from the whole dataset) and you know the percent of these sets that have at least one element with a certain property, how can you find roughly the percent of elements in the whole dataset that have that property?

1 Answers 1

1

Under certain assumptions, which may or may not be realistic, we can solve the problem. Assume independence, that is, that success/failure of elements of the database are independent events, like the results of tossing a fair die. Let $p$ be the (unknown) probability that one individual item fails.

Suppose that the probability that a batch of $k$ items fails is $f$. We express $f$ in terms of $p$, and then solve for $p$.

The probability $f$ that the batch fails is $1$ minus the probability that all the items in the batch are OK. So $f=1-(1-p)^k.$ We solve for $p$. Manipulation gives $1-f=(1-p)^k.$ Take the logarithms of both sides, to any base you like. We will use base $10$. We obtain $\log(1-f)=k\log(1-p),$ and therefore $\log(1-p)=\frac{\log(1-f)}{k}.$ Raise $10$ to the power of each side. Then $10^{\log(1-p)}=1-p=10^{\frac{\log(1-f)}{k}}.$ It follows that $p=1-10^{\frac{\log(1-f)}{k}}.$

Equivalently, we could use the natural logarithm $\ln$. If we do that, we get $p=1-e^{\frac{\ln(1-f)}{k}}.$

With $f=0.42$, and $k=30$, I get that $p$ is approximately $0.018$.

  • 0
    thanks. I had a feeling it wasn't too hard but thanks for showing me anyway.2012-02-20