11
$\begingroup$

I have a set of numbers where I am randomly and independently selecting elements within a set . After a number of these random element selections I want to know the coverage of the elements in the set. Coverage being how many elements from the set have been selected at least once divided by the total number of elements in the set.

To restate this: what is the probability distribution of the different coverage values on a set after $X$ randomly, independently selected elements of the set?

3 Answers 3

13

If there are $n$ elements of the set then the probability that $M=m$ have been selected after a sample of $x$ (with replacement) is

$\frac{S_2(x,m) \; n!}{n^x \; (n-m)!} $

where $S_2(x,m)$ is a Stirling number of the second kind.

The expected value of $M$ is: $n \left(1- \left(1-\dfrac{1}{n}\right)^x \right)$.

The variance is: $n\left(1-\dfrac{1}{n}\right)^x + n^2 \left(1-\dfrac{1}{n}\right)\left(1-\dfrac{2}{n}\right)^x - n^2\left(1-\dfrac{1}{n}\right)^{2x}. $

  • 0
    Ah I see my error was not checking that the [asymptotic approximation](https://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind#Asymptotic_approximation) is as _x_ is much greater than _m_. I was coming from [an answer](https://math.stackexchange.com/a/1957820/570197) based on your answer and failed to check the definition of the approximation at Wikipedia. Apology for wasting your time.2018-07-19
5

The expected proportion of elements covered, $E\left(\frac{m}{n}\right)$, has a simple limiting form as $n \rightarrow \infty$ with the sampling rate $ r / n $ fixed. Note that $\lim_{n \rightarrow \infty} \left(1-\frac{1}{n}\right)^n = e^{-1}$, and rewrite:

$\lim_{n \rightarrow \infty} E\left(\frac{m}{n}\right) = 1 - e^{-\frac{r}{n}}$

so that for example sampling $r=n$ times is expected to cover about 63% of the set. This is a reasonable approximation even for $n > 100$.

1

Derivation of $\operatorname E [M]$ with a classic use of indicator variables and total expectation:

We sample from set $X = \{1, \dots, n \}$. Let $X_i$ be $1$ if member $i$ is in our sample, $0$ otherwise.

For a sample of size $x$, the probability that none of the values are $i$ is $\left(\frac{n-1}{n}\right)^x$. Thus the probability that the sample includes $i$, and therefore $X_i = 1$, is $1-\left(\frac{n-1}{n}\right)^x$.

We have $\operatorname E[M] = \operatorname E[X_1 + \dots + X_n] = \operatorname E[X_1] + \dots + \operatorname E[X_n] = n \left(1-\left(\frac{n-1}{n}\right)^x\right)$

Variance is calculated similarly, but we have to consider separately $\operatorname E[X_i^2] $ and $\operatorname E[X_i X_j]$ for $ i \ne j$.