9
$\begingroup$

I have a set of numbers where I am randomly and independently selecting elements within a set . After a number of these random element selections I want to know the coverage of the elements in the set. Coverage being how many elements from the set have been selected at least once divided by the total number of elements in the set.

To restate this: what is the probability distribution of the different coverage values on a set after $X$ randomly, independently selected elements of the set?

3 Answers 3

12

If there are $n$ elements of the set then the probability that $M=m$ have been selected after a sample of $x$ (with replacement) is

$$\frac{S_2(x,m) \; n!}{n^x \; (n-m)!} $$

where $S_2(x,m)$ is a Stirling number of the second kind.

The expected value of $M$ is: $n \left(1- \left(1-\dfrac{1}{n}\right)^x \right)$.

The variance is: $n\left(1-\dfrac{1}{n}\right)^x + n^2 \left(1-\dfrac{1}{n}\right)\left(1-\dfrac{2}{n}\right)^x - n^2\left(1-\dfrac{1}{n}\right)^{2x}. $

  • 1
    I think you need to swap $m$ and $x$ in your expression for the probability. Does that affect the expected value and variance calculations, too?2011-04-13
  • 0
    @Mike Spivey: you are right about the probability. $m$ should not appear in the expressions for mean and variance2011-04-13
  • 0
    @Henry This is a nice solution!2011-04-13
  • 0
    I cannot validate the solution, but seeing as how Mike is a stone's throw away from my hometown of Enumclaw and he votes yes on the answer, I'm going to have accept the answer.2011-04-13
  • 5
    @Ross: The argument Henry is using (I think) is as follows: The number of ways to choose which $m$ elements are to be covered is $\binom{n}{m}$. Then the number of ways to have the $x$ elements in the sample chosen only from those $m$ elements is the same as the number of ways to distribute $x$ elements into $m$ distinguishable nonempty subsets; i.e., $m! S(x,m)$, which is a Stirling number of the second kind. Then the probability is obtained by dividing by the number of ways to choose $x$ elements with replacement from $n$ elements, which is $n^x$. The factor of $m!$ cancels.2011-04-13
  • 1
    @Ross: There are a handful of us from the Pacific NW on this site. :)2011-04-13
  • 0
    I am probably severely misunderstanding something here, but if I set `x = m = n` then don’t I compute a probability ~1? Does that make any sense? Why would the probability of choosing 32 unique items from a set of 32 unique items when 32 items are chosen with replacement, be approximately equal to 1? I’m using the approximation `S2(x,m) ≈ m^x/m!`.2018-07-19
  • 1
    @ShelbyMooreIII If $x=m=n$ then you get $\frac{S_2(n,n) \; n!}{n^n \; (n-n)!}=\frac{n!}{n^n}$ as the probability that you select the $n$ different values in the first $n$ attempts. A direct calculation would give $\frac{n}{n} \times \frac{n-1}{n} \times \cdots \times \frac{2}{n} \times \frac{1}{n}$. This is not $1$ for $n\gt 1$: e.g. for $n=2$ it is $0.5$ and for $n=6$ it is about $0.0154321$2018-07-19
  • 0
    Ah I see my error was not checking that the [asymptotic approximation](https://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind#Asymptotic_approximation) is as _x_ is much greater than _m_. I was coming from [an answer](https://math.stackexchange.com/a/1957820/570197) based on your answer and failed to check the definition of the approximation at Wikipedia. Apology for wasting your time.2018-07-19
5

The expected proportion of elements covered, $E\left(\frac{m}{n}\right)$, has a simple limiting form as $n \rightarrow \infty$ with the sampling rate $ r / n $ fixed. Note that $\lim_{n \rightarrow \infty} \left(1-\frac{1}{n}\right)^n = e^{-1}$, and rewrite:

$$\lim_{n \rightarrow \infty} E\left(\frac{m}{n}\right) = 1 - e^{-\frac{r}{n}}$$

so that for example sampling $r=n$ times is expected to cover about 63% of the set. This is a reasonable approximation even for $n > 100$.

0

Derivation of $\operatorname E [M]$ with a classic use of indicator variables and total expectation:

We sample from set $X = \{1, \dots, n \}$. Let $X_i$ be $1$ if member $i$ is in our sample, $0$ otherwise.

For a sample of size $x$, the probability that none of the values are $i$ is $\left(\frac{n-1}{n}\right)^x$. Thus the probability that the sample includes $i$, and therefore $X_i = 1$, is $1-\left(\frac{n-1}{n}\right)^x$.

We have $$\operatorname E[M] = \operatorname E[X_1 + \dots + X_n] = \operatorname E[X_1] + \dots + \operatorname E[X_n] = n \left(1-\left(\frac{n-1}{n}\right)^x\right)$$

Variance is calculated similarly, but we have to consider separately $\operatorname E[X_i^2] $ and $\operatorname E[X_i X_j]$ for $ i \ne j$.