2
$\begingroup$

I want to sample a combination of $N$ elements (without replacement) from a list of $M$ elements where $M\gg N$. There are algorithms to do this when each element is picked with uniform probability. I want to do the same for the non-uniform case.

Let each specific element $i$ is associated with a positive constant $p_i$. Then, say for $N=2$, I want the probability of sampling the combination $\{i,j\}$ to be equal to $\frac{p_i+p_j}{Z}$, where $Z$ is the normalization constant, i.e. $Z=\sum_{k,l}p_k+p_l\;\;\forall k,l$.

Any hints, pointers for an efficient and unbiased sampling algorithm are much appreciated.

  • 1
    Why would the probability of choosing the set $\{i, j\}$ be proportional to $p_i + p_j$? Or are you stipulating that it should be? The usual method would be to keep drawing from the non-uniform distribution until you have drawn $N$ distinct elements. This doesn't yield probabilities proportional to the sums. For instance, consider the case where $p_1$ is about $0.9$, $p_2$ is about $0.09$, and the remaining probabilities are all $\ll 0.01$. Then you're much more likely to draw $\{1,2\}$ than anything else, despite the fact that $p_1+p_2$ is only a few percent greater than $p_1+p_3$.2012-07-27
  • 0
    Yes, I want probabilities proportional to the sums.2012-07-27
  • 0
    Are you sampling "with replacement", i.e. so that the same element might get picked more than once?2012-07-27
  • 0
    No, without replacement.2012-07-27
  • 0
    Note that, if that's the case, then $p_i$ aren't really probabilities, just weights, except in the case $N=1$2012-07-27
  • 0
    That is right. They don't need to be probabilities.2012-07-27
  • 0
    The question is still a little unclear, then. If the probability of choosing $\{i,j\}$ is to be proportional to $p_i + p_j$, then the probability of picking $i$ (i.e., of picking $\{i,j\}$ for some $j\neq i$) is not, in fact, equal to $p_i$.2012-07-27
  • 0
    OK, I slightly changed the wording of the question. I hope it removes the confusion.2012-07-27

2 Answers 2

2

In general, you want to pick $M$ distinct elements where the probability of result $A$ (a subset of $\{1,\ldots,N\}$ is $p(A) = \sum_{i \in A} p_i/Z$, $Z = \sum_A \sum_{i \in A} p_i = {N-1 \choose M-1} \sum_i p_i$. Let $S(A) = \sum_{i \in A} p_i$ and $S = \sum_{i=1}^N p_i$. Let ${\cal P}_k(B)$ denote the collection of all $k$-element subsets of $B$.

The probability of choosing item number $1$ is $$\eqalign{P(1) &= \sum_{A \in {\cal P}_{M-1}(\{2,\ldots,N\})} p(\{1\} \cup A) = \sum_{A \in {\cal P}_{M-1}(\{2,\ldots,N\})} \dfrac{p_1 + S(A)}{Z}\cr &= \dfrac{p_1}{S} + \sum_{A \in {\cal P}_{M-1}(\{2,\ldots,N\})} \sum_{j \in A} \dfrac{p_j}{{{N-1} \choose {M-1}} S} = \dfrac{p_1}{S} + \sum_{j=2}^N \dfrac{{{N-2} \choose {M-2}} p_j}{{{N-1} \choose {M-1}} S}\cr &= \dfrac{p_1}{S} + \dfrac{M-1}{N-1} \dfrac{S - p_1}{S} = \dfrac{N-M}{N-1} \dfrac{p_1}{S} + \dfrac{M-1}{N-1} } $$ You can use a sequential procedure: first decide (using this probability) whether or not to choose item $1$. Depending on whether you choose $1$ or not, you have $M-1$ or $M$ items to be chosen from the remaining $N-1$. The conditional probability of choosing $2$ given this first choice is then $$ \dfrac{\sum_{A \in {\cal P}_{M-2}(\{3,\ldots,N\})} p(\{1,2\} \cup A)}{P(1)} \ \text{or} \ \dfrac{\sum_{A \in {\cal P}_{M-1}(\{3,\ldots,N\})} p(\{2\} \cup A)}{1 - P(1)}$$ which can be obtained by a similar calculation. After deciding whether or not to choose $2$ using this conditional probability, you calculate the probability for $3$, and continue in this way until all $M$ items are chosen.

  • 0
    Beautiful, thanks. Is this a known algorithm?2012-07-27
  • 0
    In the definition of $Z$, why ${N-1 \choose M-1}$ instead of ${N \choose M}$?2012-07-30
  • 0
    @EmreA: in the calculation of $Z$, you need to count the $M$-element sets $A$ that contain a given element $i$. Such a set is the union of $\{i\}$ and an arbitrary set of $M-1$ elements chosen from the $N-1$ elements that are not $i$.2012-07-30
0

The normalization constant $Z=(M-1)\sum p_i$ as each element is involved in $M-1$ pairs. The probability that item $i$ is chosen is then $P(i)=\frac 1Z \sum_{k \neq i}(p_k+p_i)=\frac 1Z ((M-1)p_i+\frac Z{M-1}-p_i)=\frac 1{M-1}+\frac{(M-2)p_i}Z$ but as mjqxxxx pointed out, you can't just choose elements individually with this probability as the pairs will not be chosen with proper probability. I don't see anything better than listing all the pairs, calculate the probability of each, and pick on that basis. True, this is an M^2 process.

  • 0
    $N=2$ was just an example case.2012-07-27
  • 0
    @EmreA: you can do the same analysis for larger $N$. I am sure that it will look similar-that you have to list all the N-tuples, calculate their weights, and pick from the large list. It will be an $M^N$ process done that way.2012-07-27