2
$\begingroup$

I want to sample a combination of $N$ elements (without replacement) from a list of $M$ elements where $M\gg N$. There are algorithms to do this when each element is picked with uniform probability. I want to do the same for the non-uniform case.

Let each specific element $i$ is associated with a positive constant $p_i$. Then, say for $N=2$, I want the probability of sampling the combination $\{i,j\}$ to be equal to $\frac{p_i+p_j}{Z}$, where $Z$ is the normalization constant, i.e. $Z=\sum_{k,l}p_k+p_l\;\;\forall k,l$.

Any hints, pointers for an efficient and unbiased sampling algorithm are much appreciated.

  • 0
    OK, I slightly changed the wording of the question. I hope it removes the confusion.2012-07-27

2 Answers 2

2

In general, you want to pick $M$ distinct elements where the probability of result $A$ (a subset of $\{1,\ldots,N\}$ is $p(A) = \sum_{i \in A} p_i/Z$, $Z = \sum_A \sum_{i \in A} p_i = {N-1 \choose M-1} \sum_i p_i$. Let $S(A) = \sum_{i \in A} p_i$ and $S = \sum_{i=1}^N p_i$. Let ${\cal P}_k(B)$ denote the collection of all $k$-element subsets of $B$.

The probability of choosing item number $1$ is $\eqalign{P(1) &= \sum_{A \in {\cal P}_{M-1}(\{2,\ldots,N\})} p(\{1\} \cup A) = \sum_{A \in {\cal P}_{M-1}(\{2,\ldots,N\})} \dfrac{p_1 + S(A)}{Z}\cr &= \dfrac{p_1}{S} + \sum_{A \in {\cal P}_{M-1}(\{2,\ldots,N\})} \sum_{j \in A} \dfrac{p_j}{{{N-1} \choose {M-1}} S} = \dfrac{p_1}{S} + \sum_{j=2}^N \dfrac{{{N-2} \choose {M-2}} p_j}{{{N-1} \choose {M-1}} S}\cr &= \dfrac{p_1}{S} + \dfrac{M-1}{N-1} \dfrac{S - p_1}{S} = \dfrac{N-M}{N-1} \dfrac{p_1}{S} + \dfrac{M-1}{N-1} } $ You can use a sequential procedure: first decide (using this probability) whether or not to choose item $1$. Depending on whether you choose $1$ or not, you have $M-1$ or $M$ items to be chosen from the remaining $N-1$. The conditional probability of choosing $2$ given this first choice is then $ \dfrac{\sum_{A \in {\cal P}_{M-2}(\{3,\ldots,N\})} p(\{1,2\} \cup A)}{P(1)} \ \text{or} \ \dfrac{\sum_{A \in {\cal P}_{M-1}(\{3,\ldots,N\})} p(\{2\} \cup A)}{1 - P(1)}$ which can be obtained by a similar calculation. After deciding whether or not to choose $2$ using this conditional probability, you calculate the probability for $3$, and continue in this way until all $M$ items are chosen.

  • 0
    @EmreA: in the calculation of $Z$, you need to count the $M$-element sets $A$ that contain a given element $i$. Such a set is the union of $\{i\}$ and an arbitrary set of $M-1$ elements chosen from the $N-1$ elements that are not $i$.2012-07-30
0

The normalization constant $Z=(M-1)\sum p_i$ as each element is involved in $M-1$ pairs. The probability that item $i$ is chosen is then $P(i)=\frac 1Z \sum_{k \neq i}(p_k+p_i)=\frac 1Z ((M-1)p_i+\frac Z{M-1}-p_i)=\frac 1{M-1}+\frac{(M-2)p_i}Z$ but as mjqxxxx pointed out, you can't just choose elements individually with this probability as the pairs will not be chosen with proper probability. I don't see anything better than listing all the pairs, calculate the probability of each, and pick on that basis. True, this is an M^2 process.

  • 0
    @EmreA: you can do the same analysis for larger $N$. I am sure that it will look similar-that you have to list all the N-tuples, calculate their weights, and pick from the large list. It will be an $M^N$ process done that way.2012-07-27