I have a set of elements $S$ and a set of "constraints" $C$, where $\forall c \in C, c \subseteq S$. The sets in $C$ may overlap, e.g. $C$ could be $\{ \{0, 1, 2\}, \{1, 2, 5, 7\}, \{2, 5\}, \{9\}, ...\}$.
I want to choose a subset $T \subseteq S$ containing exactly $N$ elements, uniformly at random among all such subsets, subject to the additional condition that $T$ must be a superset of at least one $c \in C$.
I can do this with rejection sampling: draw $N$ elements from $S$, check if any $c \in C$ is a subset, repeat if not. However, I want to use this in a computer program and this is far too time consuming when these sets have hundreds of elements.
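For concreteness, the rejection-sampling baseline can be sketched as follows (a minimal Python sketch; the function and argument names are my own):

```python
import random

def rejection_sample(S, C, N, rng=random):
    """Repeatedly draw a uniform N-element subset of S until it
    contains (is a superset of) at least one constraint in C."""
    S = list(S)
    constraints = [frozenset(c) for c in C]
    while True:
        T = frozenset(rng.sample(S, N))
        if any(c <= T for c in constraints):
            return T
```

The expected number of draws is the reciprocal of the acceptance probability, which is what makes this slow when a uniform sample is unlikely to contain any constraint.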
I'd like to know if there's a faster method which follows the same distribution. For example, we could select some $c \in C$ and pad it with uniformly chosen elements up to size $N$, which guarantees that our condition is satisfied. Unfortunately I haven't managed to figure out the probabilities to use when selecting $c$: since constraints may overlap, a padded sample can satisfy more than one constraint, so naive weighting over-counts such samples.
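The pad-a-constraint mechanism itself is easy to write down; the unknown part is what weight to give each constraint (and, because of overlaps, a correct method may also need to reject or reweight some padded samples). A hypothetical sketch with the weights left as an input:

```python
import random

def pad_sample(S, C, N, weights, rng=random):
    """Pick a constraint c with the given selection probabilities,
    then pad it with uniformly chosen extra elements up to size N.
    Note: uniform padding with naive weights does NOT reproduce the
    rejection-sampling distribution when constraints overlap; choosing
    the right weights is exactly the open question."""
    constraints = [frozenset(c) for c in C]
    c = rng.choices(constraints, weights=weights)[0]
    rest = [s for s in S if s not in c]
    return c | frozenset(rng.sample(rest, N - len(c)))
```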
I'd appreciate it if someone could either help with these probabilities, or suggest some other efficient method, suitable for a program to use, which gives the same distribution.
We can assume that we're not asking for too many elements (i.e. $N \leq |S|$) or too few (i.e. there is some $c \in C$ where $|c| \leq N$).
In fact, it seems that any $c \in C$ with $|c| \gt N$ makes no difference to the rejection sampling: such a constraint can never be a subset of an $N$-element sample, so it can never cause a sample to be accepted. Hence, if it helps, we can assume that such constraints have been removed from $C$ before we begin.
Similarly, it seems (correct me if I'm wrong) that if some constraint $c \in C$ is a subset of some other constraint $d \in C$ then we don't need to consider $d$: any sample containing $d$ necessarily contains $c$, so $d$'s effect on the rejection sampling is subsumed by that of $c$. Such redundant constraints can likewise be removed before we begin.
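Both pruning steps above (dropping constraints larger than $N$, and dropping constraints that contain another constraint) can be done up front; a small sketch, with names of my own choosing:

```python
def preprocess(C, N):
    """Drop constraints larger than N, then drop any constraint that is a
    superset of another remaining constraint (it is redundant)."""
    # Deduplicate and sort by size so smaller constraints are kept first.
    cs = sorted({frozenset(c) for c in C if len(c) <= N}, key=len)
    kept = []
    for c in cs:
        # If some already-kept constraint is a subset of c, c is redundant.
        if not any(k <= c for k in kept):
            kept.append(c)
    return kept
```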
Some things I've considered:
- If $C$ contains a singleton set $\{s\}$ for each $s \in S$, then nothing will get rejected and we can just sample uniformly.
- If the smallest constraints in $C$ have size $N$, we can choose between them uniformly to get our sample.
- Constraints with few elements should be more likely to appear in our sample than those with many elements.
- The chance of some $c \in C$ being sampled uniformly (i.e. ignoring rejections) seems to be ${{|S| - |c|}\choose{N - |c|}}/{{|S|}\choose{N}}$ since that's the chance of choosing the remaining elements as our "padding".
- Working with relative frequencies (integer counts) of constraints might be more robust than computing with raw probabilities, which could be lost to floating-point rounding errors.
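For example, the per-constraint inclusion probability from the list above can be kept exact by working with integer binomial coefficients and rationals instead of floats; a sketch:

```python
from fractions import Fraction
from math import comb

def inclusion_prob(S_size, c_size, N):
    """Exact probability that a uniform N-subset of an S_size-element set
    contains a fixed subset of size c_size: C(|S|-|c|, N-|c|) / C(|S|, N)."""
    return Fraction(comb(S_size - c_size, N - c_size), comb(S_size, N))
```

As a sanity check, with $|S| = 4$, $|c| = 1$, $N = 2$ this gives $3/6 = 1/2$, matching the chance that a random pair contains a fixed element.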