
I have limited access to a collection $X_1,\ldots,X_m$ of sets of positive integers. Each $X_i$ is "moderately large" (a brief survey found about $10^6$ elements in each), and $X_i \cap X_j = \emptyset$ unless $i=j$. $m$ is between $50000$ and a few million, so also "moderately large".

I also have limited access to a sequence $Y_1,\ldots,Y_n$ of elements of $X$, where $n$ is fairly small, say $n \le 50$. I would like to find the identification $f : [1..n] \to [1..m]$ such that $Y_i = X_{f(i)}$.

The problem is that I only have sampling access to the $X_i$ and the $Y_j$: I have random variables $x_i$ and $y_j$ on unknown (but hopefully somewhat uniform) measure spaces, taking values in $X_i$ and $Y_j$ respectively.

Sampling $x_i$ and $y_j$ takes a small but non-negligible time, and I wish to compute $f$ as quickly as possible. My plan is to simply keep records of the samples and look for collisions, since $X_i$ and $Y_j$ are either equal or disjoint. I can generate about one thousand samples per second.

The question is: how often should I sample from the $x_i$, and how often from the $y_j$?

I began by sampling only once from each $x_i$, and then sampling the $y_j$ until I found that specific value, but a collision seems very unlikely to happen early (out of $578000$ $y$ samples, I have $532405$ distinct values, none of which are among the specific $x$ values found). On the other hand, sampling often from the $x$ is going to get very expensive simply because there are so many of them and so few of the $y$ in comparison!

How do I quantify this?

  • @D.W.: there are no typos that I can see. $X_i \in X$, not $X_i \subset X$. I think *X* is not used in any material way, though. In your notation, *m* is between 50000 and 1000000, *s* = 1000000000, and *n* is around 50. (2011-11-19)

1 Answer


I'm going to make the simplifying assumption that all of the $X_i$s are of the same size, i.e., $|X_1|=|X_2|= \cdots = |X_m|$. If this assumption is offensive, please update the question to more clearly indicate the various parameters.

Under that assumption, the optimal strategy appears to be that about half of your samples should come from the $X_i$s and about half from the $Y_j$s. In other words, in each step of your algorithm, you flip a fair coin. If it comes up heads, you randomly pick $i$ and then randomly sample a value from $X_i$. If it comes up tails, you randomly pick $j$ and then randomly sample a value from $Y_j$. Keep going until you have recovered the entire function $f$.

(There's an obvious refinement: once you discover that $i=f(j)$, there is no point in sampling from $Y_j$ any further. Thus, when the coin comes up tails, you should actually pick $j$ from among the values whose $f$-value is not yet known.)
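
Here is a minimal sketch of this strategy (including the refinement) in Python, under stated assumptions: `sample_x(i)` and `sample_y(j)` are hypothetical stand-ins for the question's slow sampling access, each returning one random element of $X_i$ or $Y_j$; the rest is bookkeeping with hash tables.

```python
import random

def recover_f(m, n, sample_x, sample_y, p=0.5):
    # Coin-flip sampling strategy. sample_x(i) and sample_y(j) are
    # hypothetical wrappers around the slow sampling access from the
    # question: each call returns one random element of X_i resp. Y_j.
    seen_x = {}                            # sampled value -> index i it came from
    seen_y = {j: set() for j in range(n)}  # j -> values sampled so far from Y_j
    f = {}                                 # recovered part of the map j -> i
    unmatched = list(range(n))
    while unmatched:
        if random.random() < p:            # heads: sample from a random X_i
            i = random.randrange(m)
            v = sample_x(i)
            seen_x[v] = i
            hits = [j for j in unmatched if v in seen_y[j]]
            for j in hits:                 # collision with earlier Y_j samples
                f[j] = i
                unmatched.remove(j)
        else:                              # tails: sample from an unmatched Y_j
            j = random.choice(unmatched)   # refinement: skip matched j
            v = sample_y(j)
            if v in seen_x:                # collision with an earlier X sample
                f[j] = seen_x[v]
                unmatched.remove(j)
            else:
                seen_y[j].add(v)
    return f
```

One practical note: at the scale of the example below, `seen_x` ends up holding a couple of tens of millions of values, so the constant-time lookups of a hash table matter, but so does memory.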

Justification. Suppose $|X_i|=s$ for all $i$. Note that, if you have a list of $\alpha$ random elements from a set $X_i$ and a second list of $\beta$ more random elements from the same set, then the chance that there is no collision (i.e., no entry in the first list that equals some entry in the second list) is about $e^{-\alpha \beta/s}$. If we take $\alpha \beta \approx 10 s$, then a collision is nearly a sure thing.
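
For completeness, here is the back-of-the-envelope step behind that estimate, assuming the samples are roughly uniform on $X_i$ (as the question hopes): each of the $\alpha\beta$ cross-pairs matches with probability about $1/s$, and treating the pairs as roughly independent gives

$$\Pr[\text{no collision}] \approx \left(1-\frac{1}{s}\right)^{\alpha\beta} \approx e^{-\alpha\beta/s},$$

so $\alpha\beta \approx 10s$ makes this about $e^{-10} \approx 0.000045$.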

Now let's suppose we flip a biased coin, with heads probability $p$ and tails probability $1-p$, and then follow the algorithm above: if heads, sample from $X_i$ where $i$ is randomly chosen, else sample from $Y_j$ where $j$ is randomly chosen. Fix a particular pair $I,J$ where $I=f(J)$. After $t$ steps of this algorithm, we can expect to have about $\alpha = tp/m$ elements from $X_I$ and about $\beta = t(1-p)/n$ elements from $Y_J=X_I$. The algorithm will recover the correspondence $I=f(J)$ only if there is a collision between the list of $\alpha$ elements from $X_I$ and the list of $\beta$ elements from $Y_J$. By our assumption, $|X_i|=s$, so we are in the situation of the previous paragraph. As long as $\alpha \beta \approx 10 s$, it is almost a sure thing that there will be some collision, i.e., almost a sure thing that the algorithm will succeed in recovering the relationship $I=f(J)$. (Since there are 50 such relationships, by a union bound, the chance that we fail to recover at least one of them is at most $50 e^{-10} \approx 0.002$. Presumably a failure probability of about $0.002$ is acceptable, especially since in that case you can just keep running the algorithm a little longer. So from here on, I will ignore the failure probability.)

So, we need $\alpha \beta \approx 10 s$, where $\alpha = tp/m$, $\beta = t(1-p)/n$, and $t$ denotes the number of steps taken by the algorithm. After a little bit of algebraic manipulation, we find that the algorithm takes $t \approx \sqrt{10smn/(p(1-p))}$ steps. Here $p$ is a parameter we can choose freely. To make the algorithm as efficient as possible, we want to minimize $\sqrt{10smn/(p(1-p))}$. This quantity is minimized when $p(1-p)$ is maximized, i.e., when $p=1/2$.
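
In case it helps, spelling out the "little bit of algebraic manipulation":

$$\alpha\beta = \frac{tp}{m}\cdot\frac{t(1-p)}{n} = \frac{t^2\,p(1-p)}{mn} \approx 10s \quad\Longrightarrow\quad t \approx \sqrt{\frac{10\,smn}{p(1-p)}}.$$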

Example. For example, suppose $m = 10^6$, $|X_i| = 10^6$, and $n = 50$. Then the analysis above suggests that $t \approx 45$ million samples should be more than enough. If you can generate one thousand samples per second, then you should be able to generate that many samples in around 12 hours or so.

Sanity check: with these parameters, after the algorithm finishes, you'll have about 22 samples from each $X_i$ and about 450,000 samples from each $Y_j$. The chance of missing a collision between $X_i$ and $Y_j$, in a case where $i=f(j)$, is about $e^{-22 \times 450000/10^6} \approx 0.00005$ (for any one fixed choice of $i,j$). This should be adequate to recover the entire function $f$ with high probability.
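
Both the example and the sanity check are easy to reproduce numerically. A quick check in plain Python (parameter names match the ones above):

```python
from math import sqrt, exp

s, m, n, p = 10**6, 10**6, 50, 0.5        # sizes from the example above
t = sqrt(10 * s * m * n / (p * (1 - p)))  # total number of samples needed
print(f"t ~ {t:.3g} samples")             # ~4.47e7, i.e. ~45 million
print(f"~{t / 1000 / 3600:.1f} hours at 1000 samples/sec")  # ~12.4 hours

alpha = t * p / m                         # samples landing in each X_i (~22)
beta = t * (1 - p) / n                    # samples landing in each Y_j (~450,000)
print(f"miss probability per matched pair: {exp(-alpha * beta / s):.1e}")  # ~4.5e-05
```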

  • For those of us with weak backgrounds in probability, the $e^{-\alpha\beta/s}$ calculation is done very nicely at the undergrad level in http://books.google.com/books?id=XLY9AnfDhsYC&pg=PA229 (2011-11-19)