3
$\begingroup$

We are given positive integers $n$ and $M$, with $n \ll M$, and a sequence $(a_i)_{i=1}^n$ of distinct integers satisfying $0 \le a_i < M$. Let $k$ be the smallest positive integer such that all the $a_i$ are distinct $\mod k$.

Questions: With $n$ and $M$ fixed, what is the expected value of $k$? And what is its maximum possible value?

Motivation: The $a_i$ are breakpoint addresses in a computer program. The number of breakpoints $n$ will typically be less than 50, and the maximum address $M$ will typically be less than half a million. Given an address, I need to be able to find the breakpoint at that address very quickly. One solution is simply a table of pointers, with $M$ entries, but that is very wasteful of space $-$ most of the entries would be zero. So I would like to use a table of size $k$, if I can be sure of finding a suitable $k$ reasonably quickly.

Once $M$ is large enough, its actual order of magnitude won't affect the expected value very much. In other words, as $M \rightarrow \infty$ with $n$ fixed, the expected value of $k$ tends to a limit. But it will certainly affect the worst case. For instance, given any $m$, we can set $a_i = i \times m!$ to ensure that $k$ is greater than $m$.

I have run some simulations. With $n=50$, and $M=2^{30}$, a run of $100000$ trials gives a mean of about $307$, with a worst case of $k=557$. With $n=100$, the mean is about $972$ and the worst case is $1570$.

  • 0
    "1570" should be at the end of the last sentence, but on my screen it is displayed three lines above that. I've noticed similar problems before with LaTeX. Does anybody else have this problem?2011-11-20
  • 0
    TonyK: Yes, I've seen this problem often here; when something in LaTeX is supposed to be at the end of a line, it gets displayed above the beginning of that line or an earlier line. (In this particular case it looks fine for me because on my browser the last line is "is $1570$".)2011-11-20
  • 3
    What you really want is a [perfect hash function](http://en.wikipedia.org/wiki/Perfect_hash_function), right?2011-11-20
  • 0
    @Henning: Yes, but the lookup operation must be very fast, not just $O(1)$ (in my case it's $i$ mod $k$, a single machine instruction). Thanks for the link!2011-11-20
  • 1
    I think I have a formula for the expectation, as a function of $n$; it's a complicated infinite sum, but converges fast enough to be calculated. It equals about 305 for $n=50$ and about 970 for $n=100$. Do you want to run some simulations for other values of $n$? My formula predicts about 71 for $n=20$, about 598 for $n=75$, and about 1941 for $n=150$.2011-11-22
  • 0
    @Greg: $n=20: 71.76$. $n=75: 598.9$. $n=150: 1943.5.$ But the variance is large, so even after 100000 trials, the mean hasn't settled down. Did you sort out your problem with $E(n)$ for general composite $n$?2011-11-22
  • 0
    More or less. The details are hard to get right, but the match to actual data is promising. (And once the formula is written down, analyzing its size as a function of $n$ is yet another story....) I'll keep working on it and let you know what comes of it.2011-11-22
  • 0
    Ok, I'm starting a new answer (this seems slightly better than re-editing my existing answer).2011-11-28

3 Answers 3

0

Assuming that the $a_i$ are randomly distributed in ${0,...,k-1}$ and that for two different k this distributions are

independent then the problem to found the probability that all $a_i$ are different is the birthday problem. I

implemented in maxima the the probability

f(n,k):=combination(k,n)*n!/k^n

that all $n$ numbers randomly chosen from ${0,...,k-1}$ (this needs the load(functs); before executiong this definition)

are different and also an approximation

p(n,k):=exp((-n^2)/(2*k))

of $f$.

I estimate the mean by

ll:1;
m:0;
n:50;
L:800;
for k:n thru L do (
  q:float(f(n,k)),
  m:m+ll*q*k,
  ll:ll*(1-q)
);
m;

and get $305.2774133346098$. Using the approximation $g$ instead of $f$ I get the smaller value $293.0324415803322$. For n=100 I evaluate the for-loop until L:2000 to get $970.2315452158832$ (using f) and $944.4295352864864$ (using p) for

the mean.

For $n=1000$ I only eveluate the approximation until L:86000 and get $55595.16420043662$

To estimate the maximum I calculated

ll:1;
m:0;  
n:50;
L:675;
for k:n thru 675 do (
  q:float(f(n,k)),
  m:m+ll*q,
  ll:ll*(1-q)
);
m;

This is the probability that $k$ is between $50$ and $675$. The result is $0.99999999999998$. So the probability that k is larger than $675$ is very small. For $n=100$ and $n=1000$ I calculated values about L:1900 and L:86000 for the upper bound containing most of the $k$.

2

Some thoughts:

Your example $a_i = im!$ gives a $k$ whose size is about $(\log(M/n))/\log\log(M/n)$, since that's how large one can choose $m$ before $nm!$ exceeds $M$. One can get a worse $k$ as well, by choosing $a_1=0$, $a_2$ to be the product of all primes up to about $\log M$, $a_3$ to be the product of all primes between $\log M$ and $2\log M$, and so on: then $k$ is at least $(n-1)\log M$ asymptotically. (This is a slight lie, as one needs to put the prime powers in there somewhere too. The asymptotic should still be correct.) Of course this doesn't even take advantage of primes dividing $a_3-a_2$ and so on.

When $M$ is much larger than a prime $p$, the probability that the $a_i$ are distinct modulo $p$ is essentially $(p/p)((p-1)/p)\cdots((p-n+1)/p) = p!/((p-n)!p^n)$; if $p=\lambda n$, then by Stirling's formula this is about $c(\lambda)^n$ where $c(\lambda) = (\lambda/(\lambda-1))^{\lambda-1}/e < 1$. This equals $1/2$ when $\lambda$ is about $n/(2\log 2)$, or $p\approx n^2/(2\log 2)$. But the question of finding even the median (never mind the expectation) of the least $p$ for which they are distinct modulo $p$ seems more daunting. (At least if $M$ is sufficiently large, one should be able to consider these modulo-$p$ events essentially independent.)

[EDITED TO ADD:] I was thinking so much about the asymptotics of the smallest such $k$ that I didn't realize I could at least write down an expression for the expectation of that $k$. Let $D(p) = p!/((p-n)!p^n)$ be the probability that the $a_i$ are distinct modulo $p$; we define $D(p) = 0$ if $p

1

We choose $n$ integers independently uniformly at random from $\{1,2,\dots,M\}$ where $M$ is very large. (The original problem specified that they must be distinct integers, but that probably won't affect the final answer, so let's ignore it.) Let ${\mathbf C}(k)$ denote the event that the $n$ integers are not distinct modulo $k$. Define$$ F(k) = \Pr\big( {\mathbf C}(k) \ \big|\ {\mathbf C}(1) \wedge \cdots \wedge {\mathbf C}(k-1) \big) $$ and note that $1-F(k)$ is the probability that $k$ is the least integer modulo which the $n$ integers are distinct, conditional on knowing that they are not distinct modulo any of $1,2,\dots,k-1$. Note also that$$ F(1) F(2) \cdots F(k-1) = \Pr \big( {\mathbf C}(1) \wedge \cdots \wedge {\mathbf C}(k-1) \big), $$ as can be shown by expanding the definition of conditional probability and telescoping the resulting product. Therefore the expectation of the least $k$ for which the $n$ integers are distinct is given by the expression$$\mu= \sum_{k=1}^\infty k \Pr \big( {\mathbf C}(1) \wedge \cdots \wedge {\mathbf C}(k-1) \big) \big( 1 - \Pr\big( {\mathbf C}(k) \ \big|\ {\mathbf C}(1) \wedge \cdots \wedge {\mathbf C}(k-1) \big) \big), $$ or equivalently$$\mu=\sum_{k=1}^\infty \bigg( k (1-F(k)) \prod_{1\le j\le k-1} F(j) \bigg). $$ Therefore it suffices to be able to calculate $F(k)$. All this has been probability so far; now we need to input some number theory.

Define $c(k) = \Pr({\mathbf C}(k))$. It is not hard to show that$$ c(k) = 1 - \frac kk \frac{k-1}k \cdots \frac{k-n+1}k = 1-\frac{k!}{k^k(k-n)!} $$ (where the last expression must be interpreted appropriately when $k

Proving this statement would be the most technical part of the derivation, I think. [Intuitively, the reason this should be true is: when $k$ and $l$ are relatively prime, then the events ${\mathbf C}(k)$ and ${\mathbf C}(l)$ will turn out to be independent (up to a vanishingly small error as $M$ tends to infinity). So the only events ${\mathbf C}(l)$ that directly influence ${\mathbf C}(k)$ are when $l$ divides $k$. Furthermore, the event ${\mathbf C}(k)$ is a subset of ${\mathbf C}(l)$ whenever $l$ divides $k$. Therefore we only need to consider the maximal prime-power divisors of $k$, and these events are all independent.]

The definition of $f(k)$ can be coded easily, and when substituted into the second equation for $\mu$ above (replacing $F$ by $f$), it leads to the values I mentioned in my comments on the original post.