10
$\begingroup$

Suppose I pick 'N' integers over an interval [A, B] without replacement. As a function of 'N' and the interval length, what distribution / average values should I expect for the distances between nearest-neighbors in a sorted array of the selected integers?

Edit: I apologize, an important note is that the distances between the endpoints and the nearest integers to the endpoints should also be included. This is a bit like dividing a piece of rope into (B - A + 1) segments, cutting at the locations representing the 'N' selected integers, and looking at the distribution of cut rope lengths.

Edit 2: Apparently this question is in desperate need of clarification. Extending the rope example I provided, here's exactly what I'm looking for:

Upon cutting the rope into 'N' pieces, and placing these pieces in a bag, I would very much like the probability, P(k), of randomly selecting a fragment of rope of length 'k' from this bag. Here, the probability of selecting a particular fragment of the rope is independent of its length. The function for P(k) provides what I'd like to know about the distribution of rope lengths after 'N' cuts.

  • 1
    Can you pick the same point twice?2011-04-03
  • 0
    @Henry: That happens w.p. $0$ unless $A = B$.2011-04-03
  • 0
    @Yuval Filmus: "Integers" - so with/without replacement matters2011-04-03
  • 0
    @yuval: the questioner specified that they were picking integers, not reals, so there's a non-zero chance of it happening. In general, I suspect the discrete version of this problem is likely to be much harder than the continuous version.2011-04-03
  • 0
    Dear Henry, I apologize for the delayed response - no, you can only pick a given integer once.2011-04-04
  • 0
    Dear Steven, yes, I should specify that I'm interested in the specified discrete (vs. continuous) version of the problem.2011-04-04
  • 0
    @user8861: if you preface comments that you want to reply to with @, where is at least (I hear) 3 characters of the username, the original commenter will get a message. Most seem to use the whole first name, either out of respect or because they don't know three characters are enough.2011-04-04
  • 0
    Note the connection to this question: http://math.stackexchange.com/questions/25295/simulating-uniformly-on-s1-x-in-mathbbrn-mid-x-1-1/25308#25308. Section 4.3 of Mariano's reference is relevant to the present problem without replacement.2011-04-04
  • 0
    @Ross Do you know where the practical details of this @ thing are explained? Thanks in advance.2011-04-04
  • 1
    @Didier: http://meta.math.stackexchange.com/questions/1694/notifying-users-in-comments2011-04-04
  • 0
    @joriki Thanks. This thread (and the link in the first answer) is exactly what I asked for.2011-04-04
  • 0
    @Didier: You're welcome. Why did you delete your answer?2011-04-04
  • 0
    @joriki Because the original question (which I misread) was entirely different. It happens that I more or less answered to the new version of the question (which is less interesting). I will repost my answer.2011-04-04
  • 2
    @user8861: I think you should clarify your question, and state exactly which parameter you are interested in: minimum distance between two points?, average distance over the whole rope?, average distance over your selected points? none of the above?2011-04-04

5 Answers 5

2

Let $X_{(1)}, \ldots, X_{(N)}$ be the chosen integers in increasing order (the order statistics). For simplicity I'll suppose $A = 1$. Of course we must have $B \ge N$. Then I claim that all the "gaps" $X_{(j+1)} - X_{(j)}$ as well as $B+1 - X_{(N)}$ and $X_{(1)} - 0$ have expected value $(B+1)/(N+1)$.

Note that $E[X_{(1)} | X_{(2)}] = X_{(2)}/2$, because given $X_{(2)} = x$, $X_{(1)}$ is equally likely to be any of the integers 1 to $x-1$. Thus $E[X_{(1)}] = E[X_{(2)} - X_{(1)}]$. Similarly, given $X_{(j)} = x$ and $X_{(j+2)} = y$, $X_{(j+1)}$ is equally likely to be any of the integers $x+1$ to $y-1$, so $E[X_{(j+2)} - X_{(j+1)}] = E[X_{(j+1)} - X_{(j)}]$. Similarly, $E[B+1-X_{(N)}] = E[X_{(N)} - X_{(N-1)}]$. Thus all $N+1$ gaps have the same expected value, and since they add up to $B+1$ that expected value is $(B+1)/(N+1)$.

1

Edit The answer below addresses a different question than the original one. That was a mistake of mine, properly signaled by Matthew in a comment, so I deleted my answer. Later on, the OP added some so-called precisions to the question, which in fact change it completely. As a consequence of this modification of the question, my answer becomes relevant, miraculously (modulo the endpoints thing). Call this a manifestation of prescience if you want, anyway I repost my answer, and this is the end of my interventions on this page.


There are $N-1$ distances between nearest-neighbors amongst $N$ points so the mean distance (averaged over a given sample) is the span of the sample divided by $N-1$. The span is the maximum $M$ of the sample minus the minimum $m$ of the sample. By symmetry, $m$ is distributed like $B+A-M$ hence the mean distance (averaged over the samples) is $$ E(S)=\frac1{N-1}E(M-m)=\frac1{N-1}(2E(M)-(A+B)). $$ For each $n$ such that $N\le n\le B-A$, there are $n!/(n-N)!$ samples such that $M\le A+n$, hence $$ B+1-E(M)=\sum_{n=N}^{B-A}P(M\le A+n)=\frac{(B-A-N)!}{(B-A)!}\sum_{n=N}^{B-A}\frac{n!}{(n-N)!}. $$ Putting all this together should yield $E(S)$.

  • 0
    Should all of your A-B expressions be B-A?2011-04-04
  • 0
    @Matthew You are right. Corrected. Thanks.2011-04-04
  • 0
    @Didier Piau I don't follow your argument. The span divided by N-1 does not seem to give the mean nearest neighbor distance. For instance, if A=1, B=10, and N=4, the sample {1, 2, 9, 10} has mean nearest neighbor distance equal to 1 (since each element is exactly a distance 1 from its nearest neighbor), not 9/3, which is the average "gap". Can you clarify? Thanks.2011-04-04
  • 0
    @Matthew I see... If you don't mind, I will cancel this post.2011-04-04
  • 0
    @Didier Piau I don't mind at all. Cheers.2011-04-04
  • 0
    @Matthew See the modified question. No nearest-neighbor distance at all, in fact... Ironical, isn't it?2011-04-04
1

For some reason, all existing answers to this question, including the accepted one, only discuss either the expected value of the nearest-neighbour distances or the distribution of the shortest nearest-neighbour distance, but not the distribution of the nearest-neighbour distances. This was also the subject of the later question Distribution probability of elements and pair-wise differences in a sorted list, which is very similar to this one, though neither is an exact duplicate of the other.

My answer to that question, adapted to the notation of the present question, gives the probability

$$P(d_i=d)=\frac{\binom{B-A+1-d}{N-1}}{\binom{B-A+1}N}$$

that the $i$-th nearest-neighbour distance $d_i$ is $d$, independent of $i$.

0

I'm not sure if I interpret your question correct, so I tell how I understood you.

You're picking $N$ integers ($X_1,\dots,X_N$) out of the intervall $[A, B]$. Then you obtain w.l.o.g. an ascending sequence of $X$s and are interested in the average distance between two consecutive points, i.e. you want to know

$$ \frac{(X_1-A) + (X_2 - X_1) + (X_3 - X_2) + \dots + (X_n-X_{n-1}) + (B-X_n)}{n+1} = \frac{B-A}{n+1}. $$

The above simplification follows from the fact that you can evaluate the numerator is a telescope sum. The result isn't random at all, as you can see, i.e. the expected value of the average distance between neighbours is simply $\frac{B-A}{n+1}$.

  • 0
    You've considerably simplified the problem by including the endpoints of the interval :-) The question refers to the distances between nearest neighbours of the selected integers, so $X_1-A$ nd $B-X_n$ don't contribute to the sum.2011-04-04
  • 0
    @joriki: I thought that was it that was added by the OP by his edit-paragraph.2011-04-04
  • 0
    @joriki @phimuemue: I would take the OP to be asking for the length of the shortest piece of rope after the cuts. But I don't think the OP was asking average length per piece of $N+1$ pieces of rope adding up to $B-A$.2011-04-04
  • 0
    Sorry, I hadn't noticed the edit.2011-04-04
0

There are $N$ places out of $B - A - 1$ where cuts can be made so we must have $N+1 \ge B - A $ to be able to make any cuts. The shortest distance $s$ between neighbours (i.e. cuts and endpoints) must satisfy $s(N+1) \ge B-A$ so the widest possible shortest distance between neighbours is $\lfloor \frac{B-A}{N+1} \rfloor$.

There are $B-A-1 \choose N$ ways of making the $N$ cuts. If the shortest difference between neighbours is at least $s$, then there are $B-A-1 - (s-1)N \choose N$ ways of making the $N$ cuts. So the probability that the shortest distance between neighbours is exactly $s$ is $$Pr(S=s) = \frac{{B-A-1 - (s-1)N \choose N} - {B-A-1 - sN \choose N} }{B-A-1 \choose N}.$$

The expected shortest distance between neighbours is therefore $$E[S]=\sum_{s=1}^{\lfloor \frac{B-A}{N+1} \rfloor} \frac{{B-A-1 - (s-1)N \choose N} }{B-A-1 \choose N} . $$

As an illustration, if $A=10$, $B=20$ and $N=3$, then the widest possible shortest distance is $2$. There are 20 ways of the shortest distance being 2, and 64 ways of it being 1, out of a total of 84. The expected shortest distance is therefore $\frac{104}{84} \approx 1.238$.