Suppose I have a line segment of length $L$. I now select two points at random along the segment. What is the expected value of the distance between the two points, and why?
Average Distance Between Random Points on a Line Segment
- 
10$L/3$, by symmetry. – 2012-09-13
- 
2Care to elaborate, @Byron? – 2012-09-13
- 
0@David I've added more explanation in my answer below. – 2012-09-13
- 
0Can anyone explain how by symmetry we are getting the value as @user940 has stated? – 2018-12-04
10 Answers
Byron has already answered your question, but I will attempt to provide a detailed solution...
Let $X$ be a random variable uniformly distributed over $[0,L]$, i.e., the probability density function of $X$ is the following
$$f_X (x) = \begin{cases} \frac{1}{L} & \textrm{if} \quad{} x \in [0,L]\\ 0 & \textrm{otherwise}\end{cases}$$
Let us randomly pick two points in $[0,L]$ independently. Let us denote those by $X_1$ and $X_2$, which are random variables distributed according to $f_X$. The distance between the two points is a new random variable
$$Y = |X_1 - X_2|$$
Hence, we would like to find the expected value $\mathbb{E}(Y) = \mathbb{E}( |X_1 - X_2| )$. Let us introduce function $g$
$$g (x_1,x_2) = |x_1 - x_2| = \begin{cases} x_1 - x_2 & \textrm{if} \quad{} x_1 \geq x_2\\ x_2 - x_1 & \textrm{if} \quad{} x_2 \geq x_1\end{cases}$$
Since the two points are picked independently, the joint probability density function is the product of the pdf's of $X_1$ and $X_2$, i.e., $f_{X_1 X_2} (x_1, x_2) = f_{X_1} (x_1) f_{X_2} (x_2) = 1 / L^2$ in $[0,L] \times [0,L]$. Therefore, the expected value $\mathbb{E}(Y) = \mathbb{E}(g(X_1,X_2))$ is given by
$$\begin{align} \mathbb{E}(Y) &= \displaystyle\int_{0}^L\int_{0}^L g(x_1,x_2) \, f_{X_1 X_2} (x_1, x_2) \,d x_1 \, d x_2\\[6pt] &= \frac{1}{L^2} \int_0^L\int_0^L |x_1 - x_2| \,d x_1 \, d x_2\\[6pt] &= \frac{1}{L^2} \int_0^L\int_0^{x_1} (x_1 - x_2) \,d x_2 \, d x_1 + \frac{1}{L^2} \int_0^L\int_{x_1}^L (x_2 - x_1) \,d x_2 \, d x_1\\[6pt] &= \frac{L^3}{6 L^2} + \frac{L^3}{6 L^2} = \frac{L}{3}\end{align}$$
- 
2Was this problem ever studied with N randomly placed points? – 2017-04-10
- 
0@xyz Yes. It's $\frac{L}{N^2 - 1}$ Check out this answer https://mathoverflow.net/questions/1294/mean-minimum-distance-for-n-random-points-on-a-one-dimensional-line – 2018-06-18
Sorry. I posted a cryptic comment just before running off to class. What I meant was that if $X,Y$ are independent uniform $(0,1)$ random variables, then the triple $$(A,B,C):=(\min(X,Y),\ \max(X,Y)-\min(X,Y),\ 1-\max(X,Y))$$ is an exchangeable sequence. In particular, $\mathbb{E}(A)=\mathbb{E}(B)=\mathbb{E}(C),$ and since $A+B+C=1$ identically we must have $\mathbb{E}(B)=\mathbb{E}(\mbox{distance})={1\over 3}.$
Intuitively, the "average" configuration of two random points on a interval 
looks like this:   
- 
1What I like the most about this answer is that it can be used to prove the case where we have $n$ points and not only two points $x$ and $y$. Or am I missing something ? – 2013-09-11
Without loss of generality, let the interval length be L and the sought average difference D.
The answer is seen to be D=L/3 by the following simple argument: Select 3 random values a,b,c on the interval of length L. The probability that c is between a and b equals 1/3 since the other equiprobable alternatives are that a or b are in the middle. Since the interval length is L, the probability must also be D/L, thus D=L/3.
Let $X_1, X_2$ be independent uniformly distributed on $[0,1]$.
Assume $X_1=x_1$. Then $P(X_2
- 
0In the fourth sentence ("Moreover..") there are two places where on the RHS you wrote $x_2$ when you meant $x_1$, if I am not mistaken. – 2012-09-13
- 
0Yes, I did. Thanks – 2012-09-13
Let $X_1$ and $X_2$ be independent identically distributed random variables, with $f_X(x) = [0 For simplicity assume $L=1$. Therefore $|X_1-X_2| \stackrel{d}{=} |X_1+X_2-1|$. Random variable $D = X_1+X_2-1$ follows symmetric triangular distribution on $(-1,1)$, being a special case of Irwin-Hall distribution. We immediately have:
$$
   f_{|D|}(\ell) = 2 (1-\ell)[0<\ell<1]
$$
Immediately yielding the expectation:
$$
    \mathbb{E}(|D|) = \int_0^1  2 \ell(1-\ell) \mathrm{d} \ell = \frac{1}{3}
$$
- 
2What does $\buildrel d\over=$ mean? – 2012-09-13
- 
1@MJD Symbol $\stackrel{d}{=}$ stands for "[equality in distribution](http://en.wikipedia.org/wiki/Random_variable#Equality_in_distribution)". – 2012-09-13
Byron's answer is short and elegant.
Here's a geometric/algebraic derivation: Let $X$ and $Y$ be two independent uniform variates in $[0,1]$. Then 
$$p\left(\{|X-Y|>s\}\right) = (1-s)^2 $$
as can be seen by viewing $(X,Y)$ as a uniform variate in $[0,1]^2$, where the set $\{|X-Y|>s\}$ occupies the top left and bottom right triangles of the square box $[0,1]^2$. See figure (a). Now we have $p\left(\{|X-Y|
Sidenote: The discrete case is analogous. Let $(X,Y)$ be independent uniform variates in $\left\{1,2,\dots,N\right\}^2$. Then $$ p\left(|X-Y|=s\right) = 2(N-s), \quad s>0$$ as can be seen from mlohbiehler's plot. (Note that the equation is not valid for $n=0$.) The expectation is $$\mathbb E |X-Y| = {1\over N^2}\sum_{s=1}^N s\,2\,(N-s) = \frac{N^2-1}{3 N} .$$ Asymptotically for $N\to\infty$ this is ${N\over3}$.
You can picture this problem from a discrete approach, then extend it to the real number line.
For the first L natural numbers, the difference between any two of them may range from 1 through L-1. Exactly N pairs of numbers from our set will be L-N units apart. Taking that into account we sum:
Sum of distances = sum (0 < x < L, x*(L-x))
Which turns out to be (L-1)(L)(L+1)/6
We then divide by combinations of L in 2 and get the average distance (L+1)/3, which becomes L/3 as we allow for infinitely many more numbers in the [0;L] interval.
I get a close but different result when I approach this with brute force and ignorance. Consider an example: if my random numbers are between 5 and 14 (L=10) and uniformly distributed, I can expect after sufficient iteration to have equally observed each number for both values. Thus I can simply consider a matrix of the differences between each value, and calculate the average of the sum of the values in the matrix.
(Please excuse my formatting. I'm new here.)
     5  6  7  8  9 10 11 12 13 14
    -----------------------------
 5 | 0  1  2  3  4  5  6  7  8  9
 6 | 1  0  1  2  3  4  5  6  7  8
 7 | 2  1  0  1  2  3  4  5  6  7
 8 | 3  2  1  0  1  2  3  4  5  6
 9 | 4  3  2  1  0  1  2  3  4  5
10 | 5  4  3  2  1  0  1  2  3  4
11 | 6  5  4  3  2  1  0  1  2  3
12 | 7  6  5  4  3  2  1  0  1  2
13 | 8  7  6  5  4  3  2  1  0  1
14 | 9  8  7  6  5  4  3  2  1  0
The sum of the differences in this matrix is 330, the count 100, giving an average of 3.3, not 10/3 as expected in other answers.
I don't see anything wrong with this except that i can't see anything wrong with the other answers. Can anyone explain the difference?
- 
1$10/3=3.333\ldots$ – 2015-08-18
- 
0Which is not 3.3. Like I said, close but different. – 2015-08-18
- 
1I think you're just seeing the difference between discrete and continuous. I suspect if you increased $L$ without bound, the average would indeed approach $L/3$. – 2015-08-18
- 
1Your approach is generally correct, but the result is not absolutely accurate. In your experiment you broke the line into 10 parts. What if it was only 2 parts? Or only 1 part? (with 1 part your answer will be 0 - very inaccurate). Breaking into two parts will give you 0.25. Breaking into 10 parts gives 0.33. Why do you think 0.33 is correct, why not 0.25? Of course, more parts you take, more accurate your answer will be. – 2015-11-21
- 
0@lesnik, mcmeyer provided the answer below, referencing Byron's work. In discrete cases such as mine and your examples, the answer is what it is, and it is correct. With only one part (i.e. a point) the distance between points will always be zero, so the answer of 0 will in fact be quite accurate. However, as N approaches infinity, the average approaches N/3. – 2015-11-23
Differences represented by a line from 0-15
I took the brute force approach as noted above and thought about letting the size of the line (and hence the square with the differences) range from 0 to infinity. Taking the total absolute differences for squares ranging from 0 to n then the total number of integer points that can be chosen is n+1. The total count of the number of differences calculated is (n+1)^2. The total of all the differences calculated is twice the triangular numbers which is 2(n(n+1)(n+2))/6. Therefore the average distance between two points is 2(n(n+1)(n+2))/6 all divided by (n+1)^2. To normalise this to the equivalent of a unit line we divide again by (n+1).
For the 0-15 table as above n=15 and the average distance is 0.33203 for n=20 the average distance is 0.33258 and for n=25 the distance is 0.33284. Clearly getting closer to 1/3.
If we let n get larger towards infinity then what happens to 2(n(n+1)(n+2))/6(n+1)(n+1)^2?
This becomes 2/6 or 1/3.
Hopefully this makes sense and is another way to look at the problem.
R
The pdf of the distance is highest closer to zero. This means that distance is more likely to be close to zero than any other positive value. This seems to be puzzling. Put another way, let us divide the segment into equal sized bucket, say 100 buckets .01 length each and number 0 to 99 with bucket i from i*.01 to (i+1)*.01. If you do the experiment for a large number of pairs of points and divide them into buckets, then bucket 0 will contain the largest number of samples. This seems counter intuitive at first. But it makes sense, since after selecting a point closer to the edge makes it twice more likely to select a second point closer to the first point than one farther from it. The closer point can be on either side of the original point but the farther point can only be on one side.
