6

Say you have two i.i.d. random variables $x,y\sim U[0,1]^k$, i.e. uniform over the $k$-dimensional unit cube. What's the expected value of the Euclidean distance between them, normalized by the maximum distance possible, i.e. $\sqrt k$?

For $k=1$, I worked out that this is $1/3$. For $k=100$, Monte Carlo simulations tell me it's a little over $0.4$.

I tried to work out the math, but $$\frac{1}{\sqrt{k}}\int_{S_x} \int_{S_y} \sqrt{(x-y)^T (x-y)}\;dx\;dy,$$ where $S_x=S_y=[0,1]^k$, is beyond me for general $k$.
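A minimal sketch of such a simulation in R (the sample sizes here are arbitrary choices):

```r
# Monte Carlo estimate of E[ ||x - y|| / sqrt(k) ] for x, y iid uniform on [0,1]^k
k <- 100
n <- 100000                                        # number of simulated pairs
d <- replicate(n, sqrt(sum((runif(k) - runif(k))^2)))
mean(d) / sqrt(k)                                  # comes out a little over 0.4
```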

2 Answers

7

The asymptotic value is $1/\sqrt6\approx0.40824829$.

To see this, consider i.i.d. random variables $X_i$ and $Y_i$ uniform on $[0,1]$ and write the quantity to be computed as $I_k=\mathrm E\left(\sqrt{Z_k}\right)$ with $$ Z_k=\frac1k\sum_{i=1}^k(X_i-Y_i)^2. $$ By the strong law of large numbers for i.i.d. bounded random variables, when $k\to\infty$, $Z_k\to z$ almost surely and in every $L^p$, with $z=\mathrm E(Z_1)$. In particular, $I_k\to \sqrt{z}$. Numerically, $$ z=\iint_{[0,1]^2}(x-y)^2\,\mathrm{d}x\,\mathrm{d}y=2\int_0^1x^2\,\mathrm{d}x-2\left(\int_0^1x\,\mathrm{d}x\right)^2=\frac23-\frac12=\frac16. $$
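A quick numerical check of both $z=1/6$ and the convergence of $I_k$ to $1/\sqrt6$, sketched in R (the sample sizes are arbitrary):

```r
# First check z = E(Z_1) = 1/6, then watch I_k = E(sqrt(Z_k)) approach 1/sqrt(6)
mean((runif(1e6) - runif(1e6))^2)                  # ~ 0.1667 = 1/6
for (k in c(1, 10, 100, 1000)) {
  Ik <- mean(replicate(2000, sqrt(mean((runif(k) - runif(k))^2))))
  cat("k =", k, " I_k ~", round(Ik, 4), "\n")      # 1/3 at k = 1, tends to 0.4082
}
```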


Edit (This answers a different question asked by the OP in the comments.)

Consider the maximum of $n\gg1$ independent copies of $kZ_k$ with $k\gg1$ and call $M_{n,k}$ its square root. A heuristic to estimate the typical behaviour of $M_{n,k}$ is as follows.

By the central limit theorem (and in a loose sense), $Z_k\approx z+N\sqrt{v/k}$, where $v$ is the variance of $Z_1$ and $N$ is a standard gaussian random variable. In particular, for every given nonnegative $s$, $$ \mathrm P\left(Z_k\ge z+s\right)\approx\mathrm P\left(N\ge s\sqrt{k/v}\right). $$ Furthermore, the typical size of $M_{n,k}^2/k$ is $z+s$ where $s$ solves $\mathrm P(Z_k\ge z+s)\approx1/n$. Choose $q(n)$ such that $\mathrm P(N\ge q(n))=1/n$, that is, $q(n)$ is the $(1-1/n)$-quantile of the standard gaussian distribution. Then the typical size of $M_{n,k}^2$ is $k(z+s)$ where $s$ solves $ks^2/v=q(n)^2$. Finally, $$ M_{n,k}\approx \sqrt{kz+q(n)\sqrt{kv}}. $$ Numerically, $z=1/6$, $v=7/180$, and you are interested in $k=1000$. For $n=10{,}000$, $q(n)=3.719$ yields a typical size $M_{n,k}\approx13.78$, and for $n=100{,}000$, $q(n)=4.265$ hence $M_{n,k}\approx13.90$ (these should be compared to the values you observed).
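The heuristic is straightforward to evaluate numerically, for instance with the following R sketch (using `qnorm` for $q(n)$ and the constants $z$ and $v$ derived above):

```r
# Typical size of M_{n,k} ~ sqrt(k*z + q(n)*sqrt(k*v)), with z = 1/6, v = 7/180
z <- 1/6; v <- 7/180; k <- 1000
for (n in c(1e4, 1e5)) {
  q <- qnorm(1 - 1/n)            # q(n): P(N >= q(n)) = 1/n for standard gaussian N
  cat("n =", n, " M ~", round(sqrt(k*z + q*sqrt(k*v)), 2), "\n")  # 13.78 and 13.90
}
```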

To make such estimates rigorous and to understand why, in a way, $M_{n,k}$ concentrates around the typical value computed above, see here.

  • 0
    Thanks! This makes perfect sense. One question about the Monte Carlo experiments: there's a fairly big difference in the expected value when I fix the maximum possible distance to the theoretical $\sqrt{k}$ and when I take the actual maximum observed in the samples. The latter gives a sample expectation of about 0.8. I know that observing $\sqrt{k}$ in samples is near impossible. But will the Monte Carlo estimate converge almost surely to $1/\sqrt{6}$ for large enough $n$? (2011-09-13)
  • 0
    Wait, you simulate (pairs of) points in the unit cube and not their distances, don't you? So I am not sure why you would have to fix a max distance. (2011-09-13)
  • 0
    Let me elaborate: if $X$ and $Y$ are independent and uniform on $[0,1]$, neither $|X-Y|$ nor $(X-Y)^2$ is uniform on $[0,1]$. So if you simulate either one of these directly as uniform (instead of simulating $X$ uniform, then $Y$ uniform, then combining them), the result has little to do with $I_k$. (2011-09-13)
  • 0
    I'm simulating as you describe it (in R: `mx <- 0; for (i in 1:10000) { v <- nrm(runif(1000), runif(1000)); if (v > mx) mx <- v }`, where `nrm` computes the Euclidean distance and `mx` stores the maximum). But with 10000 samples, the maximum distance I get with $k=1000$ is 13.8. With 100000 samples, it ekes out a bit more, 14.0. Theoretically, the maximum distance possible is $\sqrt{1000}\approx 31.6$, and with an infinite number of samples I guess I would get very close to this value. And the expected value of the pairwise distance would be $1/\sqrt{6}$. But not with the sample maximum. (2011-09-14)
  • 0
    Still no idea why the maximum should be related to the sample mean (it is not), nor why you care. For the behaviour of the maximum (which is an altogether different question), see the Edit above. (2011-09-14)
  • 1
    @JasonMond: What is your concern with the maximum? And, by the way, a **much** more efficient way to generate your sample in R is to do: `v <- replicate(10000, sqrt(sum((runif(1000) - runif(1000))^2)))`. (2011-09-14)
  • 0
    @Didier I saw this problem in a CS class where the instructor was demonstrating counter-intuitive behaviors in high dimensions. And I couldn't remember if he said he normalized by the sample maximum distance or the theoretical maximum distance. I just assumed that with a large enough sample, the sample maximum and the theoretical maximum would be very close. You showed me it's not. I'll have to go through your added solution several times to fully grasp it. But I do have a rough idea of what you are saying. Thanks! (2011-09-14)
1

This probably isn't anything nice for general $k$. You're asking for $$ E \sqrt{ \frac1k\sum_{i=1}^k (X_i-Y_i)^2 } $$ where the $X_i$ and $Y_i$ are independent uniform$(0,1)$ random variables. Let $Z_i = (X_i-Y_i)^2$; then you're looking for $$ E \sqrt{ \frac1k\sum_{i=1}^k Z_i }, $$ and let's think about what happens when $k$ gets large. By the law of large numbers, $\frac1k\sum_{i=1}^k Z_i$ will be concentrated around $E(Z_1)$. Therefore, as $k$ gets large, I expect the answer to approach $\sqrt{E(Z_1)}$, which is $\sqrt{1/6}$.
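A short R experiment (with arbitrary sample sizes) illustrates this concentration:

```r
# (Z_1 + ... + Z_k)/k concentrates around E(Z_1) = 1/6 as k grows
for (k in c(10, 100, 1000)) {
  zbar <- replicate(5000, mean((runif(k) - runif(k))^2))
  cat("k =", k, " mean =", round(mean(zbar), 4),
      " sd =", round(sd(zbar), 4), "\n")           # sd shrinks like sqrt(v/k)
}
```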