
Select $n$ numbers from the set $\{1,2,\ldots,U\}$. Let $y_i$ be the $i$th number selected and $x_i$ the rank of $y_i$ among the $n$ numbers, where the rank is the position of a number after the $n$ numbers are sorted in ascending order.

This gives $n$ data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, and a best-fit line for them can be found by linear regression. The correlation coefficient $r_{xy}$ measures the goodness of fit of that line. I want to calculate $E(r_{xy})$ or $E(r_{xy}^2)$ (the expected coefficient of determination).

  • 0
    It seems unlikely that there is a nice formula for $E(r_{xy})$. For instance, when $U=6$ and $n=3$ my calculations give ${3\over 10}+{9\sqrt{21}\over 140} +{2\sqrt{39}\over 65} + {\sqrt{7}\over 28}+{\sqrt{57}\over 76}$. It seems that the average correlation is quite large, usually more than $9/10$. 2011-04-13
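The value quoted for $U=6$, $n=3$ can be checked by brute force. What follows is only a minimal Python sketch, assuming the $n$ numbers are distinct (drawn without replacement) and every $n$-subset is equally likely; the function names `pearson_r` and `expected_r` are made up for illustration. Since the order of selection only permutes the pairs $(x_i, y_i)$, it is enough to average $r_{xy}$ over sorted subsets.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def expected_r(U, n):
    """Average r_xy over all n-subsets of {1, ..., U}.

    Assumption: distinct draws, all subsets equally likely, and n >= 2.
    """
    ranks = range(1, n + 1)
    # combinations() yields each subset in ascending order, so the k-th
    # element of a subset has rank k.
    values = [pearson_r(ranks, subset)
              for subset in combinations(range(1, U + 1), n)]
    return sum(values) / len(values)


print(expected_r(6, 3))
```

For $U=6$, $n=3$ this averages over 20 subsets and gives approximately $0.98$, in agreement with the closed-form sum above. The enumeration grows as $\binom{U}{n}$, so it is only practical for small cases.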

1 Answer

Just a hint, a probability approach:

One can compute $P(x \mid y)$: given the value $y \in \{1, \ldots, u\}$ of an extracted number (writing $u$ for $U$), the probability that its rank among the $n$ numbers is $x \in \{1, \ldots, n\}$.

$\displaystyle P(x \mid y) = \frac{ {y - 1 \choose x - 1} {u - y \choose n-x} }{ {u-1 \choose n-1} } , \; \; \max(1,\, n-u+y) \le x \le \min(n,\, y) $
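To make the formula concrete, here is a small Python sketch of it, assuming the $n$ numbers are distinct (sampling without replacement), which is what the binomial coefficients count. The name `rank_pmf` is hypothetical, chosen only for this example.

```python
from math import comb


def rank_pmf(x, y, u, n):
    """P(x | y): probability that the value y has rank x among n distinct
    numbers drawn from {1, ..., u}, given that y is one of them."""
    if not (1 <= n <= u):
        raise ValueError("need 1 <= n <= u")
    if not (1 <= x <= n and 1 <= y <= u):
        return 0.0
    # Choose x-1 smaller values from the y-1 below y, and n-x larger values
    # from the u-y above y. math.comb(a, b) is 0 when b > a, which covers
    # ranks outside the feasible range.
    return comb(y - 1, x - 1) * comb(u - y, n - x) / comb(u - 1, n - 1)


# Each conditional distribution should sum to 1 over x = 1..n.
u, n = 6, 3
for y in range(1, u + 1):
    print(y, [rank_pmf(x, y, u, n) for x in range(1, n + 1)])
```

With $u=6$, $n=3$, for example, $y=3$ has rank 2 with probability $\binom{2}{1}\binom{3}{1}/\binom{5}{2} = 0.6$.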

From this one can compute, formally or numerically (analytically, I doubt it), $E(x \mid y) = \sum_{x} x \, P(x \mid y)$.

Then we can compute $E(x \, y) = E_y \bigl( y \; E_x ( x \mid y ) \bigr)$,

and $\mathrm{Cov}(x, y) = E(x \, y) - E(x) E(y) = E(x \, y) - \frac{n+1}{2} \cdot \frac{u+1}{2}$.
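The steps above can be carried out numerically. The following Python sketch is one possible way to do it, not a definitive implementation. It assumes sampling without replacement and that $y$ is uniform on $\{1,\ldots,u\}$ and $x$ is uniform on $\{1,\ldots,n\}$ marginally (consistent with the means $\frac{u+1}{2}$ and $\frac{n+1}{2}$ used above), so that the variances are $\frac{u^2-1}{12}$ and $\frac{n^2-1}{12}$. The function names are hypothetical, and the division by the standard deviations to get a correlation is an extra step not spelled out in the answer.

```python
from fractions import Fraction
from math import comb, sqrt


def rank_pmf(x, y, u, n):
    """P(x | y) as an exact fraction, for 1 <= x <= n <= u and 1 <= y <= u."""
    return Fraction(comb(y - 1, x - 1) * comb(u - y, n - x),
                    comb(u - 1, n - 1))


def population_cov_and_corr(u, n):
    """Return (Cov(x, y), correlation) for rank x and value y."""
    e_xy = Fraction(0)
    for y in range(1, u + 1):
        # E(x | y) = sum over x of x * P(x | y)
        e_x_given_y = sum(x * rank_pmf(x, y, u, n) for x in range(1, n + 1))
        # Assumption: y is uniform on 1..u, so each y has weight 1/u.
        e_xy += Fraction(y, u) * e_x_given_y
    cov = e_xy - Fraction(n + 1, 2) * Fraction(u + 1, 2)
    # Assumption: variances of discrete uniform variables on 1..n and 1..u.
    var_x = Fraction(n * n - 1, 12)
    var_y = Fraction(u * u - 1, 12)
    return cov, float(cov) / sqrt(float(var_x * var_y))


cov, rho = population_cov_and_corr(6, 3)
print(cov, rho)
```

For $u=6$, $n=3$ this gives a covariance of $7/6$ and a correlation of about $0.84$. Exact fractions are used for the expectations so that rounding only enters in the final square root.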

  • 0
    I calculated $r_{xy}$ using the formula above, which gives the population correlation coefficient, and I also computed the sample correlation coefficient from randomly generated data. It seems the population correlation coefficient is always smaller than the sample correlation coefficient. 2011-05-17