
Select $n$ numbers uniformly at random from the set $\{1,2,\dots,U\}$, with or without replacement. Let $y_i$ be the $i$th number selected, and let $x_i$ be the rank of $y_i$ among the $n$ numbers, i.e., its position after the $n$ numbers are sorted in ascending order.

This gives $n$ data points $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, and a best-fit line for these points can be found by linear regression under the $L_{\infty}$ norm. I want a bound on the objective function.

The objective is to minimize $F(\alpha,\beta)=\max\limits_{1\leq i\leq n} |y_i-\alpha-\beta x_i|$.

Let $F_{\min}$ denote the minimum value of $F(\alpha,\beta)$ attained by this regression.

I want the distribution of $F_{\min}$ over the random selections, or a tail bound $\Pr[|F_{\min}-k|>t]$ for some $k$ and $t$.
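For concreteness, the $L_\infty$ fit itself can be computed exactly as a small linear program: minimize $t$ subject to $|y_i - \alpha - \beta x_i| \le t$. Below is a minimal sketch (the helper name `linf_fit` is illustrative, and SciPy is assumed to be available):

```python
import numpy as np
from scipy.optimize import linprog

def linf_fit(x, y):
    """Minimize max_i |y_i - alpha - beta*x_i| as a linear program.

    Variables are (alpha, beta, t); we minimize t subject to
    alpha + beta*x_i - y_i <= t and y_i - alpha - beta*x_i <= t.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Objective: minimize t (coefficients for alpha, beta, t).
    c = [0.0, 0.0, 1.0]
    # Constraint rows:  alpha + beta*x_i - t <= y_i
    #                  -alpha - beta*x_i - t <= -y_i
    A = np.block([[np.ones((n, 1)), x[:, None], -np.ones((n, 1))],
                  [-np.ones((n, 1)), -x[:, None], -np.ones((n, 1))]])
    b = np.concatenate([y, -y])
    res = linprog(c, A_ub=A, b_ub=b,
                  bounds=[(None, None), (None, None), (0, None)])
    alpha, beta, t = res.x
    return alpha, beta, t

# Example: ranks 1,2,3 with values 0,1,0; the best line is the constant 1/2,
# with maximum absolute error F_min = 1/2.
alpha, beta, F = linf_fit([1, 2, 3], [0, 1, 0])
```

The LP view also makes the structure discussed in the answer below plausible: an optimal basic solution is pinned down by three active constraints, i.e., three data points attaining the maximum error.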

  • Sorry, should be $(k,y)$ and $(k+1,y)$. Got confused as to which was the rank and which the value. (2011-09-15)

1 Answer


Okay, let's work backwards. Suppose we have $\alpha, \beta$ that minimize $\max_i | y_i - \alpha x_i - \beta |$ (here $\alpha$ is the slope and $\beta$ the intercept, and $x_i = i$ since $x_i$ is the rank), and suppose that for this $\alpha, \beta$ the absolute error is maximized at index $i$. We first point out that there must be some other $j \neq i$ such that $|y_j - \alpha x_j - \beta| = |y_i - \alpha x_i - \beta |$; otherwise, we would be free to change $\beta$ to reduce this error (and still have it be the max error). Furthermore, there must be a third index $k \neq i,j$ with the same absolute error, otherwise we could change $\alpha$ or $\beta$ to reduce both of the errors for $i,j$ simultaneously (change $\beta$ if both errors have the same sign, change $\alpha$ if the signs are different). Let $i < j < k$. If we assume that no other indices attain the same absolute error, then the signs of the three errors must alternate: either $+,-,+$ or $-,+,-$. If other indices do attain the same absolute error, then there must be some subsequence of error signs of that type.

Let us assume the $+,-,+$ sequence of error signs for now. Then we have

$ y_i - \alpha i - \beta = \beta + \alpha j - y_j = y_k - \alpha k - \beta $

or

$ \alpha = \frac{y_k - y_i}{k-i} $ and $\beta = \frac{y_i + y_j}{2} - \frac{\alpha (i+j)}{2} = \frac{y_k + y_j}{2} - \frac{\alpha (k+j)}{2} = \frac{(j+k)y_i + (k-i)y_j - (i+j)y_k}{2(k-i)}$

and the maximum error is $F_{\min} \equiv \frac{(k-j)y_i + (i-k)y_j + (j-i)y_k}{2(k-i)}$.

Since this is the maximum error, we also have

$ \beta + \alpha m - F_{\min} \leq y_m \leq \beta + \alpha m + F_{\min}$ for all $m$.

The same formulas hold for the $-,+,-$ sign choice. This represents $(j,y_j)$ lying either above or below the line between $(i,y_i)$ and $(k,y_k)$ and all other points lying in the region defined by that line and the parallel line through $(j,y_j)$.
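As a sanity check, the closed-form expressions above can be evaluated on a concrete triple and tested for the equioscillation property (the numbers below are an arbitrary example, not from the original post):

```python
# Numerical check of the three-point formulas on the triple
# (i, y_i) = (1, 0), (j, y_j) = (2, 1), (k, y_k) = (4, 0).
i, j, k = 1, 2, 4
yi, yj, yk = 0.0, 1.0, 0.0

# alpha, beta, and the (signed) extremal error from the formulas above.
alpha = (yk - yi) / (k - i)
beta = ((j + k) * yi + (k - i) * yj - (i + j) * yk) / (2 * (k - i))
F_min = ((k - j) * yi + (i - k) * yj + (j - i) * yk) / (2 * (k - i))

# The signed errors at i, j, k should have equal magnitude |F_min|
# and alternating signs (here the -,+,- pattern).
errors = [yi - alpha * i - beta,
          yj - alpha * j - beta,
          yk - alpha * k - beta]
```

For this triple, `alpha` is $0$, `beta` is $1/2$, and the three errors come out as $-1/2, +1/2, -1/2$: the middle point lies above the line through the outer two, and all three lie on the boundary of the optimal strip.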

Now, the probability distribution of $F_{\min}$ is not too difficult to set up in an asymptotic limit: $U\rightarrow \infty$, $n/U \rightarrow 0$, drawing with or without replacement.

In either of these cases, we replace $y_i,y_j,y_k$ by $Y_i,Y_j,Y_k = y_i/U, y_j/U, y_k/U$. $F_{\min} = U f(Y_i,Y_j,Y_k)$ occurs with weight $(2 f(Y_i,Y_j,Y_k))^{n-3} * P(Y_i $ is the $i$th smallest draw and $Y_j $ is the $j$th smallest draw and $Y_k $ is the $k$th smallest draw$)$. We then divide by the sum/integral of all of the weights. The CDF of $P()$ is something like $(Y_i)^i (1-Y_k)^{n-k} (Y_j - Y_i)^{j-i} (Y_k - Y_j)^{k-j}$. Not a pretty integral. Furthermore, this ratio is going to give you the joint probability density of $(i,j,k,Y_i,Y_j,Y_k)$, not actually the probability density of $F_{\min}$. To get that, you would need to do another integration over that set of $(i,j,k,Y_i,Y_j,Y_k)$ that gives you a fixed value of $F_{\min}$.
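Rather than evaluating that integral, the distribution of $F_{\min}$ can be estimated directly by Monte Carlo; the sketch below (solver and parameters are illustrative, SciPy assumed) draws without replacement and rescales by $U$ as in the asymptotic argument:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

def linf_fit_value(x, y):
    """Return min over (alpha, beta) of max_i |y_i - alpha - beta*x_i| via an LP."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    A = np.block([[np.ones((n, 1)), x[:, None], -np.ones((n, 1))],
                  [-np.ones((n, 1)), -x[:, None], -np.ones((n, 1))]])
    b = np.concatenate([y, -y])
    res = linprog([0.0, 0.0, 1.0], A_ub=A, b_ub=b,
                  bounds=[(None, None), (None, None), (0, None)])
    return res.x[2]

U, n, trials = 10**6, 20, 200
samples = []
for _ in range(trials):
    # n draws from {1,...,U} without replacement; x is the rank of each draw.
    y = np.sort(rng.choice(U, size=n, replace=False)) + 1
    x = np.arange(1, n + 1)
    samples.append(linf_fit_value(x, y))
samples = np.array(samples)
# Empirical distribution of F_min / U, the scale suggested above.
scaled = samples / U
```

Quantiles of `scaled` then give an empirical stand-in for the density one would otherwise have to extract from the integral over $(i,j,k,Y_i,Y_j,Y_k)$.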

  • Okay. There's a mistake here. The probability that $y_m$ lies in the correct region is not independent of the probability that $y_{m'}$ lies in the correct region. Whoops! Well, this is a starting point at least. Maybe not a very good one, though. (2011-09-15)