
A vertical regression is defined as the following:

$m=\frac{\sum_{i=1}^n (x_i-\operatorname{average}(x))(y_i-\operatorname{average}(y))}{\sum_{i=1}^n (x_i-\operatorname{average}(x))^2}$

whereas a horizontal regression is defined as

$m=\frac{\sum_{i=1}^n (y_i-\operatorname{average}(y))^2}{\sum_{i=1}^n (x_i-\operatorname{average}(x))(y_i-\operatorname{average}(y))}$
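For concreteness, here is a minimal numpy sketch of the two slope formulas (the helper names `vertical_fit` and `horizontal_fit` are mine, not standard terminology); both lines are taken to pass through the point $(\operatorname{average}(x), \operatorname{average}(y))$:

```python
import numpy as np

def vertical_fit(x, y):
    """Slope and intercept of the vertical regression: m = Sxy / Sxx."""
    dx, dy = x - x.mean(), y - y.mean()
    m = np.sum(dx * dy) / np.sum(dx ** 2)
    return m, y.mean() - m * x.mean()   # line through the mean point

def horizontal_fit(x, y):
    """Slope and intercept of the horizontal regression: m = Syy / Sxy."""
    dx, dy = x - x.mean(), y - y.mean()
    m = np.sum(dy ** 2) / np.sum(dx * dy)
    return m, y.mean() - m * x.mean()
```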

Several math books say that you should use the horizontal regression if you want to calculate x values for given y values, and the vertical regression formula if you want to find the corresponding y values for given x values.

But how can it be that the function I get from the vertical regression formula is more accurate than the horizontal one for x on y?

For example:

$ x := \{1,2,3,4,5,6,7,8\}$

$ y := \{0.3,0.5,0.7,1,1,1.5,1.6,2.1\}$

That gives me the vertical function $f(x)=0.24404x-0.0107$ and the horizontal function $f(x)=0.25256x-0.04902$.

If I calculate the least-squares sum (x on y):

$\sum_{i=0}^7 (x_i-f(y_i))^2$

I get 181.33... for the vertical one but 183.14... for the horizontal one.
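For what it's worth, these numbers can be reproduced with a short, self-contained numpy sketch (variable names are my own):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([0.3, 0.5, 0.7, 1, 1, 1.5, 1.6, 2.1])
dx, dy = x - x.mean(), y - y.mean()

# Vertical fit m = Sxy/Sxx and horizontal fit m = Syy/Sxy, both through the mean point
mv = np.sum(dx * dy) / np.sum(dx ** 2); bv = y.mean() - mv * x.mean()  # ~0.24405, -0.0107
mh = np.sum(dy ** 2) / np.sum(dx * dy); bh = y.mean() - mh * x.mean()  # ~0.25256, -0.04902

# The sum from the question plugs y_i into f itself, i.e. x_i - f(y_i)
print(np.sum((x - (mv * y + bv)) ** 2))  # ~181.33 (vertical)
print(np.sum((x - (mh * y + bh)) ** 2))  # ~183.14 (horizontal)
```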

Why, in general, is the horizontal regression associated with "x on y" if the vertical one is obviously more accurate?

Thanks a lot in advance!

  • I went over my work several times, and I've also double-checked everything with Mathcad; the results are still the same. The example above gives exactly the results I stated (181.33 for the vertical, 183.14 for the horizontal). (2012-06-03)

2 Answers


I see the problem. You've computed linear fits of the form $y=f(x)$, and then you're comparing $x$ with $f(y)$! But of course, if $y=f(x)$, then $x$ is $f^{-1}(y)$, not $f(y)$. This is why your sums of squared residuals are so large for such small inputs. If you want to compare residuals in $x$, you need to compute $\sum \left(x_i - f^{-1}(y_i)\right)^2$ instead.

The inverse functions of the "vertical" and "horizontal" regressions are $x \approx f^{-1}(y)=4.09756y+0.0439024$ and $x \approx f^{-1}(y)=3.95944y+0.194109$ respectively. The respective sums of squared residuals in $x$ are $1.46513$ and $1.41574$. As you can see, the horizontal regression does better.
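In case it helps, here is a quick numerical check of those residual sums, plugging in the inverse lines quoted above (a self-contained sketch, not part of the original computation):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([0.3, 0.5, 0.7, 1, 1, 1.5, 1.6, 2.1])

# Residuals in x against the inverse lines x ~ f^{-1}(y)
print(np.sum((x - (4.09756 * y + 0.0439024)) ** 2))  # ~1.46513 (vertical fit)
print(np.sum((x - (3.95944 * y + 0.194109)) ** 2))   # ~1.41574 (horizontal fit)
```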


The idea is that one variable (usually $x$) is a value you input (maybe the temperature or concentration of something) which you know accurately. The other one (usually $y$) is less well known, as it is something you measure with errors. The fit then assumes the $x$ values are exact and minimizes the sum of the squares of the errors in $y$. If your data fit a straight line well, the difference will be small; if not, it will be large. This is the origin of the correlation coefficient equation.

  • Then there is the regression that minimizes the sum of the squares of the distances to the regression line itself. This is independent of the coordinate system, and can be done with a few additional statistics. (2012-07-03)
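As a rough illustration of that coordinate-free fit (often called orthogonal or total least squares regression), here is a sketch using the principal axis of the centered data, applied to the example data from the question; this is my own addition, not part of the original answer:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([0.3, 0.5, 0.7, 1, 1, 1.5, 1.6, 2.1])

# Orthogonal (total least squares) fit: the line through the mean point along the
# direction of maximum variance, which minimizes perpendicular distances to the line.
centered = np.column_stack([x - x.mean(), y - y.mean()])
_, _, vt = np.linalg.svd(centered)      # vt[0] is the principal axis
m = vt[0, 1] / vt[0, 0]
b = y.mean() - m * x.mean()
print(m, b)
```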