2

I have a vector X of 50 real numbers and a vector Y of 50 real numbers.

I want to model them as

y = ax + b

How do I determine a and b so that they minimize the sum of the squared errors over this training set?

That is, given

X = (x1,x2,...,x50)
Y = (y1,y2,...,y50)

What is the closed form for

a = ???
b = ???

See also: https://codereview.stackexchange.com/questions/10122/c-correlation-leastsquarescoefs

2 Answers

2

Let $\overline{x} = (x_1+\cdots+x_n)/n$ and $\overline{y}=(y_1+\cdots+y_n)/n$ be the averages of the $x$- and $y$-values. The least squares line will always pass through the point $(\overline{x},\overline{y})$.
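To see why, here is a short derivation. The name $S(a,b)$ for the sum of squared errors is notation introduced for this sketch, not taken from the question. Setting the partial derivative with respect to $b$ to zero gives $$ S(a,b)=\sum_{i=1}^{n} (y_i - a x_i - b)^2, \qquad \frac{\partial S}{\partial b} = -2\sum_{i=1}^{n} (y_i - a x_i - b) = 0 \;\Longrightarrow\; \overline{y} = a\,\overline{x} + b, $$ where the last step divides the sum by $n$. So whatever the slope $a$ turns out to be, the point $(\overline{x},\overline{y})$ satisfies $y=ax+b$.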

The remaining problem is: what is the slope?

Let $s_x=\sqrt{(1/n)\sum_{i=1}^{n} (x_i-\overline{x})^2}$ and $s_y=\sqrt{(1/n)\sum_{i=1}^{n} (y_i-\overline{y})^2}$ be the standard deviations of the $x$- and $y$-values (here $n=50$). You'll sometimes find it said that you should divide by $n-1$ rather than $n$, but that makes no difference here: we'll be dividing $s_y$ by $s_x$, so the factor $1/n$ or $1/(n-1)$ cancels.
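If you want to convince yourself that the choice of divisor cancels, a quick numerical check along these lines should do it. This is only a sketch: the data values are made up for illustration, and it relies on Python's standard `statistics` module, where `pstdev` divides by $n$ and `stdev` divides by $n-1$.

```python
import math
import statistics

# Made-up sample data, purely for illustration.
xs = [1.0, 2.0, 4.0, 7.0, 11.0]
ys = [2.5, 3.1, 6.0, 9.2, 15.8]

# Population standard deviations (divide by n).
ratio_n = statistics.pstdev(ys) / statistics.pstdev(xs)

# Sample standard deviations (divide by n - 1).
ratio_n_minus_1 = statistics.stdev(ys) / statistics.stdev(xs)

# The two ratios agree up to floating-point rounding.
same_ratio = math.isclose(ratio_n, ratio_n_minus_1, rel_tol=1e-12)
print(same_ratio)
```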

How many standard deviations is $x$ away from the average $x$-value $\overline{x}$? The answer is $\dfrac{x-\overline{x}}{s_x}$. How many standard deviations should the corresponding $y$-value be from the average $y$-value? The answer comes from multiplying the fraction above by the correlation $\rho$, which is $$ \rho = \frac{\sum_{i=1}^{n} (x_i-\overline{x})(y_i-\overline{y})}{\sqrt{\left(\sum_{i=1}^{n} (x_i-\overline{x})^2\right)\left(\sum_{i=1}^{n} (y_i-\overline{y})^2\right)}}. $$ Thus the predicted $y$-value should be $\rho\dfrac{x - \overline{x}}{s_x}$ standard deviations above the average $\overline{y}$. One standard deviation is $s_y$, so multiply that by the foregoing, getting $\rho s_y\dfrac{x-\overline{x}}{s_x}$.

Bottom line (no pun intended....): The line is: $$ y - \overline{y} = \rho \frac{s_y}{s_x} (x-\overline{x}). $$

If you want it in the form $y=ax+b$, then this says: $$ y = \left(\rho \frac{s_y}{s_x}\right) x + \left( \overline{y} - \rho\frac{s_y}{s_x}\overline{x}\right). $$ In other words, $a=$ the first expression in parentheses above, and $b=$ the second.
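The formulas for $a$ and $b$ above translate fairly directly into code. The following is a minimal sketch in Python, not a reference implementation: the function name `fit_line_via_correlation` and the handling of degenerate inputs (all $x$ equal, or all $y$ equal) are my own assumptions.

```python
import math


def fit_line_via_correlation(xs, ys):
    """Return (a, b) for y = a*x + b, using a = rho * s_y / s_x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")

    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Sums of squared deviations and of cross-deviations.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    if sxx == 0:
        raise ValueError("all x-values are equal, so the slope is undefined")
    if syy == 0:
        # Assumption: treat constant y as a horizontal line (rho is undefined).
        return 0.0, y_bar

    s_x = math.sqrt(sxx / n)           # standard deviation of x (divide by n)
    s_y = math.sqrt(syy / n)           # standard deviation of y (divide by n)
    rho = sxy / math.sqrt(sxx * syy)   # correlation

    a = rho * s_y / s_x
    b = y_bar - a * x_bar
    return a, b


# Points lying exactly on y = 2x + 1.
a, b = fit_line_via_correlation([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(a, b)
```

Algebraically, $\rho\, s_y/s_x$ simplifies to $\sum (x_i-\overline{x})(y_i-\overline{y}) \big/ \sum (x_i-\overline{x})^2$, so the square roots could be skipped entirely. The sketch keeps them only to mirror the formula as written.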

  • 0
    I've written code for this at http://codereview.stackexchange.com/questions/10122/c-correlation-leastsquarescoefs. Could you check it, please? (2012-03-18)
1

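Let $A=\sum{x_i^2}, B=\sum{x_i}, C=\sum{x_i y_i}, D=\sum{y_i}$, where each sum runs over $i=1,\dots,n$ and $n$ is the number of points (here $n=50$).

Setting the partial derivatives of $\sum (y_i - a x_i - b)^2$ with respect to $a$ and $b$ to zero gives the two linear equations $aA + bB = C$ and $aB + bn = D$. Solving them, $b = {AD - BC \over nA - B^2}$, and $a={C - bB \over A}$. The denominator $nA - B^2$ is nonzero as long as the $x_i$ are not all equal.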

This should work, although I can't guarantee that it's the fastest way to do the calculations.
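For what it's worth, the sums $A$, $B$, $C$, $D$ and the two formulas above can be sketched in Python as follows. The function name `fit_line_via_sums` and the error handling are assumptions of this sketch, not part of the formulas.

```python
def fit_line_via_sums(xs, ys):
    """Return (a, b) for y = a*x + b from the sums A, B, C, D."""
    n = len(xs)
    if n != len(ys) or n == 0:
        raise ValueError("need two equal-length, non-empty sequences")

    A = sum(x * x for x in xs)              # sum of x_i^2
    B = sum(xs)                             # sum of x_i
    C = sum(x * y for x, y in zip(xs, ys))  # sum of x_i * y_i
    D = sum(ys)                             # sum of y_i

    denom = n * A - B * B
    if denom == 0:
        # Happens exactly when all x-values are equal.
        raise ValueError("all x-values are equal, so the line is undefined")

    b = (A * D - B * C) / denom
    a = (C - b * B) / A   # A > 0 here, since denom != 0 rules out all-zero x
    return a, b


# Points lying exactly on y = 2x + 1.
a, b = fit_line_via_sums([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(a, b)
```

One caveat on the design: with floating-point data whose $x$-values are large compared with their spread, $nA - B^2$ subtracts two nearly equal numbers and can lose precision. Centering the data first (subtracting $\overline{x}$ and $\overline{y}$) is usually the more numerically robust route.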

  • 0
    I've written code for this at http://codereview.stackexchange.com/questions/10122/c-correlation-leastsquarescoefs. Could you check it, please? (2012-03-18)