
I found a derivation of Normal Equation without the use of calculus from Wikipedia (Linear Least Squares). This method rewrites the Residual Sum Squared Error:

$$S(\beta)=y^{T}y-2\beta^{T} X^{T}y+\beta^{T}X^{T}X\beta$$

into:

$$S(\beta)=\left \langle \beta, \beta \right \rangle-2\left \langle \beta, (X^{T}X)^{-1}X^{T}y \right \rangle+\left \langle (X^{T}X)^{-1}X^{T}y, (X^{T}X)^{-1}X^{T}y \right \rangle+C,$$

and $\langle \cdot ,\cdot \rangle$ is the inner product defined by $$ \langle x,y\rangle =x^{\rm {T}}(\mathbf {X} ^{\rm {T}}\mathbf {X} )y. $$

I understand the idea is to rewrite $S(\beta)$ into the form $S(\beta)=(x-a)^2+b$, so that the minimizer can be read off directly. But I do not understand how to rewrite $S(\beta)$, or what principles are used to do so.
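As a quick sanity check of the completed-square idea (a sketch with made-up random data in NumPy; the names `X`, `y`, `S` are only for illustration), the minimizer should be $\hat\beta=(X^{T}X)^{-1}X^{T}y$, and perturbing it in any direction should only increase $S$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # hypothetical design matrix
y = rng.normal(size=50)        # hypothetical response vector

def S(beta):
    """Residual sum of squares ||y - X beta||^2."""
    r = y - X @ beta
    return r @ r

# Normal-equation solution (X^T X) beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Completing the square says S can only grow away from beta_hat.
for _ in range(100):
    delta = rng.normal(size=3)
    assert S(beta_hat + delta) >= S(beta_hat)
```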

  • I missed the definition of the inner product in the original question. (2017-02-27)

2 Answers


You missed an important detail:

and $\langle \cdot ,\cdot \rangle$ is the inner product defined by $$ \langle x,y\rangle =x^{\rm {T}}(\mathbf {X} ^{\rm {T}}\mathbf {X} )y $$

With that, you can expand $$ \left \langle \beta, \beta \right \rangle-2\left \langle \beta, (X^{T}X)^{-1}X^{T}y \right \rangle+\left \langle (X^{T}X)^{-1}X^{T}y, (X^{T}X)^{-1}X^{T}y \right \rangle = \\ \beta^TX^TX\beta -2\beta^T X^TX (X^{T}X)^{-1}X^{T}y + [(X^{T}X)^{-1}X^{T}y]^TX^TX (X^{T}X)^{-1}X^{T}y, $$ which simplifies to $\beta^TX^TX\beta - 2\beta^TX^Ty + y^TX(X^TX)^{-1}X^Ty$. Adding the constant $C = y^Ty - y^TX(X^TX)^{-1}X^Ty$ then recovers exactly the $S(\beta)$ we started with.
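The expansion above can be verified numerically (a sketch with hypothetical random data; `G`, `inner`, and `bh` are names I am introducing for the Gram matrix, the inner product, and $(X^TX)^{-1}X^Ty$):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))   # hypothetical data
y = rng.normal(size=40)
beta = rng.normal(size=3)

G = X.T @ X                          # Gram matrix defining the inner product
inner = lambda a, b: a @ G @ b       # <a, b> = a^T (X^T X) b
bh = np.linalg.solve(G, X.T @ y)     # (X^T X)^{-1} X^T y

# Inner-product form of the quadratic part
lhs = inner(beta, beta) - 2 * inner(beta, bh) + inner(bh, bh)

# Its explicit expansion: beta^T X^T X beta - 2 beta^T X^T y + y^T X bh
rhs = beta @ G @ beta - 2 * beta @ X.T @ y + y @ X @ bh
assert np.isclose(lhs, rhs)

# Adding C = y^T y - y^T X (X^T X)^{-1} X^T y recovers S(beta) exactly.
S = y @ y - 2 * beta @ X.T @ y + beta @ G @ beta
C = y @ y - y @ X @ bh
assert np.isclose(lhs + C, S)
```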


(old answer below)

Make the substitution $\gamma = X\beta$. Since $X^TX$ is positive definite, $X$ has full column rank, so there exists a matrix $M$ with $MX = I$, and hence $\beta = MX\beta = M\gamma$. In particular, take $M = (X^TX)^{-1}X^T$.

We then have $$ S(\gamma) = y^{T}y-2\beta^T X^{T}y+\beta^{T}X^{T}X\beta = \\ y^Ty - 2(M\gamma)^T X^Ty + \gamma^T\gamma = \\ \gamma^T\gamma - 2\gamma^T M^TX^Ty + y^Ty = \\ \langle \gamma, \gamma \rangle - 2\langle \gamma,M^TX^T y \rangle + \langle y,y \rangle =\\ \langle \gamma, \gamma \rangle - 2\langle \gamma,X(X^TX)^{-1}X^T y \rangle + \langle y,y \rangle $$
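The substitution can be checked numerically as well (a sketch with hypothetical random data; `M` is the left inverse $(X^TX)^{-1}X^T$ from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))   # hypothetical data
y = rng.normal(size=30)
beta = rng.normal(size=3)

M = np.linalg.solve(X.T @ X, X.T)   # M = (X^T X)^{-1} X^T, so M X = I
assert np.allclose(M @ X, np.eye(3))

gamma = X @ beta
S = y @ y - 2 * beta @ X.T @ y + beta @ X.T @ X @ beta
S_gamma = gamma @ gamma - 2 * gamma @ (M.T @ X.T @ y) + y @ y
assert np.isclose(S, S_gamma)       # same value after the substitution
```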


I think you may be getting bogged down in a lot of notation just for the sake of brevity, without really saving yourself any time. Let's use a few more words to understand what's going on.

Linear least squares boils down to the problem of minimizing $\| Ax-b \|^2$, where $b$ is a given $m$-dimensional vector and $A$ is a given $m \times n$ matrix usually with $m>n$. The geometric idea is that this can only happen if $Ax-b$ is orthogonal to every vector in the column space of $A$, i.e. every vector of the form $Ay$ for some $n$-dimensional vector $y$. Thus:

$$(Ay)^T (Ax-b)=0 \\ y^T A^T (Ax-b)=0.$$

Taking $y$ to be the standard basis vector $e_i$ for each $i=1,\dots,n$ gives $A^T (Ax-b)=0$, which is the system of normal equations $A^TAx = A^Tb$.
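Concretely, the normal equations can be solved and compared against a library least-squares routine (a sketch with made-up random data; `np.linalg.lstsq` is used only as an independent reference):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(20, 4))   # hypothetical m x n matrix, m > n, full column rank
b = rng.normal(size=20)

# Solve the normal equations A^T A x = A^T b directly.
x = np.linalg.solve(A.T @ A, A.T @ b)

# Independent check against NumPy's least-squares solver.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_ref)

# The residual A x - b is orthogonal to the column space of A.
assert np.allclose(A.T @ (A @ x - b), 0)
```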

Now where did this orthogonality come from? You can see it a few different ways, but here's one. Given an $x$ satisfying the normal equations, take any other $y$. Then the Pythagorean theorem gives $\| Ay-b \|^2 = \| A(x-y) \|^2 + \| Ax-b \|^2 \geq \| Ax-b \|^2$. Moreover if $A$ has full column rank (which is usually the case), then $\| A(x-y) \|^2>0$ so that $ \| Ay-b \|^2>\| Ax-b \|^2$.
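The Pythagorean identity in the last paragraph is easy to verify numerically (a sketch with hypothetical random data; `x` solves the normal equations, `y` is an arbitrary competitor):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(25, 3))   # hypothetical data
b = rng.normal(size=25)

x = np.linalg.solve(A.T @ A, A.T @ b)   # satisfies the normal equations
y = rng.normal(size=3)                  # any other candidate vector

# ||Ay - b||^2 = ||A(x - y)||^2 + ||Ax - b||^2, by orthogonality of the residual
lhs = np.sum((A @ y - b) ** 2)
rhs = np.sum((A @ (x - y)) ** 2) + np.sum((A @ x - b) ** 2)
assert np.isclose(lhs, rhs)

# Hence x is optimal: every other y gives at least as large a residual.
assert lhs >= np.sum((A @ x - b) ** 2)
```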