
\begin{equation}\frac{d}{d\theta}\frac{1}{2}(\theta^TX - y)^2 = 0\end{equation} where $X$ is an $m \times n$ matrix, $y$ is an $m$-dimensional vector, and $\theta$ is an $n$-dimensional vector.

I can solve this equation, but only intuitively, because I know the solution from a lecture I watched: \begin{equation} \theta = (X^TX)^{-1}X^Ty \end{equation}

Could someone solve it step by step and explain each transformation?

Below is my attempt, but it surely contains mistakes:

\begin{equation} \begin{aligned} \frac{d}{d\theta}\frac{1}{2}(\theta^TX - y)^2 &= 0 \\ (\theta^TX - y)X &= 0 \\ \theta^TXX - X^Ty &= 0 \\ X^T\theta X &= X^Ty \\ \theta &= (X^TX)^{-1}X^Ty \end{aligned} \end{equation}

As I said, this is only intuitive. In each line I see something that doesn't look right to me. Some of the transformations I made only so that I could reach the expected solution.

1 Answer

I refuse to answer this question without adding the following comment: one common source of confusion when taking derivatives of expressions from linear algebra is a poor or inconsistent choice of notation. This is the case here, too. If you take the transpose of an $n$-vector you get a $(1,n)$-matrix. If you apply this from the left to an $(n,m)$-matrix you get a $(1,m)$-matrix, an object from which you cannot subtract an $m$-vector. This kind of inconsistency in notation leads to more inconsistencies when you perform operations on these constructions, like taking the derivative. Consequently, your computation contains some lines which are not really well defined, e.g. those containing products $XX$: you do not assume $X$ to be a square matrix, so this product is not defined.
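The shape problem can be seen numerically. The following is only a minimal sketch, assuming NumPy and using the question's stated shape for $X$ ($m \times n$); the sizes `m = 5`, `n = 3` and the random data are made up for illustration.

```python
import numpy as np

m, n = 5, 3  # hypothetical sizes, chosen only for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))      # m x n matrix, as stated in the question
y = rng.normal(size=(m, 1))      # m-dimensional column vector
theta = rng.normal(size=(n, 1))  # n-dimensional column vector

# X @ theta has shape (m, 1), so subtracting y is well defined.
residual = X @ theta - y
print(residual.shape)

# theta.T @ X would multiply a (1, n) matrix by an (m, n) matrix,
# which is not a valid matrix product when m != n.
try:
    theta.T @ X
    product_defined = True
except ValueError:
    product_defined = False
print(product_defined)
```

With these shapes, only the form $X\theta - y$ type-checks, which is why the rest of this answer works with that expression.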

Another choice of notation which can turn into a trap is using the square notation for the scalar product of a vector with itself, like $v^2 = \langle v, v\rangle = v^T v$. While $v^2$ is common, it is easier to use one of the other forms when it comes to taking the derivative.

I'm also not too happy with the notation $\frac{d}{d\theta}$, since $\theta$ is a vector. I prefer something like $D_v f(\theta)$, denoting the directional derivative of $f$ at $\theta$ in the direction $v$. Now if $f(\theta) = X\theta$ you just get $D_v f = Xv$ (assuming $X$ is independent of $\theta$).
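For reference, here is the directional derivative written out as a limit. This is the usual definition, spelled out for the linear map under the same assumption that $X$ does not depend on $\theta$; the scalar $t$ is just the step size along $v$.

\begin{equation} D_v f(\theta) = \lim_{t \to 0} \frac{f(\theta + tv) - f(\theta)}{t}, \qquad D_v (X\theta) = \lim_{t \to 0} \frac{X(\theta + tv) - X\theta}{t} = Xv \end{equation}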

Combining the comments: what you could look at is either $\langle X\theta - y, X\theta -y\rangle$ or $\langle (\theta^T X)^T - y, (\theta^T X)^T -y\rangle$, depending on what $X$ looks like. I assume you are interested in the first variant. Then, taking the derivative in the direction of a vector $v$, you get

$D_v \frac{1}{2}\langle X\theta-y,X\theta-y\rangle = \frac{1}{2}\left(\langle Xv,X\theta -y\rangle + \langle X \theta -y, Xv\rangle\right) = \langle Xv,X \theta -y\rangle = \langle v,X^TX\theta-X^Ty\rangle$. Setting the derivative to zero means this must be $=0$ for every $v$, so the second factor needs to be $=0$, hence $X^TX\theta = X^Ty \Rightarrow \theta = (X^TX)^{-1}X^T y$

(assuming $X^TX$ is invertible ...), which miraculously is the result you were expecting.
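The result can also be checked numerically. This is a sketch only, assuming NumPy and random made-up data (`m = 20`, `n = 3`, for which $X^TX$ is invertible with probability one); the helper `f` and all variable names are illustrative. It compares the directional derivative formula with a finite difference, and the normal-equation solution with NumPy's own least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 3  # hypothetical sizes for illustration
X = rng.normal(size=(m, n))
y = rng.normal(size=m)

def f(theta):
    """Objective 1/2 <X theta - y, X theta - y>."""
    r = X @ theta - y
    return 0.5 * (r @ r)

# Directional derivative at an arbitrary point theta0 in direction v:
# central finite difference versus the formula <v, X^T X theta - X^T y>.
theta0 = rng.normal(size=n)
v = rng.normal(size=n)
t = 1e-6
finite_diff = (f(theta0 + t * v) - f(theta0 - t * v)) / (2 * t)
formula = v @ (X.T @ X @ theta0 - X.T @ y)

# Solve the normal equations X^T X theta = X^T y.
# np.linalg.solve avoids forming the inverse explicitly.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
theta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)

# At theta_hat the second factor X^T X theta - X^T y should vanish.
gradient_at_solution = X.T @ (X @ theta_hat - y)

print(finite_diff, formula)
print(theta_hat, theta_ref)
print(gradient_at_solution)
```

In practice one would usually call `np.linalg.lstsq` directly rather than solve the normal equations, since forming $X^TX$ squares the condition number of the problem.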

  • I did omit the equation from which you started out. There you require that the derivative is $=0$. Applied to the line where I calculate $D_v$ of that expression, this amounts to saying that $\langle v, X^TX \theta- X^T y\rangle = 0$. Since $v$ is just any vector, this equation has to hold for every such $v$, which is only possible if $X^TX\theta - X^Ty = 0$. The only remaining manipulation is multiplication from the left by $(X^T X)^{-1}$. (2012-06-10)