I want to understand the derivative of the cost function for linear regression with Ridge regularization. The loss is:
$L^{\text{Ridge}}(\beta) = \sum_{i=1}^n (y_i - \phi(x_i)^T\beta)^2 + \lambda \sum_{j=1}^k \beta_j^2$
Where the sum of squares can be rewritten in matrix form as:
$L^{\text{Ridge}}(\beta) = ||y-X\beta||^2 + \lambda \sum_{j=1}^k \beta_j^2$
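As a quick numerical sanity check that the two forms of the loss agree, here is a short sketch assuming NumPy; the data, $\lambda$, and $\beta$ are arbitrary made-up values:

```python
import numpy as np

# Hypothetical small dataset, purely for illustration
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))   # rows are phi(x_i)^T
y = rng.standard_normal(10)
beta = rng.standard_normal(3)
lam = 0.5

# Sum-of-squares form: sum_i (y_i - phi(x_i)^T beta)^2 + lambda * sum_j beta_j^2
loss_sum = sum((y[i] - X[i] @ beta) ** 2 for i in range(len(y))) \
           + lam * np.sum(beta ** 2)

# Matrix-norm form: ||y - X beta||^2 + lambda * sum_j beta_j^2
loss_norm = np.linalg.norm(y - X @ beta) ** 2 + lam * np.sum(beta ** 2)

print(np.isclose(loss_sum, loss_norm))
```

Both expressions evaluate to the same number (up to floating-point error), which confirms the rewriting.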
To find the optimum, its derivative is set to zero, which leads to this solution:
$\beta^{\text{Ridge}} = (X^TX + \lambda I)^{-1} X^T y$
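This closed form can be verified numerically: at $\beta^{\text{Ridge}}$ the gradient of the loss, $-2X^T(y - X\beta) + 2\lambda\beta$, should vanish. A minimal sketch assuming NumPy, with hypothetical random data:

```python
import numpy as np

# Hypothetical data for illustration
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 4))
y = rng.standard_normal(20)
lam = 0.3

# Closed-form Ridge solution: beta = (X^T X + lambda I)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Gradient of the Ridge loss at beta_hat; should be (numerically) zero
grad = -2 * X.T @ (y - X @ beta_hat) + 2 * lam * beta_hat
print(np.allclose(grad, 0))
```

Note that `np.linalg.solve` is used instead of explicitly forming the inverse, which is the standard numerically preferable choice.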
Now I would like to understand this and try to derive it myself; here's what I've got:
Since $||x||^2 = x^Tx$ and $\frac{\partial}{\partial x} [x^Tx] = 2x^T$, this can be applied using the chain rule:
\begin{align*}
\frac{\partial}{\partial \beta} L^{\text{Ridge}}(\beta) = 0^T &= -2(y - X \beta)^TX + 2 \lambda I\\
0 &= -2(y - X \beta) X^T + 2 \lambda I\\
0 &= -2X^Ty + 2X^TX\beta + 2 \lambda I\\
0 &= -X^Ty + X^TX\beta + 2 \lambda I\\
X^Ty &= X^TX\beta + 2 \lambda I\\
(X^TX + \lambda I)^{-1} X^Ty &= \beta
\end{align*}
Where I struggle is the next-to-last equation: I multiply it by $(X^TX + \lambda I)^{-1}$, and I don't think that leads to a correct equation.
What have I done wrong?