I am reading a section of a book regarding linear regression and came across a derivation that I could not follow.
It starts with a loss function:
$\mathcal{L}(\textbf{w},S) = (\textbf{y}-\textbf{X}\textbf{w})^\top(\textbf{y}-\textbf{X}\textbf{w})$
and then states, "We can seek the optimal $\textbf{w}$ by taking the derivatives of the loss with respect to $\textbf{w}$ and setting them to the zero vector":
$\frac{\partial\mathcal{L}(\textbf{w},S)}{\partial\textbf{w}} = -2\textbf{X}^{\top}\textbf{y} + 2\textbf{X}^\top\textbf{X}\textbf{w} = \textbf{0}$
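For context, when I expand the quadratic form myself (the two cross terms are scalars, so $\textbf{y}^\top\textbf{X}\textbf{w} = \textbf{w}^\top\textbf{X}^\top\textbf{y}$ and they combine), I get, assuming my algebra is right,
$\mathcal{L}(\textbf{w},S) = \textbf{y}^\top\textbf{y} - 2\textbf{w}^\top\textbf{X}^\top\textbf{y} + \textbf{w}^\top\textbf{X}^\top\textbf{X}\textbf{w},$
so I assume the stated result comes from differentiating the last two terms with respect to $\textbf{w}$, but that is exactly the step I cannot reproduce.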
How is this derivative being calculated? I have no idea how to take the derivative of vector- or matrix-valued functions, especially when the derivative is with respect to a vector. I found a PDF (the Matrix Cookbook, http://orion.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf ) which appears to address some of my questions, but my attempts at taking the derivative of the loss function seem to be missing a transpose, so my result does not reduce as nicely as the book's.
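To at least convince myself that the book's formula is correct, I put together a quick numerical sanity check (my own sketch using NumPy; the variable names and problem sizes are mine, not the book's) comparing the stated gradient against a finite-difference approximation:

```python
import numpy as np

# Random problem instance (sizes and names are mine, just for the check)
rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

def loss(w):
    r = y - X @ w
    return r @ r  # (y - Xw)^T (y - Xw)

# Gradient as stated in the book: -2 X^T y + 2 X^T X w
grad_book = -2 * X.T @ y + 2 * X.T @ X @ w

# Central finite-difference approximation of the gradient
eps = 1e-6
grad_fd = np.array([
    (loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
    for e in np.eye(d)
])

print(np.allclose(grad_book, grad_fd, atol=1e-4))  # should print True
```

The two agree for me, so the book's derivative seems right; my question is purely about how to derive it symbolically.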