In linear regression, the loss function is expressed as
$$\frac1N \left\|XW-Y\right\|_{\text{F}}^2$$
where $X, W, Y$ are matrices. Taking the derivative with respect to $W$ yields
$$\frac 2N \, X^T(XW-Y)$$
Why is this so?
Let
$$\begin{array}{rl} f (\mathrm W) &:= \| \mathrm X \mathrm W - \mathrm Y \|_{\text{F}}^2 = \operatorname{tr} \left( (\mathrm X \mathrm W - \mathrm Y)^{\top} (\mathrm X \mathrm W - \mathrm Y) \right)\\ &\,= \operatorname{tr} \left( \mathrm W^{\top} \mathrm X^{\top} \mathrm X \mathrm W - \mathrm Y^{\top} \mathrm X \mathrm W - \mathrm W^{\top} \mathrm X^{\top} \mathrm Y + \mathrm Y^{\top} \mathrm Y \right)\end{array}$$
Differentiating with respect to $\mathrm W$, using the standard trace identities $\nabla_{\mathrm W} \operatorname{tr} (\mathrm W^{\top} \mathrm A \mathrm W) = (\mathrm A + \mathrm A^{\top}) \mathrm W$, $\nabla_{\mathrm W} \operatorname{tr} (\mathrm B \mathrm W) = \mathrm B^{\top}$, and $\nabla_{\mathrm W} \operatorname{tr} (\mathrm W^{\top} \mathrm C) = \mathrm C$ (here $\mathrm A = \mathrm X^{\top} \mathrm X$ is symmetric),
$$\nabla_{\mathrm W} f (\mathrm W) = 2 \, \mathrm X^{\top} \mathrm X \mathrm W - 2 \, \mathrm X^{\top} \mathrm Y = 2 \, \mathrm X^{\top} \left( \mathrm X \mathrm W - \mathrm Y \right)$$
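If you want to sanity-check this result numerically, here is a minimal NumPy sketch comparing the closed-form gradient $2\,\mathrm X^{\top}(\mathrm X \mathrm W - \mathrm Y)$ against central finite differences (the shapes `n, d, m` and the random seed are arbitrary choices for illustration, not anything from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5, 3, 2
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, m))
Y = rng.standard_normal((n, m))

def f(W):
    # Squared Frobenius norm of the residual XW - Y.
    return np.sum((X @ W - Y) ** 2)

# Closed-form gradient derived above.
analytic = 2 * X.T @ (X @ W - Y)

# Central finite differences, one entry of W at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(d):
    for j in range(m):
        E = np.zeros_like(W)
        E[i, j] = eps
        numeric[i, j] = (f(W + E) - f(W - E)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```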
Let $X=(x_{ij})_{ij}$ and similarly for the other matrices. We are trying to differentiate $$ \|XW-Y\|_{\text{F}}^2=\sum_{i,j}\Bigl(\sum_k x_{ik}w_{kj}-y_{ij}\Bigr)^2\qquad (\star) $$ with respect to $W$. The result will be a matrix whose $(i,j)$ entry is the derivative of $(\star)$ with respect to the variable $w_{ij}$.
So think of $(i,j)$ as being fixed now. Only the terms of $(\star)$ whose second index equals $j$ depend on $w_{ij}$. Taking their derivative gives $$ \frac{\partial\|XW-Y\|_{\text{F}}^2}{\partial w_{ij}}=\sum_{k}2x_{ki}\Bigl(\sum_l x_{kl}w_{lj}-y_{kj}\Bigr)=\left[2X^T(XW-Y)\right]_{i,j}. $$
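This entrywise formula can also be checked directly. A small sketch (again with arbitrary shapes and seed, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 5, 3, 2
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, m))
Y = rng.standard_normal((n, m))

# Matrix form of the gradient.
G = 2 * X.T @ (X @ W - Y)

# Scalar form: the explicit double sum for one fixed entry (i, j).
i, j = 1, 0
entry = sum(2 * X[k, i] * (X[k, :] @ W[:, j] - Y[k, j]) for k in range(n))

print(np.isclose(entry, G[i, j]))  # True
```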