
Why is it that the derivative of the sum of squares of a vector $w$: \begin{eqnarray} \frac{\lambda}{2n} \sum_w w^2, \end{eqnarray}

turns out to be

\begin{eqnarray} \frac{\lambda}{n} w \end{eqnarray}

and not

\begin{eqnarray} \frac{\lambda}{n} \sum_w w \;? \end{eqnarray}

Basically as I see it, we've got

\begin{eqnarray} w = [w_1, w_2, w_3, \ldots] \end{eqnarray}

\begin{eqnarray} \frac{d}{dw} \frac{\lambda}{2n} \sum_w w^2 = \frac{\lambda}{2n}(\frac{\partial}{\partial w_1} \sum_w w^2 + \frac{\partial}{\partial w_2} \sum_w w^2 + \frac{\partial}{\partial w_3} \sum_w w^2 ...) \end{eqnarray}

\begin{eqnarray} = \frac{\lambda}{n} (w_1 + w_2 + w_3 ...) \end{eqnarray}

\begin{eqnarray} = \frac{\lambda}{n} \sum_w w \end{eqnarray}

I'm following this ebook here (equations 87 and 88, which are basically the same as what I've written above). The main thing I don't understand is why the summation can be eliminated. Any math books or write-ups on the subject would also be helpful.
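For concreteness, here is a quick finite-difference sketch (with made-up values for $\lambda$, $n$, and the weights; not from the ebook) showing what the gradient of the penalty actually looks like, one partial derivative per weight:

```python
import numpy as np

lam, n = 0.1, 50                      # made-up values for lambda and n
w = np.array([1.0, -2.0, 3.0])

def penalty(w):
    # the L2 penalty: (lambda / 2n) * sum of squared weights, a single number
    return lam / (2 * n) * np.sum(w ** 2)

# numerical gradient: central difference in each coordinate separately
eps = 1e-6
grad = np.array([
    (penalty(w + eps * np.eye(len(w))[i]) - penalty(w - eps * np.eye(len(w))[i])) / (2 * eps)
    for i in range(len(w))
])

print(grad)           # one entry per weight, matching (lam / n) * w
print(lam / n * w)
```

The penalty itself is one number, but its derivative is taken coordinate by coordinate, so the result is a vector the same shape as $w$, not a single summed number.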

  • Please transcribe the equations directly in your post. While I clicked on the link you give, I am not about to scroll through umpteen or more pages to figure out what, exactly, you are asking. (2017-02-18)
  • The equations are basically the same as what I've written; I added a note about it. (2017-02-18)
  • That ebook uses MathJax to format its equations, the same markup that this site uses, so all you needed to do to copy the equations here would be to copy the MathJax source the ebook already uses. I got it by right-clicking on an equation in Firefox on a Windows machine and selecting "Show Math As" > "TeX Commands". I edited the original equations down to remove the equation numbers and the terms that you weren't asking about, but I left in the factor of $\frac1n.$ You can edit them to your own taste now. (2017-02-18)
  • When you take the derivative, it's an operation on functions that are defined on the continuum. What does it mean to take the derivative of such a function and also sum over $w$? What's the index of the sum? (2017-02-18)

2 Answers


If there are actually $m$ input variables, the sum in Equation $87$ of the ebook can be written in the notation $$ \sum_{i=1}^m w_i^2, $$ and it can be viewed as a function of the $m$ variables $w_1, \ldots, w_m.$ The "derivative" in the ebook is a partial derivative, which describes how the function value would change if you could slightly increase or decrease just one of the $m$ input variables while leaving all the others unchanged. The notation $\frac{\partial}{\partial w}$ in the ebook means the same thing as you would recognize in $\frac{\partial}{\partial w_i},$ that is, it is a partial derivative with respect to one variable, but the ebook has chosen to let the letter $w$ by itself represent one of the $m$ variables rather than use a subscript.

The partial derivative of the sum of two functions is the sum of the partial derivatives, just as you are used to in the case of single-variable functions, but only when both partial derivatives are with respect to the same variable. Partial derivatives with respect to different variables do not add up in the manner you imagine; and in any case, the ebook definitely means the partial derivative of the entire sum with respect to one variable.

When we write $$ \frac{\partial}{\partial w_j} w_i^2, $$ the result is zero unless $i = j,$ because in a partial derivative $\frac{\partial}{\partial w_j}$ over the variables $w_1, \ldots, w_m,$ all the variables except $w_j$ act like constants. On the other hand, $$ \frac{\partial}{\partial w_j} w_j^2 = 2w_j, $$ because that describes how the function $w_j^2$ changes as we vary $w_j.$

To spell it out in gory detail, what you actually have is \begin{align} \frac{\partial}{\partial w_j} \frac{\lambda}{2n} \sum_{i=1}^m w_i^2 &= \frac{\lambda}{2n} \frac{\partial}{\partial w_j}\left( w_1^2 + \cdots + w_{j-1}^2 + w_j^2 + w_{j+1}^2 + \cdots + w_m^2 \right) \\ & = \frac{\lambda}{2n} \left(\frac{\partial}{\partial w_j}w_1^2 + \cdots + \frac{\partial}{\partial w_j}w_{j-1}^2 + \frac{\partial}{\partial w_j}w_j^2 + \frac{\partial}{\partial w_j}w_{j+1}^2 + \cdots + \frac{\partial}{\partial w_j}w_m^2 \right) \\ & = \frac{\lambda}{2n} \left(0 + \cdots + 0 + 2w_j + 0 + \cdots + 0\right) \\ & = \frac{\lambda}{n} w_j. \end{align}
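If it helps, the algebra above can be double-checked symbolically. Here is a small SymPy sketch (the symbol names and the choice of four weights are mine, just for illustration):

```python
import sympy as sp

lam, n = sp.symbols('lambda n', positive=True)
w = sp.symbols('w1:5')                  # four weights: w1, w2, w3, w4

# the penalty term (lambda / 2n) * sum of squares
penalty = lam / (2 * n) * sum(wi**2 for wi in w)

# partial derivative with respect to each w_j: all other terms vanish,
# leaving (lambda / n) * w_j with no summation
grads = [sp.diff(penalty, wj) for wj in w]
print(grads)
```

Each entry of `grads` comes out as $\frac{\lambda}{n} w_j$, matching the last line of the derivation.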

  • Thanks! I totally get that part. What I don't understand is why we don't have a sum when we add up all the partial derivatives. Also, the $n$ here is supposed to be the number of training samples, not weights, so in the sum I would use a different letter than $n$. (2017-02-18)
  • I'll fix the $n$ in a moment. Also, I initially wrote this up before seeing your exposition of how you expected the derivatives to add up (which was _extremely_ useful info to add to the question, good job!), and I edited the answer a bit after seeing that. (2017-02-18)
  • Ok, so the big thing was, I guess, that the author left out the index for the $m$ weights and used $w$ for both the weight vector and the indexed variable, which is confusing. And when we take the derivative of the whole sum w.r.t. each weight, we just get the vector back, I guess, which I'm still not sure I'm understanding correctly, because the answer from the first equation in my question will be one number, whereas the answer from the second equation will be a $1 \times m$ vector. (2017-02-18)
  • Or at least that's how it ends up working out when this is coded up as a program. The L2 penalty is one number, but the derivative of it w.r.t. the weights is a $1 \times m$ vector. (2017-02-18)
  • Looking back over the ebook, I notice that $w$ still has a subscript in Eq 71, but the subscript has disappeared by the time we get to Eq 85. It seems fine to me to use a variable name without a subscript to represent _one_ of several variables over which a sum is taken, but I can see how the sudden switch from subscript notation to non-subscript notation could really throw someone off. I much prefer books that use a consistent notation from the start to the finish of the exposition of one topic. (2017-02-18)

I suppose it is because it is a partial derivative with respect to one particular weight. The summation index is just omitted. Basically, in the book you mentioned: $$C = C_0 + \frac {\lambda}{2n}\sum_{i}{\omega_i}^2$$ and we are interested in $$\frac{\partial C}{\partial \omega_i}.$$
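A tiny numerical sketch (arbitrary numbers, just to illustrate the point) of why the other terms of the sum drop out: nudging one weight leaves every other squared term unchanged, so their partial derivatives are zero.

```python
import numpy as np

w = np.array([0.5, -1.5, 2.0])
eps = 1e-6

# squared terms before and after nudging only w[0]
before = w ** 2
w_nudged = w.copy()
w_nudged[0] += eps
after = w_nudged ** 2

print(after - before)   # only the first entry changes; the rest stay exactly 0
```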

  • Yes, I follow that much, but isn't the derivative of $C$ w.r.t. all the weights going to end up being a sum of their partial derivatives? So I still don't see how the sum is removed. (2017-02-18)
  • I suppose that $\partial \omega_j^2 / \partial \omega_i = 0$ when $j \ne i$. (2017-02-18)
  • Yes, of course. I'll add more to the question so you can see what I mean. (2017-02-18)