15

I am trying to compute the derivative:$$\frac{d}{d\boldsymbol{\mu}}\left( (\mathbf{x} - \boldsymbol{\mu})^\top\boldsymbol{\Sigma} (\mathbf{x} - \boldsymbol{\mu})\right)$$where the size of all vectors ($\mathbf{x},\boldsymbol{\mu}$) is $n\times 1$ and the size of the matrix ($\boldsymbol{\Sigma}$) is $n\times n$.

I tried to break this down as $$\frac{d}{d\boldsymbol{\mu}}\left( \mathbf{x}^\top\boldsymbol{\Sigma} \mathbf{x} - \mathbf{x}^\top\boldsymbol{\Sigma} \boldsymbol{\mu} - \boldsymbol{\mu}^\top\boldsymbol{\Sigma} \mathbf{x} + \boldsymbol{\mu}^\top\boldsymbol{\Sigma} \boldsymbol{\mu} \right) $$

yielding $$(\mathbf{x} + \boldsymbol{\mu})^\top\boldsymbol{\Sigma} + \boldsymbol{\Sigma}(\boldsymbol{\mu} - \mathbf{x})$$

but the dimensions don't work: $1\times n + n\times 1$. Any help would be greatly appreciated.

-C

  • 0
    Why did you write $u$ instead of $\mu$ for the derivative?2011-12-27
  • 0
    Mistake: now fixed2011-12-27
  • 0
    I guess $\Sigma$ here is supposed to be symmetric?2011-12-28
  • 0
You probably meant the *inverse* of sigma in the original expression. It's the Mahalanobis distance.2017-06-03
  • 0
    What is the broader context of this expression?2017-11-21

3 Answers

9

Think of the case $n = 1$: you get $(x-\mu)^{\top} \Sigma (x-\mu) = \sigma (x-\mu)^2$ for some constant $\sigma$ representing the matrix, so the derivative with respect to $\mu$ is $2 \sigma(\mu-x)$. The same thing happens in dimension $n$: the derivative of this function of $\mu$ is its gradient. Write $x = (x_1, \dots, x_n)$, $\mu = (\mu_1, \dots, \mu_n)$ and $\Sigma = (\sigma_{ij})$. Then $$ f(\mu) = (x-\mu)^{\top} \Sigma (x-\mu) = \sum_{i=1}^n \sum_{j=1}^n (x_i - \mu_i)(x_j - \mu_j)\sigma_{ij}. $$ Computing the partial derivative with respect to the $k^{\text{th}}$ variable, with $1 \le k \le n$, you get $$\begin{align*} \frac{\partial f}{\partial\mu_{k}} &= \sum_{i=1}^n \sum_{j=1}^n \left( -\delta_{ik} (x_j - \mu_j) \sigma_{ij} \right) + \left( (x_i -\mu_i) (-\delta_{jk}) \sigma_{ij} \right)\\ &= \sum_{j=1}^n (\mu_j - x_j) \sigma_{kj} + \sum_{i=1}^n (\mu_i - x_i) \sigma_{ik}, \end{align*}$$ where $\delta_{ij} = 0$ if $i \neq j$ and $1$ if $i=j$. If you look at the vector $\nabla f = \left( \frac{ \partial f}{\partial \mu_1} , \dots, \frac{ \partial f}{\partial \mu_n}\right)$, you see that its components are precisely those of the vector $\Sigma(\mu-x) + \Sigma^{\top} (\mu-x)$. If the matrix $\Sigma$ is symmetric, you get $2\Sigma(\mu-x)$.
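As a quick numerical sanity check, here is a small NumPy sketch (variable names are just illustrative) comparing this gradient against central finite differences, with a deliberately non-symmetric $\Sigma$:

```python
import numpy as np

# Compare the analytic gradient Sigma(mu - x) + Sigma^T(mu - x)
# with a central finite-difference estimate of d/dmu of (x-mu)^T Sigma (x-mu).
rng = np.random.default_rng(0)
n = 4
x = rng.standard_normal(n)
mu = rng.standard_normal(n)
Sigma = rng.standard_normal((n, n))  # deliberately not symmetric

f = lambda m: (x - m) @ Sigma @ (x - m)

grad_analytic = Sigma @ (mu - x) + Sigma.T @ (mu - x)

eps = 1e-6
grad_numeric = np.array([
    (f(mu + eps * e) - f(mu - eps * e)) / (2 * eps)
    for e in np.eye(n)  # rows of the identity are the unit vectors e_k
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```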

Hope that helps,

  • 1
    Thank you, your answer reminded me how to think about these types of derivatives2011-12-28
  • 1
No problem =) It is my pleasure. Just remember that there are different ways to define derivatives in higher-dimensional real vector spaces, and in $\mathbb R^n$ the standard definition of differentiability leads to computing the gradient. (There are other ways to say you are differentiating, for instance with respect to a direction, or, like I just did, via partial derivatives.)2011-12-28
  • 0
I attended a second-year course before I got to university because I was curious... and I was recommended one particular teacher because her course was awesome. On the first day she was speaking precisely of derivatives, and she taught us how to differentiate vector functions. This is what I learned on the first day, and it is also how she taught us to think about it! A teacher with very good intuition. I love that teacher =) and I'll always remember this trick this way!2011-12-28
  • 0
    @PatrickDaSilva I really like this answer, so I would actually really want to learn this trick myself for general matrices and vectors. Could you provide me with any reference from which I could learn this?2013-10-18
  • 0
    @PatrickDaSilva especially some more examples would be great!2013-10-18
  • 0
    @rbm : What are you looking for exactly? Do you want to understand better how to compute the gradient of functions written in matrix form or something?2013-10-18
  • 0
    @PatrickDaSilva I want to understand better how to do the whole procedure that you described above, so that I don't just need to rely on all the standard rules for matrix derivatives that I use now. So basically anything where this method is described would be great!2013-10-18
  • 0
    @rbm : Try differentiating these functions : $f(x) = \frac{x^{\top} Ax}{\|x\|^2}$, $g(x) = \frac{b \cdot x}{\|x\|^4}$ (of course the domain of $f$ and $g$ is $\mathbb R^n \backslash \{ 0 \}$). To do it, try going through the steps of my answer, i.e. use the chain rule. If you are stuck, post a question on MSE and I will be more than glad to answer it. There is no "guide" on how to do it. It's all about using the chain rule and computing the $n$ partial derivatives (of course the partial derivatives will look the same up to an index so you really just compute once).2013-10-19
  • 0
    @PatrickDaSilva Many thanks :)! I will try it out!2013-10-19
  • 0
    @PatrickDaSilva Could you maybe help me with this question: http://math.stackexchange.com/questions/606646/matrix-derivative-ax-btax-b ? Thanks a lot in advance!2013-12-16
8

There is a short way to calculate this correctly. The expression $(x-\mu)^T\Sigma(x-\mu)$ is a quadratic form, and the derivative of such a form is well known (see e.g. here):

$$\frac{\partial x^TAx }{\partial x}=(A+A^T)x$$

This works even if $A$ is not symmetric. In your particular example, you use the chain rule as,

$$\frac{\partial (x-\mu)^T\Sigma(x-\mu) }{\partial \mu}=\frac{\partial (x-\mu)^T\Sigma(x-\mu) }{\partial (x-\mu)}\frac{\partial (x-\mu)}{\partial \mu}$$

Thus,

$$\frac{\partial (x-\mu)^T\Sigma(x-\mu) }{\partial (x-\mu)}=(\Sigma +\Sigma^T)(x-\mu)$$

and

$$\frac{\partial (x-\mu)}{\partial \mu}=-I,$$ where $I$ is the $n\times n$ identity matrix.

Combining equations you get the final answer,

$$\frac{\partial (x-\mu)^T\Sigma(x-\mu) }{\partial \mu}=(\Sigma +\Sigma^T)(\mu-x)$$
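The quadratic-form rule used above is easy to verify numerically; here is a minimal NumPy sketch (names are illustrative) checking $\frac{\partial x^TAx}{\partial x}=(A+A^T)x$ with a deliberately non-symmetric $A$:

```python
import numpy as np

# Check d(x^T A x)/dx = (A + A^T) x against central finite differences.
rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))  # non-symmetric on purpose
x = rng.standard_normal(n)

q = lambda v: v @ A @ v

grad_rule = (A + A.T) @ x
eps = 1e-6
grad_fd = np.array([(q(x + eps * e) - q(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])

print(np.allclose(grad_rule, grad_fd, atol=1e-5))  # True
```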

3

In full technicality:

$$\frac{\partial}{\partial u_k}\left(\sum_{i,j=1}^n \sigma_{ij}(x_i-u_i)(x_j-u_j)\right)=\sum_{i,j=1}^n\sigma_{ij}\left[-\delta_{ik}(x_j-u_j)-(x_i-u_i)\delta_{jk}\right]$$

$$=-\sum_{l=1}^n (\sigma_{kl}+\sigma_{lk})(x_l-u_l)=\left[(\Sigma+\Sigma^T)(\vec{u}-\vec{x})\right]_k.$$

In other words, you should have gotten $\Sigma^T(u-x)$ instead of $(x+u)^T\Sigma$. Note that $a^T\Phi b=b^T\Phi^T a$.
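A quick NumPy check of the identity $a^T\Phi b=b^T\Phi^T a$ (a small sketch; names are illustrative):

```python
import numpy as np

# Verify a^T Phi b == b^T Phi^T a for random vectors and a random matrix.
rng = np.random.default_rng(2)
n = 3
a, b = rng.standard_normal(n), rng.standard_normal(n)
Phi = rng.standard_normal((n, n))

print(np.isclose(a @ Phi @ b, b @ Phi.T @ a))  # True
```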

  • 0
    In the last displayed equation, one should replace the vector $(\Sigma+\Sigma^T)(\vec{u}-\vec{x})$ by its $k$th coordinate.2011-12-28
  • 0
    @Didier: Well, yes, I guess - I'm using $a_i$ and $\vec{a}$ interchangeably here.2011-12-28
  • 0
    Your comment is not related to my remark. Let me try once again: the LHS of your displayed equation is a number, aka a matrix of size $1\times1$, while the RHS is a matrix of size $n\times1$. Hence they are not equal.2011-12-28
  • 0
    Didier is right : you obviously know what you're doing but there's a typo out there. At the last equality you have written a vector, but equality stands if you replace that vector by its $k^{\text{th}}$ component.2011-12-28
  • 0
    @Didier, Patrick: Sorry, let me explain. I sometimes write $a_i$ to *stand* for $\vec{a}$ (as $k$ ranges over its various indices). Apparently this is very particular to me...2012-01-03