3

I'm new here, so "Hi" to everyone :D

I have the following problem. I have matrices $A$, $B$, $C$, $X$ and $Y$, all square (say n-by-n). In particular:

- $A$ is full rank;
- $B$ is symmetric and positive (semi)definite;
- $C$ is diagonal and positive definite;
- $Y$ is diagonal and positive definite;
- $X$ is diagonal ($X = \operatorname{diag}\{x_1, \ldots, x_n\}$) and is the unknown matrix.

Then I have the following function: $f(X) = (A(B+X^{T}YX)^{-1}A^{T} + C)^{-1}$ (it may seem redundant to write $X^{T}$ since $X$ is diagonal, but I think this is the clearest way to write it).

I would like to evaluate the derivative of the trace of $f(X)$ with respect to each $x_i$.

Any idea?
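For concreteness, here is a minimal NumPy sketch of the setup (the dimension, the random data and the helper name `f` are illustrative choices only, not part of the problem):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

A = rng.standard_normal((n, n))            # full rank with probability 1
M = rng.standard_normal((n, n))
B = M @ M.T                                # symmetric, positive semidefinite
C = np.diag(rng.uniform(0.5, 2.0, n))      # diagonal, positive definite
Y = np.diag(rng.uniform(0.5, 2.0, n))      # diagonal, positive definite

def f(X):
    """f(X) = (A (B + X^T Y X)^{-1} A^T + C)^{-1}."""
    S = np.linalg.inv(B + X.T @ Y @ X)
    return np.linalg.inv(A @ S @ A.T + C)

x = rng.uniform(0.5, 2.0, n)               # the unknown diagonal entries x_1, ..., x_n
X = np.diag(x)
print(np.trace(f(X)))                      # the objective whose derivative is wanted
```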

  • 1
This seems to be a standard matrix calculus problem. Though the calculus may be a little complicated, the problem can still be solved by standard arguments. You may refer to "Old and New Matrix Algebra Useful for Statistics", which lists many useful facts on matrix calculus. 2012-12-01
  • 0
@Shiyu: Hi, thanks for the observation. Could you point me to a good paper about these matrix manipulations? 2012-12-02
  • 0
You can refer to "Old and New Matrix Algebra Useful for Statistics", which is a good article on matrix calculus. 2012-12-03

2 Answers

1

If we perturb an invertible matrix $M$ by a small $\Delta M$, the first-order change in $M^{-1}$ is given by
$$\Delta (M^{-1}) := (M+\Delta M)^{-1} - M^{-1} = -M^{-1} (\Delta M) M^{-1} + O(\|\Delta M\|^2).$$

Now consider $f(X) = (A(B+X^{T}YX)^{-1}A^{T} + C)^{-1}$. Applying the formula above, first to the outer inverse and then to $(B+X^{T}YX)^{-1}$, and dropping second-order terms,
\begin{align}
\Delta f(X) &= \Delta\left((A(B+X^{T}YX)^{-1}A^{T} + C)^{-1}\right)\\
&\approx -f(X)\,\Delta\left(A(B+X^{T}YX)^{-1}A^{T} + C\right) f(X)\\
&= -f(X)A\, \Delta\left((B+X^{T}YX)^{-1}\right) A^{T}f(X)\\
&\approx f(X)A(B+X^{T}YX)^{-1}\, \Delta\left(B+X^{T}YX\right) (B+X^{T}YX)^{-1}A^{T}f(X)\\
&\approx f(X)A(B+X^{T}YX)^{-1} \left((\Delta X)^{T}YX + X^{T}Y\Delta X\right) (B+X^{T}YX)^{-1}A^{T}f(X).
\end{align}

Therefore, using the cyclic property of the trace and the symmetry of $B$, $C$ and $Y$,
\begin{align}
\Delta\, \mathrm{trace}\, f(X) &\approx \mathrm{trace}\, f(X)A(B+X^{T}YX)^{-1} \left((\Delta X)^{T}YX + X^{T}Y\Delta X\right) (B+X^{T}YX)^{-1}A^{T}f(X)\\
&= 2\,\mathrm{trace}\, (\Delta X)^{T}YX (B+X^{T}YX)^{-1}A^{T}f(X)^2A(B+X^{T}YX)^{-1},
\end{align}
and in turn
$$
\frac{d\,\mathrm{trace}\, f(X)}{dX} = 2YX (B+X^{T}YX)^{-1}A^{T}f(X)^2A(B+X^{T}YX)^{-1}.
$$
This is the formula for a general square matrix $X$. For a diagonal $X$, simply take the diagonal of the above derivative.
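For a quick numerical sanity check of the final formula, here is a small self-contained NumPy sketch (the dimension, random data and step size $h$ are arbitrary choices) that compares the analytic gradient with central differences over the diagonal entries:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
M = rng.standard_normal((n, n))
B = M @ M.T                                    # symmetric, positive semidefinite
C = np.diag(rng.uniform(0.5, 2.0, n))          # diagonal, positive definite
Y = np.diag(rng.uniform(0.5, 2.0, n))          # diagonal, positive definite
X = np.diag(rng.uniform(0.5, 2.0, n))          # diagonal, the variable

def f(X):
    S = np.linalg.inv(B + X.T @ Y @ X)
    return np.linalg.inv(A @ S @ A.T + C)

# analytic gradient: 2 Y X (B + X^T Y X)^{-1} A^T f(X)^2 A (B + X^T Y X)^{-1}
S = np.linalg.inv(B + X.T @ Y @ X)
F = f(X)
G = 2 * Y @ X @ S @ A.T @ F @ F @ A @ S

# central-difference check, perturbing each diagonal entry x_i in turn
h = 1e-5
for i in range(n):
    E = np.zeros((n, n))
    E[i, i] = 1.0
    num = (np.trace(f(X + h * E)) - np.trace(f(X - h * E))) / (2 * h)
    print(i, num, G[i, i])                     # the two numbers should agree closely
```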

  • 0
You say that $(M+\Delta M)^{-1} - M^{-1} = -M^{-1} (\Delta M) M^{-1}$. How did you obtain this? Using the matrix inversion lemma, or something else? 2012-12-01
  • 1
@the_candyman: This is not really an equality, but a first-order approximation formula. For the proof, see the [relevant Wikipedia entry](http://en.wikipedia.org/wiki/Invertible_matrix#Derivative_of_the_matrix_inverse), for example. To avoid misunderstanding, I will add a trailing $O(\|\Delta X\|^2)$ to each line or change the equality sign to an approximation sign where appropriate. 2012-12-01
  • 0
You may verify my answer numerically. First, generate random matrices $X,Y,A,B,C$ where $Y,B,C$ are symmetric (if $Y$ is asymmetric, replace it with $Y\leftarrow Y+Y^{T}$, etc.). Let $E_{ij}$ be the matrix whose entries are all zero, except that the $(i,j)$-th entry is 1. Compute the central difference $\mathrm{trace}(f(X+hE_{ij})-f(X-hE_{ij}))/(2h)$ for some small $h$ (say, $h=0.001$). This is the numerical derivative of $\mathrm{trace}(f(X))$ w.r.t. $X_{ij}$. Its value should be close to the $(i,j)$-th entry of $2YX (B+X^{T}YX)^{-1}A^{T}f(X)^2A(B+X^{T}YX)^{-1}$. 2012-12-01
  • 0
Last question... assuming that $X$ is an arbitrary rectangular m-by-n matrix, $B$ is m-by-m and all other matrices are n-by-n, is it still possible to do all the calculations you posted? 2012-12-03
  • 1
@the_candyman The same reasoning applies and the formula remains unchanged if $X$ is rectangular. However, if $B$, $C$ or $Y$ are not symmetric, the second line in the calculation of $\Delta\, \mathrm{trace}\, f(X)$ no longer holds and we have to stick with the first line. Consequently, the derivative formula will be twice as long as the current one. 2012-12-03
  • 0
Well, thank you for your precious help :D 2012-12-04
1

Just use the chain rule, which states that $$ \frac{\partial g(\mathbf{U})}{\partial X_{i,j}} = \operatorname{trace}\left(\left(\frac{\partial g(\mathbf{U})}{\partial \mathbf{U}}\right)^{T}\frac{\partial \mathbf{U}}{\partial X_{i,j}}\right). $$

In your case, let $\mathbf{U} = A(B+X^{T}YX)^{-1}A^{T} + C$ and $g(\mathbf{U}) = \operatorname{trace}(\mathbf{U}^{-1})$. First, observe that
$$ \frac{\partial g(\mathbf{U})}{\partial \mathbf{U}} = -\mathbf{U}^{-T}\mathbf{U}^{-T}. $$

Next, we evaluate $\frac{\partial \mathbf{U}}{\partial X_{i,j}}$:
\begin{align}
\frac{\partial \mathbf{U}}{\partial X_{i,j}} &= \frac{\partial}{\partial X_{i,j}}\left(A(B+X^{T}YX)^{-1}A^{T} + C\right)\\
&= \frac{\partial}{\partial X_{i,j}}\left(A(B+X^{T}YX)^{-1}A^{T}\right)\\
&= A\left(\frac{\partial}{\partial X_{i,j}}(B+X^{T}YX)^{-1}\right)A^{T}\\
&= -A(B+X^{T}YX)^{-1}\left(\frac{\partial}{\partial X_{i,j}}(B+X^{T}YX)\right)(B+X^{T}YX)^{-1}A^{T} \\
&= -A(B+X^{T}YX)^{-1}\left(X^{T}YJ^{ij} + J^{ji}YX\right)(B+X^{T}YX)^{-1}A^{T},
\end{align}
in which $(J^{ij})_{kl} = \delta_{ik} \delta_{jl}$, where $\delta$ denotes the Kronecker delta. In the above calculation I used
\begin{align}
&\frac{\partial}{\partial X_{i,j}}\left(X^{T}YX\right) = X^{T}YJ^{ij} + J^{ji}YX,\\
&\frac{\partial}{\partial X_{i,j}}\left(Z(X)\right)^{-1} = -Z^{-1}\left(\frac{\partial}{\partial X_{i,j}}Z\right)Z^{-1}.
\end{align}
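As an illustration of how this per-entry formula can be evaluated, here is a short NumPy sketch (the matrices, the tested index and the helper names are illustrative assumptions) that builds $J^{ij}$, applies the chain rule, and checks one entry against a central difference:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
M = rng.standard_normal((n, n))
B = M @ M.T                                    # symmetric, positive semidefinite
C = np.diag(rng.uniform(0.5, 2.0, n))          # diagonal, positive definite
Y = np.diag(rng.uniform(0.5, 2.0, n))          # diagonal, positive definite
X = np.diag(rng.uniform(0.5, 2.0, n))

def trace_f(Xm):
    S = np.linalg.inv(B + Xm.T @ Y @ Xm)
    return np.trace(np.linalg.inv(A @ S @ A.T + C))

def d_trace_f_dXij(i, j):
    S = np.linalg.inv(B + X.T @ Y @ X)
    Uinv = np.linalg.inv(A @ S @ A.T + C)
    dg_dU = -Uinv.T @ Uinv.T                   # d trace(U^{-1}) / dU = -U^{-T} U^{-T}
    J = np.zeros((n, n))
    J[i, j] = 1.0                              # single-entry matrix J^{ij}
    dU_dXij = -A @ S @ (X.T @ Y @ J + J.T @ Y @ X) @ S @ A.T
    return np.trace(dg_dU.T @ dU_dXij)         # chain rule: trace((dg/dU)^T dU/dX_ij)

# central-difference check for one diagonal entry, X_{00}
h = 1e-5
E = np.zeros((n, n))
E[0, 0] = 1.0
print(d_trace_f_dXij(0, 0),
      (trace_f(X + h * E) - trace_f(X - h * E)) / (2 * h))
```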

  • 0
Just a question: when you write $U^{-T}$, do you mean the inverse of the transpose? 2012-12-01
  • 1
Yes, $U^{-T}$ means the inverse of the transpose. 2012-12-01
  • 0
Last question... assuming that $X$ is an arbitrary rectangular m-by-n matrix, $B$ is m-by-m and all other matrices are n-by-n, is it still possible to do all the calculations you posted? 2012-12-03