
Let $A,B,X$ be square matrices of the same size. I know $\ \displaystyle \frac{\mathrm{d}~ \mathrm{trace} (AX)}{\mathrm{d}X}=A.\quad$

Is it true that $\displaystyle \frac{\mathrm{d}~ B~ \mathrm{trace} (AX)}{dX}=B\otimes A$, where $\otimes$ means the Kronecker product?

  • 1
Hmm... sort of, but for clarity you really should write how the derivative is applied to a tangent vector. In other words, you should specify what the directional derivative is, say, in the direction of some square matrix $Y$.2011-07-07

2 Answers

6

Let me give an argument for why you shouldn't write it as a Kronecker product, at least not in the way you've written it.

The Kronecker product is a nice way to represent tensor products of linear operators. That is to say, if $L_1:V_1 \to W_1$ is a linear operator between vector spaces $V_1$ and $W_1$, and if $L_2:V_2\to W_2$ is a linear operator, we can pick bases of $V_1, V_2, W_1, W_2$ and express the linear operators as matrices $[L_1]$ and $[L_2]$, while expressing vectors $v_1\in V_1$ and $v_2\in V_2$ relative to the bases as column vectors $[v_1]$ and $[v_2]$. Now, the nice property of the Kronecker product is that the tensor product $L_1\otimes L_2$, which is a linear transformation from the tensor product space $V_1\otimes V_2$ to the tensor product space $W_1\otimes W_2$, can be written relative to the natural bases of those tensor product spaces as

$$ [L_1\otimes L_2] = [L_1] \otimes_K [L_2] $$

(here using $\otimes_K$ to denote the Kronecker product operation on matrices). The advantage to this is that we can do the same for the vectors $v_1,v_2$: the vector $v_1\otimes v_2\in V_1\otimes V_2$ can be represented in the same basis as above, simply by

$$ [v_1\otimes v_2] = [v_1] \otimes_K [v_2] $$

AND the result

$$(L_1\otimes L_2)(v_1\otimes v_2) = (L_1v_1 \otimes L_2 v_2) \in W_1\otimes W_2$$

is also nicely represented:

$$ [L_1v_1\otimes L_2v_2] = [L_1][v_1]\otimes_K[L_2][v_2] = \left([L_1]\otimes_K [L_2]\right)\left([v_1]\otimes_K [v_2]\right) $$

(Just to clarify: the first term is taking the basis representation of the vector $L_1v_1\otimes L_2v_2$; the second term is the Kronecker product of the two vectors formed by the multiplication of the matrix representations $[L_i]$ of the linear operators with the column representations $[v_i]$ of the vectors; the third term is the matrix multiplication of a larger matrix, obtained by the Kronecker product of the matrices of the linear operators, against a larger vector, obtained by the Kronecker product of the column vectors.)
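
To make this compatibility concrete, here is a minimal numerical sketch in Octave; the particular operators and vectors below are arbitrary examples, not anything from the question:

    % Numerical check of the compatibility described above:
    % kron(L1,L2) applied to kron(v1,v2) represents (L1 v1) tensor (L2 v2).
    L1 = [1 2; 3 4];   L2 = [0 1; -1 2];   % matrix representations [L1], [L2]
    v1 = [1; -1];      v2 = [2; 3];        % column representations [v1], [v2]
    lhs = kron(L1*v1, L2*v2);              % representation of (L1 v1) tensor (L2 v2)
    rhs = kron(L1, L2) * kron(v1, v2);     % big matrix times big vector
    disp(max(abs(lhs - rhs)))              % prints 0 (up to round-off)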

All this is to say that, when one writes "Kronecker product", one needs to do so relative to the correct matrix representation of a linear operator, because the power of the Kronecker product lies in this compatibility. (There actually is a nice commutative diagram here, for those who know what commutative diagrams are.)

Now, if you just take $A$ and $B$ to be square matrices and take the Kronecker product $A\otimes_K B$, you are not respecting the true nature of the objects $A$ and $B$. What is $A$? Writing it as a square matrix "suggests" that you are thinking of it as a linear operator between two vector spaces of the same dimension. But this is in fact not the case.

Consider the expression

$$ \frac{d}{dX} tr(AX) = A $$

Recall from multivariable calculus, or from real analysis, that the derivative of a function, evaluated at a point, is a linear functional. It takes as input a tangent direction and outputs a real number (let's assume everything is real valued). Well, in this case your underlying space is the space of square matrices. And so the derivative takes as input a square matrix, and outputs a real number. That is: the derivative of $tr(AX)$ in the $Y$ direction, where $Y$ is a square matrix, is

$$ \lim_{t\to 0} \frac{1}{t}\left( tr(A(X+tY)) - tr(AX)\right) = tr(AY) $$

So $A$ here on the right hand side of the derivative expression is actually a linear operator sending square matrices to real numbers via the operation $Y\mapsto tr(AY)$.
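
If it helps, here is a quick finite-difference sketch of this directional derivative in Octave; the matrices are random examples, and the quotient agrees exactly up to round-off because $tr(AX)$ is linear in $X$:

    % Directional derivative of tr(AX) at X in the direction Y equals tr(AY)
    N = 3;
    A = randn(N);  X = randn(N);  Y = randn(N);
    t = 1e-6;
    numeric  = (trace(A*(X + t*Y)) - trace(A*X)) / t;   % difference quotient
    analytic = trace(A*Y);
    disp(abs(numeric - analytic))                        % small (round-off only)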

So let us say we are looking at $N\times N$ matrices. $A$ on the right hand side of the derivative expression should now be taken to be a linear transformation $V\to \mathbb{R}$, where $V$ is the space of $N\times N$ matrices. So the "correct representation" of $A$, using the fact that you can identify the space of $N\times N$ matrices with a vector space $V$ of dimension $N^2$, is $[A]$, a $1\times N^2$ matrix (which you can get by "unfolding" the rows of $A$).
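
As a small Octave illustration of this (again with random example matrices): unfolding the rows of $A$ into a $1\times N^2$ row vector and letting it act on the column-stacked $Y$ reproduces the functional $Y\mapsto tr(AY)$.

    % [A] as a 1 x N^2 row vector acting on Y stacked into an N^2 x 1 column
    N = 3;
    A = randn(N);  Y = randn(N);
    rowA = reshape(A', 1, N^2);              % rows of A, unfolded
    disp(abs(rowA * Y(:) - trace(A*Y)))      % agrees up to round-off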

So the correct representation of the tensor product $B\otimes A$ using a Kronecker product, properly capturing the nature of $A$ in your equation for the derivative of $B~tr(AX)$, is the Kronecker product of $[B]$ against $[A]$, where $[A]$, as above, should be a corresponding $1\times N^2$ matrix, representing the fact that it is a linear transformation from an $N^2$ dimensional vector space to the real numbers, while $[B]$ should be analogously described by thinking about what it really stands for.

(As you mentioned in the comment to leonbloy's answer, $B$ is obtained from applying the Leibniz rule to $tr(BX)tr(AX)$. So $B$ actually should also be written as $[B]$, some $1\times N^2$ matrix. By thinking like this, you clearly see how the positive semi-definiteness of the matrices $A$ and $B$ is completely irrelevant when it comes to considering the convexity of the function $tr(BX)tr(AX)$.)

  • 0
    I am trying to digest your argument. Thank you for your time.2011-07-08
3

This somewhat depends on how you define a derivative of a matrix with respect to a matrix.

First, the natural way of defining the derivative of a scalar function $g(X)$ with respect to a matrix $X$ is as another matrix $Y$ of the same size as $X$:

$$ \frac{d g(X)}{d X} = Y \;\; \Leftrightarrow \;\; Y_{i,j}=\frac{\partial g(X)}{\partial x_{i,j}}$$

Applying this definition you get your first identity. Now, if we differentiate a matrix-valued function with respect to a matrix, we get a 3D array (a tensor); alternatively, we can express it as a block matrix ($n^2 \times n^2$). In the particular case in which we must differentiate a matrix function of the form $B g(X)$ (with $B$ not dependent on $X$), we get (with this definition) a Kronecker product, as the natural generalization of the scalar case.
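
As a sanity check of this entrywise definition, here is a small Octave sketch for $g(X)=tr(AX)$; note that I take $A$ symmetric (as in the OP's positive semidefinite setting), so that the resulting matrix of partial derivatives is $A$ itself:

    % Entrywise derivative of g(X) = tr(AX) via finite differences
    N = 3;
    A = randn(N);  A = (A + A')/2;      % symmetric example matrix
    X = randn(N);  h = 1e-6;
    Y = zeros(N);
    for i = 1:N
      for j = 1:N
        E = zeros(N);  E(i,j) = 1;      % perturb the single entry x_{i,j}
        Y(i,j) = (trace(A*(X + h*E)) - trace(A*X)) / h;
      end
    end
    disp(max(abs(Y(:) - A(:))))         % small: Y recovers A entrywise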

Added: Two references

Added 2: To illustrate the discussion in the comments. What the OP wanted to write was

$$\frac{\partial tr(A X)}{\partial X} = A $$

$$ \frac{\partial \; tr(A X) \; tr(B X)}{\partial X} = A \; tr(B X) + B \; tr(A X)$$

$$\frac{\partial \; B \; tr(A X)}{\partial X} = B \otimes A$$

with $A$ and $B$ positive definite. Hence

$$\frac{\partial^2 \; tr(A X) \; tr(B X)}{\partial X^2} = B \otimes A + A \otimes B $$

and conclude from this that, as this last matrix is positive definite, the Hessian of the function $tr(A X) \; tr(B X)$ is PSD and hence the function is convex.

This (as Willie Wong points out in the comments, and contrary to what I first tended to believe) is false, and this shows that one must be careful when working with different definitions of matrix derivatives, when the order of the indexes matters.

To compute (safely) the Hessian, recall that $X$, our 'variable', can be regarded as an $N \times N$ or as an $N^2 \times 1$ matrix (we can write ${\mathbf x} = \mathrm{vec}(X)$, where $\mathrm{vec}()$ is the operator that 'stacks' the columns of the matrix into a column vector). In this formulation, it's easy to see that the first derivative must be expressed as

$$\frac{\partial tr(A X)}{\partial {\bf x}} = {\bf a} = \mathrm{vec}(A) $$

and $$ \frac{\partial^2 \; tr(A X) \; tr(B X)}{\partial {\bf x}^2} = {\bf a}\, {\bf b}^T + {\bf b}\, {\bf a}^T$$

The analogy is evident, but what is less evident is whether they are equivalent, or whether this last matrix (of size $N^2 \times N^2$) is also guaranteed to be PSD. Well, it's not. An example:

Let $A=\pmatrix{ 7 &1\cr1& 3\cr}$, $B = \pmatrix{ 5 &2\cr2& 1\cr}$ , both PSD.

Let's compute both Hessians (the correct one and the false one):

    >>> D = vec(A)*vec(B)' + (vec(A)*vec(B)')'
    D =
       70   19   19   22
       19    4    4    7
       19    4    4    7
       22    7    7    6
    >>> min(eig(D))
    ans = -3.1664
    >>> E = kron(A,B) + kron(B,A)
    E =
       70   19   19    4
       19   22    4    7
       19    4   22    7
        4    7    7    6
    >>> min(eig(E))
    ans = 0.98603

We see that, while $E$ is PSD, as it should be, the "true second derivative" matrix $D$ is not positive semidefinite. And the two matrices coincide only up to some row/column rearrangement... and of course, PSD-ness is sensitive to such rearrangements.
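
For completeness, here is a hedged Octave sketch checking that $D$ above really is the Hessian of $f(X)=tr(AX)\,tr(BX)$ with respect to $\mathrm{vec}(X)$, via second-order finite differences (exact up to round-off since $f$ is quadratic); the base point $x_0$ is a random example:

    % Numerical Hessian of f(X) = tr(AX) tr(BX) with respect to vec(X)
    A = [7 1; 1 3];   B = [5 2; 2 1];
    f = @(x) trace(A*reshape(x,2,2)) * trace(B*reshape(x,2,2));
    x0 = randn(4,1);  h = 1e-3;  H = zeros(4);
    for k = 1:4
      for l = 1:4
        ek = zeros(4,1); ek(k) = 1;  el = zeros(4,1); el(l) = 1;
        H(k,l) = (f(x0+h*ek+h*el) - f(x0+h*ek) - f(x0+h*el) + f(x0)) / h^2;
      end
    end
    D = vec(A)*vec(B)' + vec(B)*vec(A)';
    disp(max(abs(H(:) - D(:))))   % small: the second-difference Hessian matches D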

  • 0
So in the natural sense, my derivation is correct?2011-07-07
  • 0
I'd say yes. But bear in mind that there can be more than one "natural" definition, e.g. transposing the row-column indexes, or the "order" of the blocks (i.e., the order of the Kronecker product)2011-07-07
  • 1
My consideration starts here.... Let $A,B$ be positive semidefinite and consider $f(X)=trace (AX)\, trace (BX)$ defined on the positive semidefinite cone. The Hessian of $f(X)$ is $A\otimes B+B\otimes A$, which implies $f(X)$ is convex, but it seems not to hold when I use the definition to verify the convexity of $f(X)$.2011-07-07
  • 0
    Interesting! Perhaps you should expand your question with this. I'm not 100% sure of the implication (though I'm inclined to believe it). I'm curious about your (contra)verification.2011-07-07
  • 0
    @Sunni: I don't see how you expect the Hessian condition gives rise to convexity. The tangent space at a point in the pos. semidef. cone is ALL the square matrices. So you can't guarantee that the Hessian is itself pos. semidef. (as a bilinear form on the space of square matrices). For example, take $tr(AB) = 0$ and evaluate the Hessian on $Y = A-B$.2011-07-07
  • 0
    @Willie: Perhaps I am confused with this. For $x^TAx$, where $x$ is a column vector, it is convex iff $A$ is psd.2011-07-07
  • 0
But here we can regard a matrix as a stacked vector (the vec operation), and this is indeed the way in which the matrix derivative expressed as a Kronecker product arises. What I still don't get is why you want to restrict the domain of $f(X)$ to the positive definite matrices.2011-07-07
  • 0
    @Sunni: in this case, like leonbloy said, the vector $x$ is actually some matrix $Y$. And the matrix "A" is the mapping $Y\mapsto tr(BY)A + tr(AY)B$. (Your vector space is the space of square matrices!) So your Hessian is **not** positive semidefinite!2011-07-07
  • 0
To put it slightly differently: $tr(AB)$ represents an inner product on the space of matrices. And you should consider the following analogy: let $u,v$ be fixed vectors in $\mathbb{R}^n$ with all nonnegative components (this is, roughly speaking, to capture the PSD condition). Consider the function defined on $\mathbb{R}^n$ by $x\mapsto (u\cdot x)(v\cdot x)$. Its Hessian is $u\otimes v + v\otimes u$ (now the tensor product makes it a matrix in the usual sense). And it is easy to see that this Hessian is _not_ psd, unless $u=v$!2011-07-07
  • 0
This raises the question of whether the Kronecker product of two positive definite matrices is (in the new expanded space) also positive definite... Is it?2011-07-07
  • 0
@Willie : what I don't get (perhaps because I don't have much familiarity with the Kronecker product) is how you know that $u\otimes v + v\otimes u$ is not psd.2011-07-07
  • 0
@leonbloy: consider first the case where $u$ and $v$ are both unit vectors; then clearly $u\cdot (u-v)$ is positive and $v\cdot (u-v)$ is negative. Hence evaluating the quadratic form $(u\otimes v)(u-v,u-v)$ gives a product of a positive number and a negative number. For the general case write $u\otimes v$ as $|u||v| \hat{u}\otimes \hat{v}$, and evaluate against $(\hat{u} - \hat{v})$. (See the small numerical check after these comments.)2011-07-07
  • 0
The Kronecker product of two PSD matrices is *indeed* PSD. The problem is that *that particular matrix* is not a representation of the "Hessian matrix". This all comes down to "which indices are you acting the object on", which I referred to in my comment to the question, and which you alluded to in your answer.2011-07-07
  • 0
+1 The second addendum is nicely written; it is especially good that you provided an example!2011-07-07
  • 0
    I see now. Thanks to you two.2011-07-08
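
Following up on the comments above, here is a quick Octave check (with arbitrary example vectors) that $u\otimes v + v\otimes u$ fails to be PSD for non-parallel unit vectors $u,v$ with nonnegative components:

    % Hessian of x -> (u.x)(v.x) is u v' + v u'; not PSD unless u and v are parallel
    u = [1; 0];
    v = [1; 1] / sqrt(2);
    H = u*v' + v*u';
    disp(min(eig(H)))     % about -0.2929 < 0, so H is not PSD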