I'm working on using the Generalized Method of Moments to analyze some yogurt purchase data, and in the course of trying to implement the standard Hansen method (i.e. not an empirical likelihood method), I need to compute first and second derivatives of the following function:
$Q(\theta) = \biggl[\frac{1}{N}\sum_{i=1}^{N}\psi(Z_{i},\theta)\biggr]^{T}C\biggl[\frac{1}{N}\sum_{i=1}^{N}\psi(Z_{i},\theta)\biggr].$
Here, $\psi(Z_{i},\theta)$ is a vector function (in my case a 9-by-1 column vector of the moment conditions; you can just think of each component as a function of the scalar parameter $\theta$ if you wish). The $Z_{i}$ are the individual purchase data. $C$ is a weight matrix derived from the model assumptions, but you can treat it as just the identity matrix of suitable size if you want; it shouldn't matter as it isn't a function of $\theta$.
If I let $F(\theta) = \frac{1}{N}\sum_{i=1}^{N}\psi(Z_{i},\theta)$ for simplicity, then where I get stuck is computing the first and second derivatives of $Q$ with respect to $\theta$. Since this is a scalar-valued objective function of a single scalar parameter, both derivatives should work out to be scalars.
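To make the setup concrete, here is a minimal sketch of $F$ and $Q$ in NumPy. The moment function $\psi$, the matrices `A` and `B`, and the linear-in-$\theta$ form are all hypothetical stand-ins for illustration, not the actual yogurt-data moments:

```python
import numpy as np

# Hypothetical example: psi(Z_i, theta) = a_i + b_i * theta,
# with N observations and m = 9 moment conditions.
rng = np.random.default_rng(0)
N, m = 50, 9
A = rng.normal(size=(N, m))   # stands in for the data-dependent part
B = rng.normal(size=(N, m))   # coefficients on theta
C = np.eye(m)                 # weight matrix (identity, as suggested above)

def F(theta):
    """Sample average of the moment conditions: an m-vector."""
    return (A + B * theta).mean(axis=0)

def Q(theta):
    """GMM objective F(theta)^T C F(theta): a scalar."""
    f = F(theta)
    return f @ C @ f
```

With $C$ positive semidefinite, `Q(theta)` is nonnegative for any $\theta$, which is a quick sanity check on the implementation.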
Based on the Wikipedia article on matrix calculus, here is what I have so far: $\frac{dQ}{d\theta} = \frac{dQ}{dF}\cdot\frac{dF}{d\theta} = \biggl[ F(\theta)^{T}(C+C^{T})\biggr]\cdot\biggl[\frac{d}{d\theta}F(\theta)\biggr].$
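One way to check a derivative formula like this is to compare it against a central finite difference. The sketch below does that for the first-derivative expression, again using a hypothetical linear-in-$\theta$ $\psi$ (the `A`, `B` arrays are illustrative, not the real data):

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 50, 9
A = rng.normal(size=(N, m))
B = rng.normal(size=(N, m))
C = np.eye(m)

def F(theta):
    return (A + B * theta).mean(axis=0)

def dF(theta):
    # Derivative of the sample moments w.r.t. theta; constant here
    # because psi is linear in theta in this toy example.
    return B.mean(axis=0)

def Q(theta):
    f = F(theta)
    return f @ C @ f

def dQ(theta):
    # The candidate formula: F(theta)^T (C + C^T) dF/dtheta
    return F(theta) @ (C + C.T) @ dF(theta)

# Central finite-difference check of dQ at an arbitrary point
theta0, h = 0.3, 1e-6
fd = (Q(theta0 + h) - Q(theta0 - h)) / (2 * h)
assert abs(dQ(theta0) - fd) < 1e-5
```

If the analytic and finite-difference values disagree, the error is in the derivative formula (or its coding) rather than in Newton's method itself.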
Next I want to differentiate again, so I apply the product rule. In my case, the final factor, $\frac{d}{d\theta}F(\theta)$, does not depend on $\theta$ (all of its components are constants, since my moment conditions are linear in $\theta$), so its derivative is zero and only the first factor of the product contributes:
$\frac{d^{2}Q}{d\theta^{2}} = \biggl[\frac{d}{d\theta}\Bigl( F(\theta)^{T}(C+C^{T})\Bigr)\biggr]\cdot\biggl[\frac{d}{d\theta}F(\theta)\biggr].$
As far as I can tell, this evaluates to $\frac{d^{2}Q}{d\theta^{2}} = \biggl(\frac{d}{d\theta}F(\theta) \biggr)^{T}(C+C^{T})\biggl(\frac{d}{d\theta}F(\theta) \biggr).$
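Putting the two derivative formulas into a Newton iteration gives the sketch below. Note that if $\frac{d}{d\theta}F(\theta)$ really is constant, then $Q$ is exactly quadratic in $\theta$ and Newton's method should converge in a single step; failure to converge would then point to a coding error rather than the formulas. The moment setup (`A`, `B`, linear $\psi$) is again a hypothetical stand-in:

```python
import numpy as np

# Hypothetical linear-in-theta moments, as in the setup above.
rng = np.random.default_rng(0)
N, m = 50, 9
A = rng.normal(size=(N, m))
B = rng.normal(size=(N, m))
C = np.eye(m)

F = lambda t: (A + B * t).mean(axis=0)
dF = B.mean(axis=0)          # constant, since psi is linear in theta

def dQ(t):
    # First derivative: F(t)^T (C + C^T) dF/dtheta
    return F(t) @ (C + C.T) @ dF

def d2Q(t):
    # Second derivative: (dF/dtheta)^T (C + C^T) (dF/dtheta)
    return dF @ (C + C.T) @ dF

# Newton's method: theta_{k+1} = theta_k - dQ(theta_k) / d2Q(theta_k)
theta = 5.0
for _ in range(5):
    theta = theta - dQ(theta) / d2Q(theta)

assert abs(dQ(theta)) < 1e-8   # first-order condition at the minimizer
```

Since $Q$ is quadratic here, the very first update already lands on the minimizer; the remaining iterations leave $\theta$ unchanged up to floating-point noise.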
This gives a nice formula, but when I use these first- and second-derivative results to implement Newton's method to find the value of $\theta$ that minimizes the quadratic form, the method does not converge, and I suspect it is because I have computed the derivatives incorrectly (a missing transpose, or something similar).
Additionally, links to good, clearly written references explaining the logic behind matrix calculus, especially when and why transposes appear, would be appreciated. Nearly everything I found in 30+ minutes of Googling was inscrutable and tended to just state results with no exposition at all.