I can follow the steps, but I don't really understand what they are meant to show. The whole derivation is done to prove the inequality: what does the inequality actually say?
Proof for Maximum Likelihood Estimation
-
Try to write down the likelihood of the model yourself. Assume $\mu = 0$ for simplicity. The likelihood will be of the form given on the LHS of the inequality, with $(1/2b) B$ being the sample covariance matrix. The inequality shows that the likelihood is maximized at the sample covariance matrix. – 2017-01-04
-
Here's a Wikipedia article on this: https://en.wikipedia.org/wiki/Estimation_of_covariance_matrices I've always thought it's mildly interesting that in this argument it is fruitful to view a scalar as the trace of a $1\times1$ matrix rather than simply as a scalar. $\qquad$ – 2017-01-05
1 Answer
Assume $x_i \sim N(0,\Sigma), i=1,\dots,n$, with $x_i \in \mathbb R^d$. We have $$p_\Sigma(x_i) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \exp(-\tfrac12 x_i^T \Sigma^{-1} x_i).$$ It is easier to reparametrize in terms of the precision matrix $\Gamma := \Sigma^{-1}$. Then we have $$ p_\Gamma(x_i) \; \propto_{\Gamma} \;|\Gamma|^{1/2} \exp(-\tfrac12 x_i^T \Gamma x_i) $$ where $\propto_\Gamma$ means that the LHS, viewed as a function of $\Gamma$, is proportional to the RHS. That is, we are suppressing the constant $(2\pi)^{-d/2}$, which does not depend on $\Gamma$. Note also that $p_\Gamma(x_i)$, viewed as a function of $\Gamma$, is what statisticians call the likelihood (in this case, based on the single sample $x_i$).
Then the joint density is $p_\Gamma(x) = \prod_i p_\Gamma(x_i)$, and hence the joint likelihood is $$\ell(\Gamma|x) := p_\Gamma(x) \;\propto_\Gamma\; |\Gamma|^{n/2} \exp\Big(-\frac12\sum_i x_i^T \Gamma x_i\Big).$$ We use the following trace trick: $$x_i^T \Gamma x_i = \text{tr}(x_i^T \Gamma x_i) = \text{tr}(\Gamma x_i x_i^T)$$ where the first equality holds because (as was pointed out) a scalar, viewed as a $1\times1$ matrix, equals its own trace, and the second uses the invariance of the trace under cyclic permutation of its arguments.
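The trace trick can be checked numerically. Here is a minimal sketch in NumPy; the vector $x$ and the positive definite matrix $\Gamma$ below are arbitrary test data, not from any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
x = rng.standard_normal(d)            # one sample x_i
A = rng.standard_normal((d, d))
Gamma = A @ A.T + d * np.eye(d)       # an arbitrary positive definite "precision" matrix

quad = x @ Gamma @ x                       # x^T Gamma x, a scalar
tr1 = np.trace(np.outer(x, x) @ Gamma)     # tr(x x^T Gamma)
tr2 = np.trace(Gamma @ np.outer(x, x))     # tr(Gamma x x^T), cyclic permutation

# all three agree, as the identity claims
assert np.isclose(quad, tr1) and np.isclose(quad, tr2)
```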
Using the linearity of the trace, the likelihood is $$\ell(\Gamma|x) \;\propto_\Gamma \;|\Gamma|^{n/2} \exp\Big[{-\frac12 \text{tr}\big(\Gamma \sum_i x_i x_i^T\big)}\Big].$$ Defining $S = \frac1n \sum_i x_i x_i^T$, we can write $$\ell(\Gamma|x) \;\propto_\Gamma\; |\Gamma|^{n/2} \exp\Big[{-\frac{n}2 \text{tr}\big(\Gamma S \big)}\Big].$$
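This rewriting of the exponent can also be verified numerically. A small sketch (the data matrix `X` and precision matrix `Gamma` are arbitrary), checking that $-\tfrac12\sum_i x_i^T \Gamma x_i = -\tfrac{n}{2}\,\text{tr}(\Gamma S)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.standard_normal((n, d))       # rows are the samples x_i
A = rng.standard_normal((d, d))
Gamma = A @ A.T + d * np.eye(d)       # arbitrary positive definite matrix

S = (X.T @ X) / n                     # sample covariance (mean assumed zero)
direct = -0.5 * sum(x @ Gamma @ x for x in X)   # sum of quadratic forms
via_trace = -(n / 2) * np.trace(Gamma @ S)      # trace form of the exponent

assert np.isclose(direct, via_trace)
```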
We can reparametrize back by setting $\Gamma = \Sigma^{-1}$.
What the inequality is showing is that the function $$ \Sigma \mapsto |\Sigma|^{-n/2} \exp\Big(-\frac{n}2 \text{tr}\big(\Sigma^{-1} S\big)\Big) $$ is maximized at $\Sigma = S$. (Here $b = n/2$ and $B = n S$, to match your notation.)
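This maximization claim can be sanity-checked numerically: evaluate the log of the function above at $\Sigma = S$ and at random positive definite perturbations of $S$, and confirm that none of the perturbations does better. The helper name `loglik` below is just for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 3
X = rng.standard_normal((n, d))
S = (X.T @ X) / n                     # sample covariance, mu = 0

def loglik(Sigma):
    # log of |Sigma|^{-n/2} exp(-(n/2) tr(Sigma^{-1} S))
    sign, logdet = np.linalg.slogdet(Sigma)
    return -(n / 2) * logdet - (n / 2) * np.trace(np.linalg.solve(Sigma, S))

best = loglik(S)
for _ in range(100):
    A = rng.standard_normal((d, d))
    Sigma = S + 0.1 * (A @ A.T)       # random positive definite perturbation
    assert loglik(Sigma) <= best      # never beats Sigma = S
```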
-
Although I still have difficulty understanding the whole calculation, I am still trying. Could you show some more of the steps in the calculation above? I would also appreciate some more explanation, if possible. – 2017-02-24
-
@user122358 Added some more details. – 2017-02-25
