
I am trying to prove that the observed information matrix, evaluated at a weakly consistent maximum likelihood estimator (MLE), is a weakly consistent estimator of the expected information matrix. This result is widely quoted, but I have not found a reference or a proof (I think I have exhausted the first 20 pages of Google results and my statistics textbooks)! I think this is one reason many people hate statistics: it is hard enough already, and it is made harder still when results are quoted without references!

Given a weakly consistent sequence of MLEs, it is tempting to combine the weak law of large numbers (WLLN) with the continuous mapping theorem to get the result I want. However, I believe the continuous mapping theorem cannot be used here, because the map $\theta\mapsto N^{-1}J_N(\theta)$ is itself random and changes with $N$. Instead, I think a uniform law of large numbers (ULLN) is needed. Does anybody know of a reference that has a proof of this? I have an attempt using the ULLN but omit it for now for brevity.

I apologise for the length of this question, but some notation has to be introduced. The notation is as follows (my proof is at the end).

Assume we have an iid sample of random variables $\{Y_1,\ldots,Y_N\}$ with densities $f(\tilde{Y}|\theta)$, where $\theta\in\Theta\subseteq\mathbb{R}^{k}$ (here $\tilde{Y}$ is just a generic random variable with the same density as any member of the sample). The vector $Y=(Y_1,\ldots,Y_N)^{T}$ collects all the sample vectors, where $Y_{i}\in\mathbb{R}^{n}$ for all $i=1,\ldots,N$. The true parameter value is $\theta_{0}$, and $\hat{\theta}_{N}(Y)$ is the weakly consistent maximum likelihood estimator (MLE) of $\theta_{0}$. Subject to regularity conditions, the Fisher information matrix can be written as

$$I(\theta)=-E_\theta \left[H_{\theta}(\log f(\tilde{Y}|\theta))\right]$$

where ${H}_{\theta}$ is the Hessian matrix. The sample equivalent is

$$I_N(\theta)=\sum_{i=1}^N I_{y_i}(\theta),$$

where $I_{y_i}(\theta)=-E_\theta \left[H_{\theta}(\log f(Y_{i}|\theta))\right]$. The observed information matrix is

$$J(\theta) = -H_\theta(\log f(y|\theta))$$

(some people require this matrix to be evaluated at $\hat{\theta}$, but others do not). The sample observed information matrix is

$$J_N(\theta)=\sum_{i=1}^N J_{y_i}(\theta),$$

where $J_{y_i}(\theta)=-H_\theta(\log f(y_{i}|\theta))$.

I can prove that $N^{-1}J_N(\theta)$ converges in probability to $I(\theta)$ for fixed $\theta$, but not that $N^{-1}J_{N}(\hat{\theta}_N(Y))$ converges in probability to $I(\theta_{0})$. Here is my proof so far:

Now $(J_{N}(\theta))_{rs}=-\sum_{i=1}^N (H_\theta(\log f(Y_i|\theta)))_{rs}$ is element $(r,s)$ of $J_N(\theta)$, for any $r,s=1,\ldots,k$. If the sample is iid, then by the weak law of large numbers (WLLN), the average of these summands converges in probability to $-E_{\theta}[(H_\theta(\log f(Y_{1}|\theta)))_{rs}]=(I_{Y_1}(\theta))_{rs}=(I(\theta))_{rs}$. Thus $N^{-1}(J_N(\theta))_{rs}\overset{P}{\rightarrow}(I(\theta))_{rs}$ for all $r,s=1,\ldots,k$, and so $N^{-1}J_N(\theta)\overset{P}{\rightarrow}I(\theta)$. Unfortunately, it seems we cannot simply conclude that $N^{-1}J_{N}(\hat{\theta}_N(Y))\overset{P}{\rightarrow}I(\theta_0)$ by using the continuous mapping theorem.
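To make the two statements concrete, here is a minimal numerical sketch. It assumes a one-parameter model that is not part of the question, $Y_i\sim N(0,\theta)$ with $\theta$ the variance, chosen only because the MLE is available in closed form and the observed information depends on the data. The function names are hypothetical. For this model $J_{y}(\theta)=-\tfrac{1}{2\theta^{2}}+\tfrac{y^{2}}{\theta^{3}}$ and $I(\theta)=\tfrac{1}{2\theta^{2}}$.

```python
import random

def observed_info_avg(ys, theta):
    """N^{-1} J_N(theta) for the assumed N(0, theta) model (theta = variance)."""
    # Per-observation observed information:
    #   -d^2/dtheta^2 log f(y|theta) = -1/(2 theta^2) + y^2 / theta^3
    n = len(ys)
    return sum(-0.5 / theta**2 + y * y / theta**3 for y in ys) / n

def fisher_info(theta):
    """Expected information I(theta) = 1 / (2 theta^2) for the same model."""
    return 0.5 / theta**2

def simulate(theta0, n, seed=0):
    """Draw an iid sample of size n from N(0, theta0)."""
    rng = random.Random(seed)
    sd = theta0 ** 0.5
    return [rng.gauss(0.0, sd) for _ in range(n)]

theta0 = 2.0
for n in (100, 10_000, 100_000):
    ys = simulate(theta0, n)
    theta_hat = sum(y * y for y in ys) / n  # closed-form MLE of the variance
    # Columns: N, average observed info at theta0, at the MLE, and I(theta0)
    print(n, observed_info_avg(ys, theta0),
          observed_info_avg(ys, theta_hat), fisher_info(theta0))
```

In this particular model the plug-in quantity simplifies to $1/(2\hat{\theta}_N^{2})$, so its convergence follows directly from consistency of $\hat{\theta}_N$. The simulation only illustrates the claim; it does not replace the general argument.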

  • +1. Please: don't write \text{log}. Just write \log, with a backslash. This not only prevents italicization, but also results in proper spacing in things like $\log x$. (2012-01-28)
  • Thanks Michael - \log is also shorter and thus less effort to write! (2012-01-28)

1 Answer

I am answering my own question by showing, using a uniform law of large numbers, that the observed information matrix is a strongly consistent estimator of the information matrix, i.e. $N^{-1}J_{N}(\hat{\theta}_{N}(Y))\overset{a.s.}{\longrightarrow}I(\theta_{0})$ if we plug in a strongly consistent sequence of estimators. I hope it is correct in all details.

We will use $I_{N}=\{1,2,\ldots,N\}$ as an index set, and let us temporarily adopt the notation $J(\tilde{Y},\theta):=J(\theta)$ in order to be explicit about the dependence of $J(\theta)$ on the random vector $\tilde{Y}$. We shall also work elementwise with $(J(\tilde{Y},\theta))_{rs}$ and $(J_{N}(\theta))_{rs}=\sum\nolimits_{i=1}^{N}(J(Y_{i},\theta))_{rs}$, $r,s=1,\ldots,k$, for this discussion. The function $(J(\cdot,\cdot))_{rs}$ is real-valued on the set $\mathbb{R}^{n}\times\Theta^{\circ}$, and we will suppose that $(J(\cdot,\theta))_{rs}$ is Lebesgue measurable for every $\theta\in\Theta^{\circ}$. A uniform (strong) law of large numbers gives a set of conditions under which

$$\begin{aligned}
&\sup_{\theta\in\Theta}\left|N^{-1}(J_{N}(\theta))_{rs}-E_{\theta}\left[(J(Y_{1},\theta))_{rs}\right]\right|\\
&\qquad=\sup_{\theta\in\Theta}\left|N^{-1}\sum\nolimits_{i=1}^{N}(J(Y_{i},\theta))_{rs}-(I(\theta))_{rs}\right|\overset{a.s.}{\longrightarrow}0.
\end{aligned}\tag{1}$$

The conditions that must be satisfied in order that (1) holds are: (a) $\Theta^{\circ}$ is a compact set; (b) $(J(\tilde{Y},\theta))_{rs}$ is a continuous function on $\Theta^{\circ}$ with probability 1; (c) for each $\theta\in \Theta^{\circ}$, $(J(\tilde{Y},\theta))_{rs}$ is dominated by a function $h(\tilde{Y})$, i.e. $|(J(\tilde{Y},\theta))_{rs}|<h(\tilde{Y})$ […]

Now for any $y_{i}\in\mathbb{R}^{n}$, $i\in I_{N}$ and $\theta'\in S\subseteq\Theta^{\circ}$, the following inequality obviously holds

$$\left|N^{-1}\sum\nolimits_{i=1}^{N}(J(y_{i},\theta'))_{rs}-(I(\theta'))_{rs}\right|\leq\sup_{\theta\in S}\left|N^{-1}\sum\nolimits_{i=1}^{N}(J(y_{i},\theta))_{rs}-(I(\theta))_{rs}\right|.\tag{2}$$
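The uniform statement and the pointwise bound can be checked numerically in a toy case. The sketch below again assumes the $N(0,\theta)$ variance model (not from the question), takes $S$ to be a grid on the compact interval $[1,3]$, and uses hypothetical function names. One assumption should be flagged loudly: the centring function `limit_fn` is the expectation of $J(Y,\theta)$ under the data-generating value $\theta_{0}$, which coincides with $I(\theta)$ at $\theta=\theta_{0}$.

```python
import random

def avg_observed_info(m2, theta):
    """N^{-1} J_N(theta) for N(0, theta); depends on data only via m2 = mean(y^2)."""
    return -0.5 / theta**2 + m2 / theta**3

def limit_fn(theta, theta0):
    """Assumed centring: E_{theta0}[J(Y, theta)], equal to I(theta0) at theta = theta0."""
    return -0.5 / theta**2 + theta0 / theta**3

def sup_deviation(m2, theta0, grid):
    """Largest absolute deviation from the centring function over the grid S."""
    return max(abs(avg_observed_info(m2, t) - limit_fn(t, theta0)) for t in grid)

theta0 = 2.0
grid = [1.0 + 0.01 * j for j in range(201)]  # grid approximating the compact set [1, 3]
theta_prime = grid[100]                      # one particular point of S (close to theta0)
rng = random.Random(0)
for n in (100, 10_000, 100_000):
    m2 = sum(rng.gauss(0.0, theta0 ** 0.5) ** 2 for _ in range(n)) / n
    point_dev = abs(avg_observed_info(m2, theta_prime) - limit_fn(theta_prime, theta0))
    sup_dev = sup_deviation(m2, theta0, grid)
    # point_dev never exceeds sup_dev, mirroring inequality (2)
    print(n, point_dev, sup_dev)
```

Because the supremum is taken over a finite grid, this only approximates the supremum over the whole interval; for this smooth model the difference is negligible.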

Suppose that $\{\hat{\theta}_{N}(Y)\}$ is a strongly consistent sequence of estimators for $\theta_{0}$, and let $\Theta_{N_{1}}=B_{\delta_{N_{1}}}(\theta_{0})\subseteq K\subseteq \Theta^{\circ}$ be an open ball in $\mathbb{R}^{k}$ with radius $\delta_{N_{1}}\rightarrow 0$ as $N_{1}\rightarrow\infty$, where $K$ is compact. By strong consistency, for each fixed $N_{1}$ we have, with probability 1, $\hat{\theta}_{N}(Y)\in \Theta_{N_{1}}$ for all sufficiently large $N$, i.e. $P[\lim_{N\rightarrow\infty}\{\hat{\theta}_{N}(Y)\in\Theta_{N_{1}}\}]=1$. Together with (2) this implies

$$\begin{aligned}
P\Biggl[\lim_{N\rightarrow\infty}\Biggl\{&\left|N^{-1}\sum\nolimits_{i=1}^{N}(J(Y_{i},\hat{\theta}_{N}(Y)))_{rs}-(I(\hat{\theta}_{N}(Y)))_{rs}\right|\\
&\leq\sup_{\theta\in\Theta_{N_{1}}}\left|N^{-1}\sum\nolimits_{i=1}^{N}(J(Y_{i},\theta))_{rs}-(I(\theta))_{rs}\right|\Biggr\}\Biggr]=1.
\end{aligned}\tag{3}$$

Now $\Theta_{N_{1}}\subseteq\Theta^{\circ}$ implies conditions (a)-(d) of Jennrich (1969, Theorem 2) apply to $\Theta_{N_{1}}$. Thus (1) and (3) imply

$$P\left[\lim_{N\rightarrow\infty}\left\{\left|N^{-1}\sum\nolimits_{i=1}^{N}(J(Y_{i},\hat{\theta}_{N}(Y)))_{rs}-(I(\hat{\theta}_{N}(Y)))_{rs}\right|=0\right\}\right]=1.\tag{4}$$

Since $(I(\hat{\theta}_{N}(Y)))_{rs}\overset{a.s.}{\longrightarrow}(I(\theta_{0}))_{rs}$ (by strong consistency, provided $I(\theta)$ is continuous at $\theta_{0}$), (4) implies that $N^{-1}(J_{N}(\hat{\theta}_{N}(Y)))_{rs}\overset{a.s.}{\longrightarrow}(I(\theta_{0}))_{rs}$. Note that (3) holds however small $\Theta_{N_{1}}$ is, and so the result in (4) does not depend on the choice of $N_{1}$, except that $N_{1}$ must be chosen such that $\Theta_{N_{1}}\subseteq \Theta^{\circ}$. This result holds for all $r,s=1,\ldots,k$, and so in terms of matrices we have $N^{-1}J_{N}(\hat{\theta}_{N}(Y))\overset{a.s.}{\longrightarrow}I(\theta_{0})$.
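As an illustration of the elementwise matrix statement, here is a small sketch with $k=2$. It assumes a model that is not part of the answer, $Y_i\sim N(\mu,v)$ with $\theta=(\mu,v)$, for which $I(\theta)=\mathrm{diag}(1/v,\,1/(2v^{2}))$ and the MLE is the sample mean together with the (biased) sample variance. All function names are hypothetical.

```python
import random

def avg_observed_info_matrix(ys, mu, v):
    """Elementwise N^{-1} J_N(theta) for the assumed N(mu, v) model, theta = (mu, v)."""
    n = len(ys)
    r1 = sum(y - mu for y in ys) / n         # mean residual
    r2 = sum((y - mu) ** 2 for y in ys) / n  # mean squared residual
    # Negative Hessian of the log-density, averaged over the sample
    return [[1.0 / v, r1 / v**2],
            [r1 / v**2, -0.5 / v**2 + r2 / v**3]]

def fisher_info_matrix(v):
    """Expected information for N(mu, v); it does not depend on mu."""
    return [[1.0 / v, 0.0], [0.0, 0.5 / v**2]]

def mle(ys):
    """Closed-form MLE (sample mean, biased sample variance)."""
    n = len(ys)
    mu_hat = sum(ys) / n
    v_hat = sum((y - mu_hat) ** 2 for y in ys) / n
    return mu_hat, v_hat

mu0, v0 = 1.0, 2.0
rng = random.Random(0)
ys = [rng.gauss(mu0, v0 ** 0.5) for _ in range(100_000)]
target = fisher_info_matrix(v0)
for n in (100, 10_000, 100_000):
    # Use the first n points of one sample path, as in an almost-sure statement
    mu_hat, v_hat = mle(ys[:n])
    est = avg_observed_info_matrix(ys[:n], mu_hat, v_hat)
    max_err = max(abs(est[r][s] - target[r][s]) for r in range(2) for s in range(2))
    print(n, max_err)
```

Working along a single growing sample path is a deliberate choice here, since the claim is about almost sure convergence rather than convergence in probability; a finite simulation can of course only suggest the limit.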