2

I can see that using the law of large numbers and perhaps mild conditions on the likelihood function, one can show that the empirical Fisher information matrix converges uniformly to the true Fisher information matrix. I would like to know how fast this uniform convergence is in terms of the number of observations.

It seems that there hasn't been much study of this problem and the literature is scarce. I tried to apply some results from statistical learning theory, such as analyses based on Rademacher complexity, but those arguments seem to yield only large-deviation bounds for the empirical process (i.e., the average negative log-likelihood function) and cannot be extended to its Hessian (i.e., the empirical Fisher information matrix).

I am not merely looking for the asymptotic behavior described by the CLT; rather, I would like large-deviation-type inequalities for the Fisher information matrix. In particular, I would like to know whether, and how fast, the probability that the deviation between the spectrum of the empirical information matrix and that of the true information matrix exceeds some $\epsilon>0$ goes to zero as a function of the sample size. I am interested in concentration-inequality-type results.
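
To make this concrete, here is a rough Monte Carlo sketch of the quantity I have in mind, using a toy Gaussian model with unknown mean and scale (the model, the tolerance $\epsilon$, and the sample sizes are arbitrary choices for illustration, and NumPy is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (illustration only): X ~ N(mu, sigma^2) with parameter theta = (mu, sigma).
mu, sigma = 0.0, 1.0
# True Fisher information matrix for (mu, sigma).
I_true = np.array([[1.0 / sigma**2, 0.0],
                   [0.0,            2.0 / sigma**2]])

def empirical_info(x, mu, sigma):
    """Empirical Fisher information: minus the averaged Hessian of log f(x | mu, sigma)."""
    d = x - mu
    h11 = -np.ones_like(d) / sigma**2
    h12 = -2.0 * d / sigma**3
    h22 = 1.0 / sigma**2 - 3.0 * d**2 / sigma**4
    return -np.array([[h11.mean(), h12.mean()],
                      [h12.mean(), h22.mean()]])

eps, reps = 0.05, 2000
lam_true = np.linalg.eigvalsh(I_true)
for n in (100, 1000, 10000):
    exceed = 0
    for _ in range(reps):
        x = rng.normal(mu, sigma, size=n)
        lam_emp = np.linalg.eigvalsh(empirical_info(x, mu, sigma))
        exceed += np.max(np.abs(lam_emp - lam_true)) > eps
    print(f"n = {n:>6}:  estimated P(spectral deviation > {eps}) = {exceed / reps:.3f}")
```

The question is essentially about how fast the probabilities printed by such a simulation decay with $n$, and whether a non-asymptotic bound for them is available.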

EDIT: The paragraph above was added after the answer posted by -did, because I didn't have permission to add comments at that time and I wasn't aware of the editing rules. I accepted the answer by -did, but I think it addresses my problem only partially.

  • 0
    I don't think that the literature is scarce. The Fisher information matrix is simply related to the variances of the observations. The law of large numbers is straightforward to apply with some additional math. (2012-07-14)
  • 0
    @Seyhmus: CLT. (2012-07-14)
  • 1
    A whole new paragraph was added to the question without any kind of warning or commenting, which drastically modifies the subject. Not wishing to condone such practices, I am out. (2012-07-14)
  • 0
    I wanted to post the changes as a comment, but as an unregistered user of this site I wasn't able to do so, and even after I registered I couldn't because I didn't have enough reputation points. I didn't know any other way than editing the question. I mentioned that the edit was to clarify the question, but it seems that it doesn't show up. (2012-07-14)
  • 0
    Got something from my answer below? (2012-07-25)
  • 0
    It didn't exactly answer what I needed. Surely, it describes the asymptotic behavior of the information matrix, but I'd like to know if and at what *rate* the empirical information matrix converges to the expected information matrix *in probability*. (2012-07-26)
  • 0
    @did: My issue is that your answers address the questions I asked only partially. For example, for this question I wanted to know how fast the empirical Fisher information matrix converges to the actual Fisher information matrix as the number of samples grows. In your answer, you only address convergence via the CLT, not the rate of convergence. As a relatively new user, I'm not familiar with every rule of the website, so let me know if I'm supposed to accept partial answers. (2012-09-20)
  • 1
    S.B. All this has already been said. (1.) You modified the question after an answer was posted. (2.) The CLT does provide a rate of convergence. (3.) If indeed you are not *familiar with the rules*, you could just check them. (2012-09-20)

1 Answer

4

You might have got the impression that the literature is scarce simply because this is a direct consequence of the central limit theorem.

To see this while keeping things simple, let us first examine the situation where one observes $n$ i.i.d. Bernoulli trials, each with probability of success $\theta$. Then the empirical Fisher information is $I_n(\theta\mid X_n)=\dfrac{X_n}{\theta^2}+\dfrac{n-X_n}{(1-\theta)^2}$, where $X_n$ denotes the number of successes. Hence, a first remark is that $I_n(\theta\mid X_n)$ diverges when $n\to\infty$ and that, to observe non-degenerate asymptotics, one should consider $$ J_n(\theta\mid X_n)=\dfrac1n\cdot I_n(\theta\mid X_n). $$

To wit, recall that the central limit theorem asserts that $X_n=n\theta+\sqrt{n\theta(1-\theta)}\cdot Z_n$, where, when $n\to\infty$, $Z_n$ converges in distribution to a standard normal random variable. Using this in the expression of $J_n(\theta\mid X_n)$, one gets $$ J_n(\theta\mid X_n)=I(\theta)+\frac{K(\theta)}{\sqrt{n}}\cdot Z_n, $$ where $I(\theta)$ is the Fisher information $I(\theta)=\dfrac1{\theta(1-\theta)}$ and $K(\theta)=\dfrac{1-2\theta}{(\theta(1-\theta))^{3/2}}$.

To sum up, in the Bernoulli case, $ \dfrac{I_n(\theta\mid X_n)-n\cdot I(\theta)}{\sqrt{n}} $ converges in distribution to a centered normal random variable with variance $K(\theta)^2=\dfrac{(1-2\theta)^2}{\theta^3(1-\theta)^3}$.

The result above is quite general, provided one replaces $I(\theta)$ by the Fisher information of the distribution considered and $K(\theta)^2$ by the relevant variance. When the distribution of the sample has density $f(\ \mid\theta)$, one gets $I(\theta)=\mathrm E(g(X_1\mid\theta))$ and $K(\theta)^2=\mathrm{Var}(g(X_1\mid\theta))$, where $$ g(x\mid\theta)=-\frac{\partial^2}{\partial\theta^2}\log f(x\mid\theta). $$ (In the Bernoulli case, $f(x\mid\theta)=\theta^x(1-\theta)^{1-x}$ hence $g(x\mid\theta)=\dfrac{x}{\theta^2}+\dfrac{1-x}{(1-\theta)^2}$ and one recovers the formulas given above.)
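
To illustrate the general formulas, one can estimate $I(\theta)$ and $K(\theta)^2$ by the empirical mean and variance of $g(X_i\mid\theta)$; the sketch below uses a Poisson model, chosen here purely as an example (again assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)

# Example: X ~ Poisson(theta), so log f(x | theta) = -theta + x*log(theta) - log(x!)
# and g(x | theta) = -d^2/dtheta^2 log f(x | theta) = x / theta^2.
theta, m = 2.5, 200000
X = rng.poisson(theta, size=m)
g = X / theta**2

print("I(theta)   ~", g.mean(), "(exact: 1/theta   =", 1 / theta, ")")
print("K(theta)^2 ~", g.var(),  "(exact: 1/theta^3 =", 1 / theta**3, ")")
```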

  • 0
    Thanks for your answer. As I expressed in the edited question, I'd prefer to know the non-asymptotic behavior of the empirical information matrix. (2012-07-14)
  • 0
    Good for you. As I expressed in my comment, I do not find your *modus operandi* acceptable. (2012-07-14)