
Let $x\sim\mathcal{N}(\mu_x, \sigma_x^2)$ and $y\sim\mathcal{N}(\mu_y, \sigma_y^2)$ be two real random variables, normally distributed as above.

Let also $Q$ be the following non-negative quantity: $$ Q = \left\lvert\frac{1}{n}\sum_{i=1}^{n}y_i - \frac{1}{n}\sum_{i=1}^{n}x_i\right\rvert, $$ where $x_i$'s, $y_i$'s are drawn from the distributions $\mathcal{N}(\mu_x, \sigma_x^2)$ and $\mathcal{N}(\mu_y, \sigma_y^2)$, respectively.

If $n\to\infty$, then $Q\to\lvert\mu_y-\mu_x\rvert$. Is there an error bound for this estimate? To put it better: can we have a rule for the sample size $n$ so that the above estimate of the distance between the means is within some given accuracy $\epsilon$?

More interestingly, if $\mathbf{x}\sim\mathcal{N}(\mu_x, \Sigma_x)$ and $\mathbf{y}\sim\mathcal{N}(\mu_y, \Sigma_y)$ are multivariate ($d$-dimensional) normal vectors with given means and covariance matrices, and $\mathbf{x}_i$, $\mathbf{y}_i$ are drawn from them, respectively, what would be the error in estimating the norm $\lVert\mu_y-\mu_x\rVert$ by the quantity $$ \left\lVert\frac{1}{n}\sum_{i=1}^{n}\mathbf{y}_i - \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\right\rVert? $$

Again, what I'm looking for is a relation between the sample size $n$ and the dimensionality $d$ such that the estimate achieves some given accuracy (say, $\epsilon$).

It seems reasonable to me that, as the dimensionality of the variables $\mathbf{x}$ and $\mathbf{y}$ grows, the sample size $n$ needed to achieve a "good" (given-error) estimate of the above quantity grows, too. The question is: how fast? I would guess the relation is not merely polynomial, but perhaps exponential. But this is only a guess.

The above might be trivial for statisticians, but I'm not even sure where to look. Could you offer some guidance?

1 Answer


Comments:

You could start by using Chebyshev's Inequality to get a bound on $|\bar X_n - \mu_x|$ that shows $\bar X_n \stackrel{prob}{\rightarrow} \mu_x.$ Similarly for the $Y_i,$ and so on.
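Concretely (a sketch, writing $\delta$ for the allowed failure probability): since $\bar X_n$ and $\bar Y_n$ are independent, $$\operatorname{Var}(\bar Y_n - \bar X_n) = \frac{\sigma_x^2 + \sigma_y^2}{n},$$ so Chebyshev's Inequality gives $$P\left(\left\lvert(\bar Y_n - \bar X_n) - (\mu_y - \mu_x)\right\rvert \ge \epsilon\right) \le \frac{\sigma_x^2 + \sigma_y^2}{n\epsilon^2}.$$ By the reverse triangle inequality, $\bigl\lvert Q - \lvert\mu_y - \mu_x\rvert\bigr\rvert \le \left\lvert(\bar Y_n - \bar X_n) - (\mu_y - \mu_x)\right\rvert$, so the same bound holds for $Q$. Making the right-hand side at most $\delta$ yields the (loose) sample-size rule $$n \ge \frac{\sigma_x^2 + \sigma_y^2}{\delta\,\epsilon^2}.$$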

However, in practice, it might be more useful to find the normal distribution of $\bar Y_n - \bar X_n$, and observe that its variance decreases to $0$ with increasing $n.$ For specified numerical values of the parameters, you could get the exact distribution of $Q,$ based on this normal distribution. (The Chebyshev bounds are good enough to show convergence in probability, but generally too loose to be useful in practice. Chebyshev's Inequality works for all distributions with a finite variance, so the bounds tend not to be very tight for any one distribution.)
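Since $\bar Y_n - \bar X_n \sim \mathcal N\bigl(\mu_y - \mu_x,\ (\sigma_x^2+\sigma_y^2)/n\bigr)$ exactly, the probability that the estimate lands within $\epsilon$ of the true mean difference can be computed from the normal CDF, and one can search for the smallest adequate $n$. A short Python sketch (the function names are mine, not from any library):

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def coverage_prob(n, sigma_x2, sigma_y2, eps):
    """P(|(Ybar - Xbar) - (mu_y - mu_x)| <= eps) for sample size n.

    Ybar - Xbar is exactly N(mu_y - mu_x, (sigma_x2 + sigma_y2)/n),
    so this probability depends only on n, the variances, and eps.
    """
    s = math.sqrt((sigma_x2 + sigma_y2) / n)
    return normal_cdf(eps / s) - normal_cdf(-eps / s)

def min_n(sigma_x2, sigma_y2, eps, delta):
    """Smallest n with coverage probability >= 1 - delta
    (doubling to bracket, then bisecting)."""
    n = 1
    while coverage_prob(n, sigma_x2, sigma_y2, eps) < 1 - delta:
        n *= 2
    lo, hi = n // 2, n
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if coverage_prob(mid, sigma_x2, sigma_y2, eps) >= 1 - delta:
            hi = mid
        else:
            lo = mid
    return hi

# Example: sigma_x^2 = sigma_y^2 = 1, eps = 0.1, delta = 0.05.
# This matches the Gaussian rule n >= (sigma_x^2 + sigma_y^2) * (z_{0.975}/eps)^2.
print(min_n(1.0, 1.0, 0.1, 0.05))  # prints 769
```

Note how much smaller this is than the Chebyshev requirement $n \ge (\sigma_x^2+\sigma_y^2)/(\delta\epsilon^2) = 4000$ for the same parameters, which illustrates how loose the distribution-free bound is.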

If you look at the development of 'Hotelling's $T^2$ distribution' you might find approaches useful for the multivariate version of your question.
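As a toy check of the dimensional scaling (a simulation sketch under the simplifying assumption $\Sigma_x = \Sigma_y = I_d$; the function name is mine): each coordinate of the centred difference of sample means is $\mathcal N(0, 2/n)$, so $E\,\lVert(\bar{\mathbf y}_n - \bar{\mathbf x}_n) - (\mu_y-\mu_x)\rVert^2 = 2d/n$. That suggests $n$ need only grow linearly in $d$ to hold the mean-squared error fixed, at least for this plug-in estimate:

```python
import math
import random

random.seed(0)

def rms_norm_error(d, n, trials=200):
    """Monte Carlo RMS of ||(Ybar - Xbar) - (mu_y - mu_x)|| when
    Sigma_x = Sigma_y = I_d: each coordinate of the centred difference
    is N(0, 2/n), independently across coordinates.
    Theory predicts RMS = sqrt(2 * d / n)."""
    s = math.sqrt(2.0 / n)
    total = 0.0
    for _ in range(trials):
        total += sum(random.gauss(0.0, s) ** 2 for _ in range(d))
    return math.sqrt(total / trials)

for d in (1, 10, 100):
    est = rms_norm_error(d, n=1000)
    theory = math.sqrt(2 * d / 1000)
    print(d, round(est, 3), round(theory, 3))
```

The simulated RMS tracks $\sqrt{2d/n}$, i.e. the error grows like $\sqrt{d}$ for fixed $n$, rather than exponentially in $d$.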

  • Thank you very much for taking the time to answer my question! So it seems that there is no general method for working on such problems. Actually, I used this simple quantity ($Q$) just because it makes the problem quite clear (I think). In fact, I'm interested in quantities of the form $\max(0,\mathbf{a}^\top\mathbf{x}+b)+\max(0,\mathbf{a}^\top\mathbf{y})$, which I would like to estimate using $n$ "observations" of each random vector ($\mathbf{x}$ and $\mathbf{y}$) by taking the means. Do you think I should try to find how this quantity is distributed? (2017-01-12)
  • Too often users on this site think it appropriate to suggest our [companion site](http://stats.stackexchange.com) for any but the most trivial or mathematical of statistics questions. But _your question_ has the kind of 'big-data' and 'multivariate' flavor that makes me think you might get a more comprehensive answer there. (2017-01-12)
  • Thank you very much for your time and your comments anyway! I'll do as you suggest (I like CVSE, it's a good community, but often less responsive than MathSE). I'll also try your suggestions above; actually, I would like to have a (toy) example showing that when the dimensionality $d$ increases, one would need (maybe exponentially?) more samples (maybe $n=a\exp(bd)$?) to obtain a good estimate (well, $\exp$ is just a guess). Thanks again for your help; it's much appreciated :) (2017-01-12)