3

I am trying to write a simple program that takes an arbitrary data set of $[x,y]$ pairs from a given file, analyzes it, and prints any interesting statistical characteristics.

One of the things I am interested in is printing a statistical description of the data based on things like statistical correlation. But my problem is that the program is given no information about the probability distribution from which the sample was taken, so quantities such as $Cov(X,Y)$ seem to evade me, since the formula:

$$Cov(X,Y)=\langle XY\rangle - \mu_x\mu_y$$

requires that I be able to calculate the expectation of $XY$, which in turn requires that I know the probability density function of the source. So what can I do to obtain $Cov(X,Y)$ when I can only calculate $mean(x)$, $mean(y)$, $var(x)$, and $var(y)$?

Eventually, I am interested in saying something about the correlation between $X$ and $Y$.

4 Answers

2

Assuming zero means, and applying Cauchy-Schwarz: $$ |Cov(X,Y)|=|E(XY)| \le \sqrt{E(X^2) E(Y^2)} = \sqrt{Var(X) Var(Y)}$$ The same result can be obtained for non-zero means, and this bound is all you can get from the marginal (mean and variance) information.

Then the extremes (in absolute value) of the covariance are realized for $X,Y$ independent, $Cov(X,Y)=0$, and for $X=Y$, $Cov(X,Y) = Var(X)=Var(Y)$.
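As a quick numeric sanity check of this bound (a sketch using NumPy on synthetic data, not data from the question):

```python
import numpy as np

# Synthetic correlated data: y depends linearly on x plus noise (assumed example)
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(scale=0.5, size=1000)

cov_xy = np.cov(x, y)[0, 1]                             # sample covariance
bound = np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))  # sqrt(Var(X) Var(Y))

# Cauchy-Schwarz: |Cov(X,Y)| can never exceed the geometric mean of the variances
assert abs(cov_xy) <= bound
```

The same inequality holds for the sample quantities themselves, so the assertion passes for any data set, not just this one.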

  • 0
    The extremes of the covariance are not the variance and zero. 2011-09-21
  • 1
    Why not? The extremes (in absolute value) are zero and the geometric mean of the variances - which in the particular case of $X=Y$ is the (common) variance. The general realization of the extreme would be $Y=aX$. My point, of course, is that knowing the marginals you can't know the covariance, but at least you can bound it (you know a little something). 2011-09-21
  • 0
    If you want to describe the extremes in absolute value, you should write *the extremes in absolute value* and not *the extremes*... 2011-09-21
  • 0
    Ah, you seem to have read "The extremes (of the covariance)" when I meant "The extremes (of the above equation)". Clarified. 2011-09-21
  • 2
    To add to Didier Piau's comment, the Cauchy-Schwarz inequality actually gives $-\sigma_X\sigma_Y \leq {\textit cov}(X,Y) \leq \sigma_X\sigma_Y$, where the upper bound is satisfied with equality if $X = Y$ and the lower bound is satisfied with equality if $X = -Y$. 2011-09-21
  • 0
    Improved statement: the Cauchy-Schwarz inequality gives $-\sigma_X\sigma_Y \leq {\textit cov}(X,Y) \leq \sigma_X\sigma_Y$, and if $Y = aX + b$ where $a$ and $b$ are real numbers, then the upper bound (respectively lower bound) is satisfied with equality if $a > 0$ (respectively $a < 0$). 2011-09-21
7

So what can I do to obtain [the covariance of $X$ and $Y$] when I can only calculate [their means and variances]? Nothing, I am afraid.

As an example, consider a standard normal variable $X$. If $Y=X$, then both means are zero, both variances are $1$, and the covariance of $X$ and $Y$ is $+1$. If $Y=-X$, then both means are zero, both variances are $1$, and the covariance of $X$ and $Y$ is $-1$. This shows you must know something more than the means and variances to get the covariance.
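This counterexample is easy to reproduce numerically (a sketch with NumPy; the sample size is my own choice):

```python
import numpy as np

# One sample from a standard normal; two choices of Y with identical marginals
rng = np.random.default_rng(1)
x = rng.normal(size=10_000)

cov_pos = np.cov(x, x)[0, 1]    # Y = X:  covariance close to +1
cov_neg = np.cov(x, -x)[0, 1]   # Y = -X: covariance close to -1

# Means and variances of Y agree in both cases, yet the covariances differ in sign
```

Both choices of $Y$ have the same sample mean and variance, but `cov_pos` and `cov_neg` are (numerically exact) negatives of each other.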

  • 0
    So in my situation, could there be any way to obtain an approximation of $Cov(X,Y)$, or at least something I can do to approximate the source distribution? Or is there some distribution that, when used, might validly approximate an unknown distribution, so that I can then approximate $E(XY)$ and thus obtain $Cov(X,Y)$? Thanks. 2011-09-21
  • 1
    Well, if what you have is a set of $n$ points $(x_k,y_k)$, you can use the empirical mean of $XY$, that is, replace $E(XY)$ by $\frac1n\sum\limits_{k=1}^nx_ky_k$. 2011-09-21
4

If you can calculate ${\textit mean}(x)$, which I assume is the sample mean $$ {\textit mean}(x) = \frac{1}{n}\sum_{i=1}^n x_i $$ of your data set (as opposed to the expectation $\mu_x$, which requires knowledge of the probability distribution), and similarly the sample variance $$ {\textit var}(x) = \frac{1}{n-1}\sum_{i=1}^n (x_i - {\textit mean}(x))^2, $$ then you should be able to calculate a sample covariance for your samples as well, using something like $$ {\textit cov}(x,y) = \frac{1}{n-1}\sum_{i=1}^n (x_i - {\textit mean}(x))(y_i - {\textit mean}(y)). $$ Sample means, sample variances, and sample covariances are (unbiased) estimators of the means, variances, and covariances of the underlying probability distribution that "generated" the sample pairs $(x_i, y_i),\ i = 1, 2, \ldots, n,$ in your data set.
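A minimal plain-Python version of these three estimators (the function names `mean`, `var`, and `cov` are my own, not from the answer):

```python
def mean(xs):
    """Sample mean: (1/n) * sum of the observations."""
    return sum(xs) / len(xs)

def var(xs):
    """Unbiased sample variance, with the n-1 denominator."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cov(xs, ys):
    """Unbiased sample covariance of paired observations."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
```

For example, `cov([1, 2, 3], [2, 4, 6])` returns `2.0`, and `cov(xs, xs)` always agrees with `var(xs)`, mirroring $Cov(X,X) = Var(X)$.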

  • 0
    Yep! This is helpful for my situation :D Thanks. 2011-09-21
3

I don't think there is a way if you have just the means and the variances. But if you have the individual observations then you can estimate the covariance by the sample covariance $$\frac{1}{N-1}\sum_{i=1}^N (x_i-\bar x)(y_i-\bar y)$$ where $N$ is the number of observations, $(x_i,y_i)$ are the observations and $\bar x$ and $\bar y$ are the sample means of $X$ and $Y$ respectively. You will find this covered in any elementary statistics book.
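Since the question is ultimately about correlation: once you have the sample covariance, the sample Pearson correlation follows directly by dividing by the two sample standard deviations. A sketch (the helper name `corr` is mine):

```python
import math

def corr(xs, ys):
    """Sample Pearson correlation r = cov(x, y) / (s_x * s_y), always in [-1, 1]."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return cov / (sx - 0 if False else sx * sy) if False else cov / (sx * sy)
```

For instance, `corr([1, 2, 3], [2, 4, 6])` returns `1.0` (a perfect positive linear relationship), and negating one series flips the sign of `r`.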