
So I have a data set $(x_{1},y_{1}), (x_{2},y_{2}),\dots,(x_{n},y_{n})$ and from it I have the values of $\sum x$, $\sum x^{2}$, $\sum y$, $\sum y^{2}$, $\sum xy$.

My question is, how do I find a normal distribution that best fits this data set and how do I use these values to calculate the standard deviation for the normal distribution?

Basically, given a data set, how do I find the values of the mean and standard deviation for the normal distribution of best fit? Are they the same as the mean of the data set?

  • 0
    It seems that you have bivariate data (i.e. data for a scatterplot). Do you want to approximate the $x_i$ or the $y_i$, or do you want a bivariate normal distribution? 2012-12-22
  • 0
    @Hans Engler: Yes, this is exactly what I want, a normal distribution for which $(n,f(n))=(x_{n},y_{n})$. How is this accomplished? 2012-12-23
  • 3
    I asked an "either or" question and you responded "yes", so I am confused now. Which data are in your opinion approximately normally distributed? 2012-12-23
  • 0
    As @Hans said. Also, your equation $(n,f(n))=(x_n,y_n)$ is confusing: does that mean that $x_n=n$? In that case, why did you introduce the $x_n$ in the first place? Also, this equation appears to imply that you're considering a one-dimensional function. In that case, I wonder whether you're actually simply trying to fit a normalized Gaussian to a set of data points and the whole distribution aspect is just a distraction. In any case, what do you mean by "the mean of the data set"? 2012-12-23
  • 0
    @Hans Engler: Oops, sorry. I did mean a bivariate normal distribution. Unfortunately, I am not sure how to use this to produce the desired result. 2012-12-23
  • 0
    @joriki: Yes, I am just trying to "fit a normalized Gaussian to a set of data points" and I thought that a distribution was the way to do this. I guess I was wrong. But, just to clarify, my question remains: "Given a data set, how do I find the values of the parameters (such as mean and standard deviation) in the equation for a normal distribution?" 2012-12-23
  • 0
    @Hans Engler: I did some searching and found this, which might answer the question ([http://www.math.uri.edu/~pakula/452webs8/regress.pdf](http://www.math.uri.edu/~pakula/452webs8/regress.pdf)). I am not sure exactly how to apply it, though. 2012-12-26

2 Answers

1

You also need $\sum x y$; otherwise you would exclude all the normal distributions in which $X$ and $Y$ are dependent.

The normal distribution that best fits the data is obtained by maximum likelihood estimation. It is the one whose mean and covariance matrix equal the empirical mean and empirical covariance matrix computed from your sums (normalized by $n$).
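As a rough illustration of that recipe, here is a minimal Python sketch that turns the five sums from the question into a mean vector and a covariance matrix. The function name `fit_bivariate_normal`, the `ddof` parameter, and the toy data are assumptions made up for this example; `ddof=0` divides by $n$ (the maximum likelihood estimate), while `ddof=1` divides by $n-1$ (the usual unbiased sample estimate).

```python
from math import sqrt


def fit_bivariate_normal(n, sum_x, sum_x2, sum_y, sum_y2, sum_xy, ddof=0):
    """Estimate the mean vector and covariance matrix from raw sums.

    ddof=0 divides by n (maximum likelihood estimate);
    ddof=1 divides by n - 1 (unbiased sample estimate).
    """
    if n - ddof <= 0:
        raise ValueError("need more than ddof observations")
    mean_x = sum_x / n
    mean_y = sum_y / n
    # Centered sums, e.g. sum((x - mean_x)**2) = sum(x**2) - n * mean_x**2.
    sxx = sum_x2 - n * mean_x ** 2
    syy = sum_y2 - n * mean_y ** 2
    sxy = sum_xy - n * mean_x * mean_y
    denom = n - ddof
    cov = [[sxx / denom, sxy / denom],
           [sxy / denom, syy / denom]]
    return (mean_x, mean_y), cov


# Toy data, invented purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]
n = len(xs)
mean, cov = fit_bivariate_normal(
    n,
    sum(xs), sum(x * x for x in xs),
    sum(ys), sum(y * y for y in ys),
    sum(x * y for x, y in zip(xs, ys)),
)
# Standard deviations are the square roots of the diagonal entries.
sd_x, sd_y = sqrt(cov[0][0]), sqrt(cov[1][1])
print(mean)  # (2.5, 5.0)
print(cov)   # [[1.25, 2.75], [2.75, 6.5]]
```

One caveat on the design: subtracting $n\bar{x}^2$ from $\sum x^2$ can lose precision in floating point when the mean is large relative to the spread, so if the raw data are available a two-pass computation on centered values is usually safer.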

  • 0
    Thanks, I will try this approach. Actually, I am just concerned with the distributions where $y$ depends on $x$, that is, where the points have the form $(n,f(n))$. 2012-12-23
  • 0
    @RickyT Even if you are interested in the conditional distribution of $Y$ given $X$, not considering $\sum x y$ is akin to setting the covariance to zero, which gives a regression coefficient of 0. 2012-12-24
  • 0
    @Learner: I edited the question, perhaps I should have phrased it better. Same question, what if we did consider $\sum xy$? 2012-12-26
  • 0
    @RickyT Do you know matrix algebra? 2012-12-27
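
To make the remark about the regression coefficient concrete, here is the standard conditional-distribution result for a bivariate normal, written with $\sigma_{XY}$ for the covariance and $\beta$ for the slope (both symbols introduced here for illustration):

$$ Y \mid X = x \;\sim\; N\!\left(\mu_Y + \beta\,(x - \mu_X),\;\; \sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2}\right), \qquad \beta = \frac{\sigma_{XY}}{\sigma_X^2} $$

Plugging in the estimates built from the sums gives

$$ \hat{\beta} = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2}, $$

which is the usual least-squares slope. If the covariance is taken to be zero, then $\beta = 0$ and the conditional distribution of $Y$ no longer depends on $x$.
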
1

You have the sufficient statistics for $\mu_X, \mu_Y, \sigma^2_X$ and $\sigma^2_Y$, so you can calculate their estimates directly using

$$ \bar{x} = \frac{1}{n}\sum_{i = 1}^n x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i = 1}^n y_i $$

for the sample means and

$$ \begin{aligned} s^2_x &= \frac{1}{n-1} \sum_{i=1}^n\left(x_i - \bar{x} \right)^2 = \frac{\sum_{i=1}^n x_i^2}{n-1} - \frac{n\bar{x}^2}{n-1} \\ s^2_y &= \frac{1}{n-1} \sum_{i=1}^n\left(y_i - \bar{y} \right)^2 = \frac{\sum_{i=1}^n y_i^2}{n-1} - \frac{n\bar{y}^2}{n-1} \end{aligned} $$

for the sample variances. As others have mentioned, without $\sum{xy}$ you will not be able to estimate the covariance between $X$ and $Y$, which the regression tag in your question suggests you want.

  • 0
    Thanks a lot for this, it seems to be along the lines of what I am looking for. How would it change if we knew the values of $\sum xy$? That is, if we were given $\sum xy$, would we be able to produce the desired result? What exactly do the sample means represent in this situation? 2012-12-26
  • 0
    If you have $\sum xy$ you can estimate the covariance between $X$ and $Y$ like this: $\tfrac{1}{n - 1} \left( \sum xy - \bar{y} \sum x - \bar{x} \sum y + n\bar{x}\bar{y} \right) = \tfrac{1}{n - 1} \left( \sum xy - n\bar{x}\bar{y} \right)$. The sample means are the center of your estimated two-dimensional normal distribution. 2012-12-30