4
$\begingroup$

I was looking up how standard deviations work, and tried to do it by hand, but what I'm getting isn't matching up with what I have in SQL Server.

My population is: 1,2,3,4,5,6,7,8,9. I end up with 2.58 using the method described on wikipedia. However, SQL Server AND Microsoft Excel come up with 2.73.

The final part of my equation looks like this: sqrt(60/9)

So, am I messing something up? If so, where?

  • 0
    You have spotted an important thing and should watch the videos explaining the need for Bessel correction, https://www.khanacademy.org/math/probability/descriptive-statistics/variance_std_deviation/v/review-and-intuition-why-we-divide-by-n-1-for-the-unbiased-sample-variance. Also, note that there are two std_dev functions in Excel: `stdev.p` and `stdev.s`. The latter involves the Bessel correction, which I think you should do if you want to estimate the population variance using the sample variance.2013-11-20

1 Answers 1

6

If you look closely at the Wikipedia article that you quoted, you will see that you computed either what they call the "population standard deviation" or what they call the "standard deviation of the sample," and not what they call the "sample standard deviation!" To find the sample standard deviation, we need to divide $60$ by $8$, not $9$, and then take the square root. With your numbers, that turns out to be about $2.7386$.

The terminological distinction that Wikipedia makes between "standard deviation of the sample" and "sample standard deviation" is not universal. But let's deal first with population standard deviation. If $X$ is a random variable with mean $\mu$, then the variance of $X$ is equal to $E(X-\mu)^2.$ In particular, if the random variable $X$ has uniform distribution on the set $\{1,2,\dots,9\}$, then the variance of $X$ is indeed the $60/9$ that you calculated.

With samples, the situation is more complicated. There are two commonly used estimators of the population variance. If we have a sample $X_1,X_2, \dots, X_n$ of size $n$, then one standard estimator for the population variance is $\frac{1}{n}\sum_{i=1}^n (X_i-\overline{X})^2$ and the other is $\frac{1}{n-1}\sum_{i=1}^n (X_i-\overline{X})^2,$ where in each case $\overline{X}$ is the sample mean $\frac{1}{n}\sum_{i=1}^n X_i$. There are technical reasons in favour of each. The second one is an unbiased estimator of the population variance, when that exists, and is more commonly preferred. (The situation is more complicated than that. Although the first estimator is on average wrong, it is on average, in the mean-square sense, closer to the truth! And in general, the square root of neither of our expressions is an unbiased estimator of the population standard deviation.)

When $n$ is large, there is not much difference between the two estimators, so we don't need to pay attention to what version is being used. But for $n=9$, it does make a significant difference, and we need to be aware of what formula a piece of software uses.

  • 0
    @Michael Hardy: It sounds like something one would make up to be funny. (And I would have invented it, if I had a sense of humor.)2011-10-12