
I'm looking for a formula to iteratively calculate the mean and standard deviation of a huge list of data points.

I found some examples here (formula 15 f.) and here, but both seem to fail for my very simple test case [10,100].

Source 1 states:

$M_1 = x_1$

$S_1 = 0$

$M_k = M_{k-1}+(x_k-M_{k-1})/k$

as well as

$S_k = S_{k-1}+(x_k-M_{k-1})*(x_k-M_k)$

with

$\sigma = \sqrt{S_n/(n-1)}$

This leads me to $M_1 = 10, S_1 = 0$ and $M_2 = 10+(100-10)/2 = 55$, but $S_2 = 0+(100-10)*(100-55) = 4050$ and therefore, with $n=2$, to $\sigma \approx 63.6396$. The value I expect is $45$, which I get when I plug $n = 3$ into the formula for $\sigma$.

Do I understand the formula wrong?

Source 2:

$M_{n+1}=M_n+x_{n+1}$

$S_{n+1}=S_n+\frac{(n*x_{n+1}-M_n)^2}{n(n+1)}$

with the mean given by

$\bar{x}_n= \frac{M_n}{n}$

and the unbiased estimate of the variance is given by

$\sigma_n^2=\frac{S_n}{n+1}$

which leads me to

$M_1 = 10, M_2 = 110, S_1 = 0$

$S_2 = 0+\frac{(2*100-10)^{2}}{2(2+1)} = 6016.6667$

However, if I plug in $n=1$, the result is again correct. I feel that my understanding of the indices is wrong, but why?
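A quick check of the indexing for Source 2, under the assumption that $n$ in the recursion counts the points already processed (so that the step from one point to two points uses $n=1$, with $M_1$ being the running sum rather than the mean):

$M_2 = M_1 + x_2 = 10 + 100 = 110$

$S_2 = S_1+\frac{(1 \cdot x_2-M_1)^2}{1 \cdot (1+1)} = 0+\frac{(100-10)^2}{2} = 4050$

This agrees with the value $S_2 = 4050$ obtained from Source 1, which suggests that the discrepancy comes from which $n$ is plugged into the update step, not from the recursion itself.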

  • 1
    For the first one, I think you are just using a different estimator for $\sigma.$ The formula given computes $\sqrt{\frac{1}{n-1} \sum_i (X_i-\bar X )^2}$ (which is the usual definition of sample standard deviation) and you seem to be comparing it to $\sqrt{\frac{1}{n} \sum_i (X_i-\bar X)^2}$, another popular choice. 2017-02-17
  • 0
    @Dschoni index notation?? 2017-02-19

2 Answers

You already noticed that using $n+1$ (in your example: 3) in the first formula gives the answer you expect, while using $n$ (in your example: 2) does not.

We can write this differently: the recursions for $M_n$ and $S_n$ are completely correct, but rather than computing the variance as $S_n/(n-1)$ you want to compute the variance as $S_n/n$.

This is a well known phenomenon:

$S_n/n$ is indeed the variance of the data points you have and hence the answer to your question.

$S_n/(n-1)$ is the unbiased estimate of the (unknown) true variance of the larger underlying population from which your data were drawn as a random sample.

This is quite counter-intuitive, but luckily there is a Wikipedia page dedicated to the issue: https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation
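To make the difference between the two divisors concrete, here is a minimal Python sketch of the recursion for $M_k$ and $S_k$ quoted in the question, run on the test case [10,100]. The function name `running_mean_and_s` and the variable names are illustrative assumptions, not taken from either source.

```python
import math


def running_mean_and_s(data):
    """Return (n, M_n, S_n) using the recursion quoted in the question."""
    n = 0
    mean = 0.0
    s = 0.0
    for x in data:
        n += 1
        delta = x - mean         # x_k - M_{k-1}
        mean += delta / n        # M_k = M_{k-1} + (x_k - M_{k-1}) / k
        s += delta * (x - mean)  # S_k = S_{k-1} + (x_k - M_{k-1}) * (x_k - M_k)
    return n, mean, s


n, mean, s = running_mean_and_s([10, 100])
population_sd = math.sqrt(s / n)    # divides by n: spread of the data you have
sample_sd = math.sqrt(s / (n - 1))  # divides by n - 1: estimate for the population
print(mean, s, population_sd, sample_sd)
```

For this input the sketch gives a mean of 55 and $S_2 = 4050$; dividing by $n$ yields 45, while dividing by $n-1$ yields approximately 63.6396, the two values discussed in the question.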

  • 0
    Long story short: Better use n than n-1. 2017-02-21

The formula that you need is about halfway down the Wikipedia page on the standard deviation, in the section "Identities and mathematical properties" (the last formula in the middle).

Personally, in computer code I would calculate three quantities \begin{eqnarray*} S1=\sum_{i=1}^{N} 1 =N\\ Sx=\sum_{i=1}^{N} x_i \\ Sxx=\sum_{i=1}^{N} x_i^2 \end{eqnarray*} It is obvious how to iterate these: each new data point adds $1$, $x_i$ and $x_i^2$ to the respective sum. The mean and standard deviation are then easily calculated as follows \begin{eqnarray*} \mu_N= \frac{Sx}{S1} \\ \sigma_N=\sqrt{ \frac{Sxx}{S1}-\left(\frac{Sx}{S1}\right)^2} \end{eqnarray*} It is this final formula that is on Wikipedia and that I can never seem to remember, but it is easy to derive from scratch.
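As a rough illustration of the three-running-sums idea, here is a small Python sketch. It computes the population standard deviation as $\sqrt{Sxx/S1 - (Sx/S1)^2}$, which is the identity $\sigma^2 = E[x^2] - (E[x])^2$. The class name `RunningSums` and its method names are assumptions made for this example.

```python
import math


class RunningSums:
    """Accumulates S1 (count), Sx (sum) and Sxx (sum of squares)."""

    def __init__(self):
        self.s1 = 0
        self.sx = 0.0
        self.sxx = 0.0

    def add(self, x):
        # Each new data point updates all three sums in constant time.
        self.s1 += 1
        self.sx += x
        self.sxx += x * x

    def mean(self):
        return self.sx / self.s1

    def population_sd(self):
        m = self.mean()
        variance = self.sxx / self.s1 - m * m
        # Rounding can push the difference slightly below zero, so clamp it.
        return math.sqrt(max(variance, 0.0))


sums = RunningSums()
for value in [10, 100]:
    sums.add(value)
print(sums.mean(), sums.population_sd())
```

For [10,100] this gives a mean of 55 and a standard deviation of 45. One caveat: when the mean is large compared to the spread of the data, subtracting two nearly equal numbers can lose precision in floating point, so an update of the running mean and sum of squared deviations may be the safer choice for huge lists.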