3
$\begingroup$

We've learned this algorithm in class but I'm not sure I've fully understood the correctness of this approach.

It is known as Welford's algorithm for the sum of squares, as described in: Welford, B.P., "Note on a Method for Calculating Corrected Sums of Squares and Products", Technometrics, Vol. 4, No. 3 (Aug., 1962), pp. 419-420

The two-pass algorithm to estimate the sample deviation is simple. We first estimate the mean by one pass and then calculate the standard deviation. In short, it is $s_n^2 = \frac{1}{n-1} \sum_{i=1}^n (Y_i - \hat{Y})^2$.

The one-pass approach has three steps.

1) $v_1 = 0$

2) $v_k = v_{k-1} + \frac{1}{k(k-1)} (\sum_{i=1}^{k-1} Y_i - (k-1) Y_k)^2 (2 \leq k \leq n)$

3) $s_n^2 = \frac{v_n}{n-1}$.

Can someone help me understand how this approach works?

  • 2
    This isn't quite "one-pass" because step (2) contains a sum from $i=1$ to $k-1$. Better to say something like $m_1 = Y_1$, $m_k = m_{k-1} + Y_k$, $v_k = v_{k-1} + \frac{1}{k(k-1)} (m_{k-1} - (k-1)Y_k)^2$2012-02-29

1 Answers 1

4

$v_n$ is going to be $\sum_{i=1}^n (Y_i - \overline{Y}_n)^2$ where $\overline{Y}_n = \frac{1}{n} \sum_{i=1}^n Y_i$. Note that by expanding out the square, $v_n = \sum_{i=1}^n Y_i^2 - \frac{1}{n} \left(\sum_{i=1}^n Y_i\right)^2$. In terms of $m_k = \sum_{i=1}^k Y_i$, we have $v_n = \sum_{i=1}^n Y_i^2 - \frac{1}{n} m_n^2 = \sum_{i=1}^{n-1} Y_i^2 + Y_n^2 - \frac{1}{n} (m_{n-1} + Y_n)^2$ On the other hand, $v_{n-1} = \sum_{i=1}^{n-1} Y_i^2 - \frac{1}{n-1} m_{n-1}^2$ Since $(m_{n-1} + Y_n)^2 = m_{n-1}^2 + 2 Y_n m_{n-1} + Y_n^2$, the difference between these is $ v_n - v_{n-1} = \left(1 - \frac{1}{n}\right) Y_n^2 + \left(\frac{1}{n-1} - \frac{1}{n}\right) m_{n-1}^2 - \frac{2}{n} Y_n m_{n-1} $ $ = \frac{1}{n(n-1)} m_{n-1}^2 - \frac{2}{n} Y_n m_{n-1} +\frac{n-1}{n} Y_n^2 $ $ = \frac{1}{n(n-1)} \left(m_{n-1} - (n-1) Y_n\right)^2 $

  • 0
    If I understand this correctly, one initializes $m_0 = v_0 = v_1 = 0$. One would then begin calculating $m_n$ from the first value ($n=1$), but wait to begin calculation of $v_n$ until the second value ($n=2$). $v_1 = (Y_1-\bar{Y_1})^2$ evaluates to zero; however, if, at the first value in the series ($n=1$), one uses the recursive relation, $v_1=v_0+\frac{1}{n(n-1)}\left(m_{n-1}-(n-1)Y_n\right)^2$, the expression diverges due to the $\frac{1}{n-1}$ term. Therefore, the recursive relation for $v_n$ should be delayed until the second value in the series.2012-11-01