
I am writing code to calculate statistical moments (mean, variance, skewness, kurtosis) for large samples of data. I need to calculate the moments for subsections of the sample (in parallel), then combine/merge them to give the moments for the sample as a whole.

For example:

$S = \lbrace 1.0, 1.2, 2.0, 1.7, 3.4, 0.9 \rbrace $

$A = \lbrace 1.0, 1.2, 2.0 \rbrace$ and $B = \lbrace 1.7, 3.4, 0.9 \rbrace$

So $A \cup B = S$

I need to calculate the statistics/moments for $A$ and $B$, then combine them to give the statistics/moments for $S$


Count is simple: $n_S = n_A + n_B$

Mean is not much worse: $\mu_S = (n_A\mu_A + n_B\mu_B) / n_S$

Variance is a little less pretty: $\sigma^2_S = [n_A\sigma^2_A + n_B\sigma^2_B + (\frac{n_An_B}{n_A+n_B})(\mu_A - \mu_B)^2] / n_S$


But now I'm struggling for skewness and, in particular, kurtosis. I have all 'lesser' moments for each of the subsections of the data available and have some idea of the direction I'm heading, but am really struggling to derive the formulae needed.

Has anybody derived these formulae before? Could anyone point me in the right direction? These may be simple/obvious things to anyone with a decent amount of statistical knowledge; unfortunately, that's something I completely lack...

1 Answer

I happened to solve exactly this problem at my previous job.

Suppose you are given samples $A$ and $B$ of sizes $n_A$ and $n_B$ with means $\mu_A$ and $\mu_B$, and you want to calculate the mean, variance, etc. for the combined set $X=A\cup B$, which has $n_X = n_A + n_B$ elements. A pivotal quantity is the difference in means

$\delta = \mu_B - \mu_A$

This already appears in your formula for variance. You could re-write your formula for the mean to include it as well, although I won't. I will, however, re-write your formulas to work with extensive terms (sums, sums of squares) rather than intensive terms (means, variances):

$S_X = n_X\mu_X = n_A\mu_A + n_B\mu_B = S_A + S_B$

$S^2_X = n_X \sigma_X^2 = n_A\sigma_A^2 + n_B\sigma_B^2 + \frac{n_A n_B}{n_X} \delta^2 = S^2_A + S^2_B + \frac{n_A n_B}{n_X} \delta^2$

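Note that for $j \ge 2$, $S^j_X = \sum_{x \in X} (x - \mu_X)^j$ is the sum of the $j$-th powers of the differences from the mean, while $S_X$ (no superscript) is the plain sum of the values.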
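As a quick sanity check, here are the first two formulas applied to the sample data from the question, $A = \lbrace 1.0, 1.2, 2.0 \rbrace$ and $B = \lbrace 1.7, 3.4, 0.9 \rbrace$. The numbers are worked by hand for illustration only:

$n_A = 3,\ \mu_A = 1.4,\ S^2_A = 0.56 \quad\text{and}\quad n_B = 3,\ \mu_B = 2.0,\ S^2_B = 3.26$

$\delta = \mu_B - \mu_A = 0.6$

$S_X = 4.2 + 6.0 = 10.2, \quad \mu_X = 10.2 / 6 = 1.7$

$S^2_X = 0.56 + 3.26 + \frac{3 \cdot 3}{6} (0.6)^2 = 4.36$

Summing the squared differences of all six values from $\mu_X = 1.7$ directly gives $0.49 + 0.25 + 0.09 + 0 + 2.89 + 0.64 = 4.36$, which agrees.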

The formula for the sum of third powers, $S^3_X$, is

$S^3_X = S^3_A + S^3_B + \frac{n_A n_B (n_A-n_B)}{n^2_X} \delta^3 + 3 \frac{n_A S^2_B - n_B S^2_A}{n_X} \delta$

and for the sum of fourth powers

$S^4_X = S^4_A + S^4_B + \frac{n_A n_B (n_A^2 - n_A n_B + n_B^2)}{n^3_X} \delta^4 + 6\frac{n^2_A S^2_B + n^2_B S^2_A}{n^2_X} \delta^2 + 4\frac{n_A S^3_B - n_B S^3_A}{n_X} \delta $
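The merge formulas above can be sketched in Python roughly as follows. This is a minimal illustration, not a tuned implementation: the names `Moments`, `from_samples` and `merge` are hypothetical, and the handling of empty partitions is an added assumption that the formulas themselves do not cover.

```python
from typing import NamedTuple, Sequence


class Moments(NamedTuple):
    n: int      # number of values
    s1: float   # plain sum of the values (S_X)
    m2: float   # sum of squared differences from the mean (S^2_X)
    m3: float   # sum of cubed differences from the mean (S^3_X)
    m4: float   # sum of fourth-power differences from the mean (S^4_X)


def from_samples(xs: Sequence[float]) -> Moments:
    """Compute the sums directly from the data (the 'traditional' way)."""
    n = len(xs)
    if n == 0:
        return Moments(0, 0.0, 0.0, 0.0, 0.0)
    s1 = sum(xs)
    mean = s1 / n
    devs = [x - mean for x in xs]
    return Moments(
        n,
        s1,
        sum(d ** 2 for d in devs),
        sum(d ** 3 for d in devs),
        sum(d ** 4 for d in devs),
    )


def merge(a: Moments, b: Moments) -> Moments:
    """Combine the sums of two disjoint partitions A and B."""
    # Assumption: an empty partition contributes nothing.
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.s1 / b.n - a.s1 / a.n  # delta = mu_B - mu_A
    s1 = a.s1 + b.s1
    m2 = a.m2 + b.m2 + a.n * b.n / n * delta ** 2
    m3 = (
        a.m3 + b.m3
        + a.n * b.n * (a.n - b.n) / n ** 2 * delta ** 3
        + 3 * (a.n * b.m2 - b.n * a.m2) / n * delta
    )
    m4 = (
        a.m4 + b.m4
        + a.n * b.n * (a.n ** 2 - a.n * b.n + b.n ** 2) / n ** 3 * delta ** 4
        + 6 * (a.n ** 2 * b.m2 + b.n ** 2 * a.m2) / n ** 2 * delta ** 2
        + 4 * (a.n * b.m3 - b.n * a.m3) / n * delta
    )
    return Moments(n, s1, m2, m3, m4)


# Sample data from the question: merge the two halves, then compare
# against the sums computed directly on the whole sample.
A = [1.0, 1.2, 2.0]
B = [1.7, 3.4, 0.9]
merged = merge(from_samples(A), from_samples(B))
direct = from_samples(A + B)
```

Comparing `merged` with `direct` field by field, within a small floating-point tolerance, is a cheap way to catch a transcription error in any of the terms. Because `merge` takes and returns the same type, partial results from parallel workers can be folded together pairwise in any order.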

Once you have these quantities, you can calculate the quantities you're interested in:

$\mu_X = \frac{S_X}{n_X}$

$\sigma^2_X = \frac{S^2_X}{n_X}$

$s_X = \frac{\sqrt{n_X}S^3_X}{(S^2_X)^{3/2}}$

$\kappa_X = \frac{n_X S^4_X}{(S^2_X)^2}$
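As a rough sketch, the conversion from the sums to the final statistics might look like this in Python. The function name `summarize` is hypothetical, and returning NaN for constant data (where $S^2_X = 0$) is an assumption, since skewness and kurtosis are undefined in that case.

```python
import math


def summarize(n, s1, m2, m3, m4):
    """Turn the extensive sums into mean, variance, skewness and kurtosis."""
    mean = s1 / n
    variance = m2 / n  # population variance (divides by n, not n - 1)
    if m2 == 0:
        # Assumption: constant data has no defined skewness or kurtosis.
        return mean, variance, float("nan"), float("nan")
    skewness = math.sqrt(n) * m3 / m2 ** 1.5
    kurtosis = n * m4 / m2 ** 2  # non-excess kurtosis
    return mean, variance, skewness, kurtosis


# Sums for the full sample from the question, computed directly here
# so that the example stands on its own.
data = [1.0, 1.2, 2.0, 1.7, 3.4, 0.9]
n = len(data)
s1 = sum(data)
mu = s1 / n
m2, m3, m4 = (sum((x - mu) ** j for x in data) for j in (2, 3, 4))

mean, variance, skewness, kurtosis = summarize(n, s1, m2, m3, m4)
```

Two conventions are worth keeping in mind when comparing against a library: the variance above divides by $n_X$, so it is the population variance rather than the unbiased sample variance, and the kurtosis is the non-excess form, which equals 3 for a normal distribution.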

Needless to say, you should write unit tests that compare the output from these formulas to the ones computed in the 'traditional' way to make sure that you (or I) haven't made a mistake somewhere :)

  • Chris, thanks for all this, you've certainly saved me some hours of deriving this myself. Thanks for the link to the paper too; I'm already using incremental algorithms, but some of the other items may well come in handy later on. (2012-10-22)