4

I have about 3.5GB of data that I need to get statistics from, split up into files of about 71MB each, with an (approximately) equal number of samples per file. From this data, I would like to compute the mean and standard deviation. Loading the entire 3.5GB at once to compute the standard deviation is probably a bad idea.

However, I know that for the mean I can at least (with some accuracy, since the files are approximately the same size) take the average of each file, and then take the average of those sub-averages.

With standard deviation it's a little trickier, though. I've run some tests and found that the standard deviation of a large sample seems to be approximately equal to the average of the standard deviations of equally sized smaller chunks of samples. Does this actually hold, or was that just a coincidence in the few tests I ran? If it does hold, can I estimate what my percent error is likely to be? Finally, is there a more accurate way to compute the standard deviation that doesn't require me to mull over 3.5GB of data at a time?

  • 0
    You can do it in one pass. See my answer. 2011-05-04

3 Answers

5

Posting as an answer in response to comments.

Here's a way to compute the mean and standard deviation in one pass over the file. (Pseudocode.)

    n = r1 = r2 = 0;
    while (more_samples()) {
        s = next_sample();
        n += 1;
        r1 += s;
        r2 += s*s;
    }
    mean = r1 / n;
    stddev = sqrt(r2/n - (mean * mean));

Essentially, you keep a running total of the sum of the samples and the sum of their squares. This lets you easily compute the standard deviation at the end.
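A minimal Python version of the pseudocode above; the function name and the plain-iterable interface are stand-ins for however you actually stream samples out of your files:

```python
import math

def one_pass_stats(samples):
    """Compute the mean and (population) standard deviation in a single
    pass, keeping only the count, the running sum of the samples, and
    the running sum of their squares."""
    n = 0
    r1 = 0.0  # running sum of samples
    r2 = 0.0  # running sum of squared samples
    for s in samples:
        n += 1
        r1 += s
        r2 += s * s
    mean = r1 / n
    stddev = math.sqrt(r2 / n - mean * mean)
    return mean, stddev
```

For example, `one_pass_stats([2, 4, 4, 4, 5, 5, 7, 9])` returns a mean of 5.0 and a standard deviation of 2.0.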

  • 0
    @ashays: If the variance is not much tinier than the data entries, then yes, one-pass is sufficient. If the variance is much smaller relative to the data entries, you have to use the corrected two-pass scheme. 2011-05-06
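As the comment above notes, the sum-of-squares formula can lose precision through cancellation when the variance is tiny relative to the magnitude of the entries. Besides the corrected two-pass scheme, one well-known remedy that stays single-pass is Welford's online algorithm; a sketch:

```python
import math

def welford_stats(samples):
    """Welford's online algorithm: a numerically stable one-pass method
    that updates a running mean and the sum of squared deviations from
    it, avoiding the cancellation in the sum-of-squares formula."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for s in samples:
        n += 1
        delta = s - mean
        mean += delta / n
        m2 += delta * (s - mean)
    return mean, math.sqrt(m2 / n)  # population standard deviation
```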
1

You have a sampling problem. I would treat your large sample as a population, and then sample from that. First take a random sample of data, and test that the data is normally distributed and free of outliers. Then compute the sample standard deviation along with confidence intervals using a chi-square distribution.

$\sqrt{\frac{(n-1)s^2}{\chi^2\left(df,\frac{\alpha}{2}\right)}} < \sigma < \sqrt{\frac{(n-1)s^2}{\chi^2\left(df,1-\frac{\alpha}{2}\right)}}$

Bear in mind that there will always be a margin of error unless you compute the entire population, which seems impractical in your case.
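A sketch of the interval above in Python, assuming SciPy is available for the chi-square inverse CDF. Note that `scipy.stats.chi2.ppf` is the lower-tail quantile, so its tail arguments are swapped relative to the upper-tail notation used in the formula:

```python
import math
from scipy.stats import chi2  # assumes SciPy is installed

def stddev_confidence_interval(s, n, alpha=0.05):
    """Chi-square confidence interval for a population standard
    deviation, given a sample standard deviation s computed from n
    (approximately normal) samples."""
    df = n - 1
    lower = math.sqrt(df * s ** 2 / chi2.ppf(1 - alpha / 2, df))
    upper = math.sqrt(df * s ** 2 / chi2.ppf(alpha / 2, df))
    return lower, upper
```

The returned interval always brackets the sample value `s` itself, and it narrows as `n` grows.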

  • 0
    $s^2$ = the square of the computed standard deviation, $df$ = degrees of freedom (sample size $-$ 1), and $\alpha$ is the significance level (typically between .05 and .10). 2011-05-04
1

You seem to have about $f=50$ files. If you know the mean $\mu_i$ and variance $\sigma_i^2$ (square of the standard deviation) and number of elements $n_i$ of each file, then your overall mean should be

$\mu = \frac{\sum_{i=1}^{f} n_i \mu_i }{\sum_{i=1}^{f} n_i}$

and your overall variance

$\sigma^2 = \frac{\sum_{i=1}^{f} n_i \left(\sigma_i^2 +(\mu_i-\mu)^2\right) }{\sum_{i=1}^{f} n_i}.$

If you have forgotten to collect the number of elements of each file but are confident they are each the same then you can use

$\mu = \frac{1}{f}\sum_{i=1}^{f} \mu_i $

which is the mean of the means, and

$\sigma^2 = \frac{1}{f}\sum_{i=1}^{f} \left(\sigma_i^2 +(\mu_i-\mu)^2\right) .$

The wrong thing to do would be to take the average of the standard deviations (or even the average of the variances), since this would ignore the spread of the per-file means and so produce a result that is too small.
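A short sketch of the weighted formulas above (the function name is my own), pooling per-file counts, means, and variances:

```python
def combine_stats(counts, means, variances):
    """Pool per-file statistics into an overall mean and variance.
    The (mu_i - mu)^2 term accounts for the spread of the per-file
    means, which simply averaging the variances would miss."""
    n_total = sum(counts)
    mu = sum(n * m for n, m in zip(counts, means)) / n_total
    var = sum(n * (v + (m - mu) ** 2)
              for n, m, v in zip(counts, means, variances)) / n_total
    return mu, var
```

For example, two files holding `[0, 2]` and `[10, 12]` each have variance 1, so averaging the variances gives 1; but `combine_stats([2, 2], [1.0, 11.0], [1.0, 1.0])` gives the true pooled variance of 26, matching the variance of `[0, 2, 10, 12]` computed directly.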