I have about 3.5GB worth of data that I need to get statistics from, split up into files of about 71MB each, each containing an (approximately) equal number of samples. From this data, I would like to compute the mean and standard deviation. Parsing the entire 3.5GB at once just to get the standard deviation is probably a bad idea.
For the mean, however, I know I can at least take the average of each file and then average those sub-averages, which should be reasonably accurate since the files are approximately the same size.
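Roughly, this is the per-file averaging I mean (a sketch only; the file list and the use of `np.loadtxt` are stand-ins for however each ~71MB file actually gets read):

```python
import numpy as np

def mean_of_file_means(paths):
    """Approximate the global mean by averaging one mean per file."""
    file_means = []
    for path in paths:
        samples = np.loadtxt(path)      # load one ~71MB file at a time
        file_means.append(samples.mean())
    # Because each file holds (approximately) the same number of samples,
    # the unweighted average of the per-file means approximates the global mean.
    return sum(file_means) / len(file_means)
```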
Standard deviation is a little trickier, though. I've taken some time to run tests and found that the standard deviation of a large sample seems to be approximately equal to the average of the standard deviations of equally sized smaller chunks of that sample. Does this actually hold, or was it just a coincidence in the few tests I've run? If it does hold, can I calculate what my percent error is likely to be? Finally, is there a more accurate way to compute the standard deviation that doesn't require me to churn through all 3.5GB of data at once?
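For reference, this is roughly the kind of test I've been running, on a synthetic dataset small enough to hold in memory (the normal distribution and chunk count here are just placeholders for my real data):

```python
import numpy as np

def avg_of_chunk_stddevs(data, n_chunks):
    """Average the standard deviations of equally sized chunks of `data`."""
    chunks = np.array_split(data, n_chunks)
    return np.mean([chunk.std() for chunk in chunks])

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1_000_000)

print(data.std())                        # std dev of the whole sample
print(avg_of_chunk_stddevs(data, 50))    # average of the per-chunk std devs
```

In my runs the two printed values come out close to each other, which is what prompted the question.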