I'm trying to show that the sizes of files stored on an HDD come from a normal distribution.
I already wrote some Java code that gives me a list of all files on the disk with their sizes in bytes:
...
20380 html
21459 html
271 gif
271 gif
145 gif
72 gif
838 gif
...
Because of the huge number of files (about 500k), a continuous distribution seemed like a good idea. I'm binning the sizes into intervals anyway, so the fact that the sizes are integers rather than real numbers shouldn't be a problem. The resulting intervals are:
...
(400, 520] 12464x
(520, 676] 7157x
(676, 879] 90909x
(879, 1140] 15184x
(1140, 1480] 18992x
(1480, 1920] 17396x
(1920, 2490] 16042x
(2490, 3230] 16132x
(3230, 4190] 14016x
...
(http://pastebin.com/CX5tLPV5)
For the parameters I have:
$ \widehat{\mu } = \overline{x} = \frac{\sum_{i=1}^{n} x_i}{n} = 844917 $ $ \widehat{\sigma} = s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \overline{x})^{2}}{ n - 1 }} = 4281667 $
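For reference, here is a minimal sketch of how these two estimates can be computed in Java. The class and method names (`SampleStats`, `mean`, `sampleStdDev`) are hypothetical, and the sample array holds only the few sizes visible in the excerpt of the listing, not the full data set.

```java
// Hypothetical helper, not the original program: sample mean and
// sample standard deviation (n - 1 in the denominator).
final class SampleStats {

    // Arithmetic mean of the file sizes.
    static double mean(long[] sizes) {
        double sum = 0.0;
        for (long s : sizes) {
            sum += s;
        }
        return sum / sizes.length;
    }

    // Two-pass sample standard deviation: first the mean, then the
    // squared deviations. Requires at least two values.
    static double sampleStdDev(long[] sizes) {
        double m = mean(sizes);
        double sumSq = 0.0;
        for (long s : sizes) {
            double d = s - m;
            sumSq += d * d;
        }
        return Math.sqrt(sumSq / (sizes.length - 1));
    }

    public static void main(String[] args) {
        // Only the sizes shown in the excerpt above (illustrative input).
        long[] sizes = {20380, 21459, 271, 271, 145, 72, 838};
        System.out.println("mean = " + mean(sizes));
        System.out.println("s    = " + sampleStdDev(sizes));
    }
}
```

The two-pass form is used here because summing squares of sizes in the gigabyte range in a single pass can lose precision.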
From $ \pi_{0,i} = F(x_i) - F(x_{i-1}) $ I expect to get the expected relative frequency, but what does $ x_i $ mean? Some particular number from the i-th interval?
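To make the formula concrete, here is a sketch under the assumption that $ x_{i-1} $ and $ x_i $ are the lower and upper endpoints of the i-th interval, and that $ F $ is the normal CDF with the estimated parameters. That reading is my assumption, not something I have confirmed. Java has no built-in error function, so the sketch swaps in the Abramowitz-Stegun approximation 7.1.26 (absolute error about 1.5e-7); the names `NormalBins`, `erf`, `cdf` and `binProbability` are hypothetical.

```java
// Hypothetical sketch: expected relative frequency of an interval
// (lower, upper] under a normal distribution N(mu, sigma^2).
final class NormalBins {

    // Abramowitz-Stegun 7.1.26 approximation of erf(x);
    // this is an approximation, not an exact implementation.
    static double erf(double x) {
        double sign = x < 0 ? -1.0 : 1.0;
        double ax = Math.abs(x);
        double t = 1.0 / (1.0 + 0.3275911 * ax);
        double poly = t * (0.254829592
                + t * (-0.284496736
                + t * (1.421413741
                + t * (-1.453152027
                + t * 1.061405429))));
        return sign * (1.0 - poly * Math.exp(-ax * ax));
    }

    // Normal CDF F(x) for mean mu and standard deviation sigma.
    static double cdf(double x, double mu, double sigma) {
        return 0.5 * (1.0 + erf((x - mu) / (sigma * Math.sqrt(2.0))));
    }

    // pi_{0,i} = F(upper) - F(lower), assuming x_i are interval endpoints.
    static double binProbability(double lower, double upper,
                                 double mu, double sigma) {
        return cdf(upper, mu, sigma) - cdf(lower, mu, sigma);
    }

    public static void main(String[] args) {
        double mu = 844917;     // estimate from the post
        double sigma = 4281667; // estimate from the post
        double[] edges = {400, 520, 676, 879, 1140}; // first few endpoints
        for (int i = 1; i < edges.length; i++) {
            double p = binProbability(edges[i - 1], edges[i], mu, sigma);
            System.out.printf("(%.0f, %.0f] -> %.3e%n", edges[i - 1], edges[i], p);
        }
    }
}
```

Multiplying each probability by the number of files would give the expected count for that interval. Because the intervals here are very narrow compared to $ \widehat{\sigma} $, the approximation error of `erf` may matter, so a more accurate CDF from a statistics library would be safer for real use.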
Furthermore, according to Wikipedia, $ f(x; \mu , \sigma ^2) = \frac{1}{\sigma \sqrt{2 \pi}}\cdot e^{-\frac{1}{2}(\frac{x - \mu}{\sigma})^2} $ should give me a point on the bell curve, but evaluated at the lower endpoints of the intervals it gives:
... 9.317452E-8 5.6513205E-8 5.6513205E-8 1.26098E-8 1.0350754E-9 1.0350754E-9 3.4722914E-13 1.419046E-15 1.1799776E-21 4.9485846E-34 0.0 ...
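For comparison, a minimal sketch of evaluating the density in Java, with $ \sigma $ in the normalizing constant as in the standard normal density. The names `NormalDensity` and `pdf` are hypothetical. With the estimates above, the peak value $ 1/(\widehat{\sigma}\sqrt{2\pi}) $ is about 9.3e-8, and once the standardized value $ (x - \mu)/\sigma $ exceeds roughly 39, the exponential underflows to 0.0 in double precision.

```java
// Hypothetical sketch: normal density f(x; mu, sigma^2).
final class NormalDensity {

    // Density of N(mu, sigma^2) at x; sigma is the standard deviation.
    static double pdf(double x, double mu, double sigma) {
        double u = (x - mu) / sigma; // standardized value
        return Math.exp(-0.5 * u * u) / (sigma * Math.sqrt(2.0 * Math.PI));
    }

    public static void main(String[] args) {
        double mu = 844917;     // estimate from the post
        double sigma = 4281667; // estimate from the post
        // A few lower endpoints of the intervals listed in the post.
        double[] lowerEndpoints = {400, 520, 676, 879};
        for (double x : lowerEndpoints) {
            System.out.println(x + " -> " + pdf(x, mu, sigma));
        }
    }
}
```

Note that these are density values, not probabilities: to compare them with relative frequencies they would still have to be multiplied by the interval width, or replaced by differences of the CDF.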
Using tables, on the other hand, for $ x = 3475374080 $ I get
$ u = \frac{x-\mu}{\sigma} \approx 811 $
which is quite big.
I'm so confused that I guess it can't get any worse. So what am I missing? Any help would be appreciated.