
I'm trying to prove that the sizes of files stored on an HDD come from a normal distribution.

I have already written some Java code that gives me a list of all files on the disk, with file sizes in bytes:

...
20380 html
21459 html
271 gif
271 gif
145 gif
72 gif
838 gif
...

Because of the huge number of files (about 500k), a continuous distribution seemed like a good idea. I'm grouping the sizes into intervals anyway, so the fact that file sizes are integers rather than real numbers shouldn't be a problem. The resulting intervals are:

...
(400, 520> 12464x
(520, 676> 7157x
(676, 879> 90909x
(879, 1140> 15184x
(1140, 1480> 18992x
(1480, 1920> 17396x
(1920, 2490> 16042x
(2490, 3230> 16132x
(3230, 4190> 14016x
...

(http://pastebin.com/CX5tLPV5)

For the parameters I have:

$$ \widehat{\mu } = \overline{x} = \frac{\sum_{i=1}^{n} x_i}{n} = 844917 $$ $$ \widehat{\sigma} = s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \overline{x})^{2}}{ n - 1 }} = 4281667 $$
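
A minimal Python sketch of these two estimators, assuming the sizes are available as a plain list. Only the seven sizes shown in the listing above are used here, so the numbers it prints are not the estimates quoted above; the variable names are illustrative.

```python
import math
import statistics

# A few of the file sizes (in bytes) from the listing above; a real run
# would use all ~500k values.
sizes = [20380, 21459, 271, 271, 145, 72, 838]

n = len(sizes)
mean = sum(sizes) / n
# Sample standard deviation with the n - 1 denominator, as in the formula.
s = math.sqrt(sum((x - mean) ** 2 for x in sizes) / (n - 1))

# Cross-check against the standard library (stdev also divides by n - 1).
assert math.isclose(s, statistics.stdev(sizes))

print(mean, s)
```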

From $$ \pi_{0,i} = F(x_i) - F(x_{i-1}) $$ I expect to get the expected relative frequencies, but what does $x_i$ mean? Some particular number from the $i$-th interval?
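
One common reading of this formula, taken here as an assumption, is that $x_{i-1}$ and $x_i$ are the lower and upper boundaries of the $i$-th interval and $F$ is the CDF of the fitted distribution. Under that reading, a rough Python sketch using the estimates and the interval boundaries listed above looks like this (`normal_cdf` is a helper written for this sketch):

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2), written via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu, sigma = 844917.0, 4281667.0   # estimates quoted above
# Interval boundaries taken from the listed intervals.
edges = [400, 520, 676, 879, 1140, 1480, 1920, 2490, 3230, 4190]

# pi[i] = F(upper boundary) - F(lower boundary) for each interval (lower, upper>.
pi = [normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)
      for a, b in zip(edges, edges[1:])]

for (a, b), p in zip(zip(edges, edges[1:]), pi):
    print(f"({a}, {b}>  expected relative frequency {p:.3e}")
```

Multiplying each value by the total number of files would give the expected count for that interval, which can then be compared with the observed count.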

Furthermore, according to Wikipedia, $$ f(x; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2 \pi}} \cdot e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} $$ should give me a point on the bell curve, but for the lower bounds of the intervals it gives:

... 9.317452E-8
5.6513205E-8
5.6513205E-8
1.26098E-8
1.0350754E-9
1.0350754E-9
3.4722914E-13
1.419046E-15
1.1799776E-21
4.9485846E-34
0.0 ...

Using tables, on the other hand, gives me for $x = 3475374080$

$$ u = \frac{x-\mu}{\sigma} = 811 $$

which is quite big.
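
For reference, a small Python sketch that recomputes this standardized value and evaluates the normal density with the estimates quoted above. It assumes ordinary double-precision arithmetic; `normal_pdf` is a helper written for this sketch.

```python
import math

mu, sigma = 844917.0, 4281667.0   # estimates quoted above
x = 3475374080

# Standardized value u = (x - mu) / sigma.
u = (x - mu) / sigma
print(u)

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

print(normal_pdf(mu, mu, sigma))  # height of the curve at its peak
print(normal_pdf(x, mu, sigma))   # exp() underflows for such a large u
```

The peak height is about 9.3e-8, which agrees with the first density value listed above. A value that far from the mean makes the exponential underflow to zero in double precision, so zeros at the end of such a list are a floating-point effect rather than a bug in the formula.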

I'm so confused that I guess it can't get any worse. So what am I missing? Any help would be appreciated.

  • Welcome to Math.SE! I'm sorry that your question went without a response for a week. Most likely, this is because your question does not fall under *probability*, but rather under statistics or data analysis. There is a [separate StackExchange site](http://stats.stackexchange.com/) for questions on these topics. (2013-01-02)
  • That said, trying to fit file sizes to a normal distribution seems unlikely to produce useful results. Since files are not made by compounding independent chunks, there is no reason for their sizes to follow a normal distribution. Maybe you could begin by comparing the histogram of your data to graphs of commonly used distributions? (2013-01-02)
  • When I plotted the intervals above as a graph, it looked pretty much like a normal distribution, except for one tooth around files of 600 B to 900 B on pretty much every system I tested. Of course, that is on an exponential scale; on a linear scale it is probably closest to an exponential distribution, and with that I lose pretty much all the information, and thus even the slightest chance of predicting the curve. (2013-01-23)
  • So I guess you are actually using the [log-normal distribution](http://en.wikipedia.org/wiki/Log-normal_distribution)? Then you should be taking the logarithm of each file size and using those logarithms to calculate $\bar x$ and $\sigma$. The value $\bar x=844917$ does not look like it came from logarithms. (2013-07-15)
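
A minimal Python sketch of the log-normal suggestion in the last comment: take the logarithm of each size first, then estimate the mean and standard deviation from the logarithms. It uses only the seven sizes visible in the question's listing and assumes natural logarithms; neither choice comes from the original data analysis.

```python
import math
import statistics

# Sample sizes (bytes) copied from the listing in the question.
sizes = [20380, 21459, 271, 271, 145, 72, 838]

# Natural log of each size; zero-byte files would have to be excluded
# first, because log(0) is undefined.
logs = [math.log(x) for x in sizes]

mu_log = statistics.mean(logs)      # estimate of the log-scale mean
sigma_log = statistics.stdev(logs)  # sample standard deviation (n - 1)

print(mu_log, sigma_log)
```

With these two parameters, the interval probabilities would be computed from the normal CDF evaluated at the logarithms of the interval boundaries rather than at the boundaries themselves.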
