
I have a dataset giving, for each user, the number of websites visited in a year (each user visits at least 1 website in my data). Half of the users visit 7 or fewer sites, though the top user visits 9384 sites. I want to find a count distribution that fits the data well, but it seems challenging.

Here is the data summary:

  • 46285 observations
  • Mean: 33.1
  • Std. Dev.: 138.5
  • Skewness: 20.0
  • Kurtosis: 808.1

Percentile: Value - Smallest

  • 1%: 1 - 1
  • 5%: 1 - 1
  • 10%: 1 - 1
  • 25%: 1 - 3

Median

  • 50%: 7

Percentile: Value - Largest

  • 75%: 19 - 4947
  • 90%: 53 - 5281
  • 95%: 116 - 7111
  • 99%: 522 - 9384

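I tried Poisson, which obviously doesn't work: a Poisson distribution has variance equal to its mean, but here the standard deviation (138.5) is far larger than the mean (33.1). Negative binomial does not do much better.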
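To put numbers on the mismatch, here is a minimal Python sketch that uses only the summary statistics quoted above. It computes the variance-to-mean ratio (which is 1 for a Poisson) and a method-of-moments negative binomial size parameter, assuming the parameterization variance = mean + mean²/r. The variable names are illustrative, and moment matching is an assumption here, not necessarily how the negative binomial was actually fitted.

```python
# Summary statistics taken from the question.
mean = 33.1
std_dev = 138.5

variance = std_dev ** 2

# A Poisson has variance == mean, so this ratio would be close to 1.
dispersion_index = variance / mean

# Method-of-moments negative binomial (assumed parameterization):
#   variance = mean + mean**2 / r   =>   r = mean**2 / (variance - mean)
nb_size = mean ** 2 / (variance - mean)

# Probability of a zero count implied by that negative binomial.
p_zero = (nb_size / (nb_size + mean)) ** nb_size

print(f"variance / mean   = {dispersion_index:.1f}")
print(f"NB size r (MoM)   = {nb_size:.4f}")
print(f"implied P(X = 0)  = {p_zero:.3f}")
```

The ratio is in the hundreds, and the moment-matched size parameter is far below 1. Such a negative binomial puts most of its mass at zero, which conflicts with every user having at least one visit, so a zero-truncated variant may be worth checking.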

Any suggestions?

Thanks!

  • 0
    That's not even a distribution, is it? I can't ask for the probability that the number of visits is $k$. For what it's worth, I tried the exponential distribution http://en.wikipedia.org/wiki/Exponential_distribution and it's also a bad fit. Again, it's not a discrete distribution. 2012-07-28

1 Answer


I don't know any nice families of count models that might fit this. Have you tried to find any explanatory variables that you might put into a count regression model? Maybe a covariate can explain the extreme values and you can get a better fit. Another possibility is finding a mixture of count distributions.

  • 0
    I was thinking of a situation where you might have a mixture of two Poissons with different rates or a Poisson with a geometric variable. Or combinations of several. The possibilities are endless. But maybe the shape of the distribution and characteristics that you know about the hits could limit the number of models to consider. 2012-07-31
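As a rough illustration of the two-Poisson idea, the sketch below fits a mixture of two Poissons by expectation-maximization in plain Python. The function names (`poisson_logpmf`, `fit_two_poisson_mixture`, `sample_poisson`), the starting values, and the synthetic data with rates 2 and 30 are all assumptions for demonstration. It does not use the visit data, and it ignores the fact that the real counts start at 1.

```python
import math
import random


def poisson_logpmf(k, lam):
    """Log of the Poisson probability mass at count k for rate lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def fit_two_poisson_mixture(counts, n_iter=200):
    """EM for the mixture w * Poisson(lam_lo) + (1 - w) * Poisson(lam_hi)."""
    n = len(counts)
    mean = sum(counts) / n
    w, lam_lo, lam_hi = 0.5, 0.5 * mean, 2.0 * mean  # crude starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each count came from the
        # low-rate component, computed in log space to avoid underflow.
        resp = []
        for k in counts:
            a = math.log(w) + poisson_logpmf(k, lam_lo)
            b = math.log(1.0 - w) + poisson_logpmf(k, lam_hi)
            m = max(a, b)
            ea, eb = math.exp(a - m), math.exp(b - m)
            resp.append(ea / (ea + eb))
        # M-step: update the weight and the responsibility-weighted means.
        n_lo = sum(resp)
        n_hi = n - n_lo
        lam_lo = sum(r * k for r, k in zip(resp, counts)) / max(n_lo, 1e-12)
        lam_hi = sum((1.0 - r) * k for r, k in zip(resp, counts)) / max(n_hi, 1e-12)
        # Keep parameters strictly inside their domains so the logs stay finite.
        w = min(max(n_lo / n, 1e-12), 1.0 - 1e-12)
        lam_lo = max(lam_lo, 1e-12)
        lam_hi = max(lam_hi, 1e-12)
    return w, lam_lo, lam_hi


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for modest rates."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


# Synthetic data (assumed): 70% of users at rate 2, 30% at rate 30.
rng = random.Random(0)
counts = [sample_poisson(2.0 if rng.random() < 0.7 else 30.0, rng)
          for _ in range(2000)]

w, lam_lo, lam_hi = fit_two_poisson_mixture(counts)
print(f"weight of low-rate component: {w:.2f}")
print(f"estimated rates: {lam_lo:.2f}, {lam_hi:.2f}")
```

On data as skewed as the visit counts, two components are unlikely to be enough on their own. The same E-step and M-step extend to more components, or to other component families such as the geometric.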