I have a dataset of the counts of each user visiting a set of websites in a year (each user visits at least 1 website in my data). Half of the users visit 7 or fewer sites, though the top user visits 9384 sites. I want to find a count distribution that fits the data well, but this is proving challenging.

Here is the data summary:

  • 46285 observations
  • Mean: 33.1
  • Std. Dev.: 138.5
  • Skewness: 20.0
  • Kurtosis: 808.1

Percentile: value - smallest observations

  • 1%: 1 - 1
  • 5%: 1 - 1
  • 10%: 1 - 1
  • 25%: 1 - 3

Median

  • 50%: 7

Percentile: value - largest observations

  • 75%: 19 - 4947
  • 90%: 53 - 5281
  • 95%: 116 - 7111
  • 99%: 522 - 9384

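I tried Poisson, which obviously doesn't work because the data are heavily overdispersed: a Poisson has variance equal to its mean, but here the mean (33.1) is far below even the std. dev. (138.5), let alone the variance. Negative binomial doesn't do much better.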
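
To put numbers on the overdispersion, here is a minimal sketch that uses only the summary statistics quoted above. The method-of-moments formula for the negative binomial size parameter is a standard one, but the variable names are illustrative, and the sketch ignores the fact that the counts are truncated at 1, so treat the output as a rough diagnostic rather than a fit.

```python
# Summary statistics copied from the post above.
mean = 33.1
std_dev = 138.5

variance = std_dev ** 2

# For a Poisson the variance equals the mean, so this ratio would be about 1.
dispersion_index = variance / mean

# A negative binomial with mean mu and size r has variance mu + mu^2 / r,
# so matching moments gives r = mu^2 / (variance - mu).
# Assumption: zero-truncation of the data is ignored here.
nb_size = mean ** 2 / (variance - mean)

print(f"variance / mean = {dispersion_index:.1f}")
print(f"method-of-moments NB size r = {nb_size:.4f}")
```

A ratio in the hundreds and a size parameter well below 1 both point to a tail far heavier than a Poisson can produce.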

Any suggestions?

Thanks!

  • For things like this, the received wisdom seems to be a power law: http://en.wikipedia.org/wiki/Power_law How does that fit? (2012-07-28)
  • That's not a count model though, is it? The values here have to be counts. (2012-07-28)
  • You can treat your data as continuous, so that the probability of a given count $n$ goes as $n^{-k}$ for some $k$ chosen to fit the data. (2012-07-28)
  • That's not even a distribution, is it? I can't ask what the probability is that the number of visits equals $k$. For what it's worth, I tried the exponential distribution http://en.wikipedia.org/wiki/Exponential_distribution and it's also a bad fit. Again, it's not a discrete distribution. (2012-07-28)
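
On the objection that a power law is not a count distribution: a power law can be normalized over the positive integers, which gives the zeta (Zipf) distribution, $P(N = n) = n^{-k} / \zeta(k)$ for $n = 1, 2, \dots$ and $k > 1$. The sketch below fits $k$ by maximum likelihood using only the Python standard library. The sample `counts` is made up for illustration and is not the poster's data, and the helper names (`zeta`, `zeta_pmf`, `fit_k`) are hypothetical. Whether a pure power law actually fits the real data would still need checking.

```python
import math

def zeta(k, n_terms=1000):
    """Approximate the Riemann zeta function for k > 1.

    Sums the first n_terms - 1 terms directly and adds an
    Euler-Maclaurin correction for the remaining tail.
    """
    head = sum(n ** -k for n in range(1, n_terms))
    tail = n_terms ** (1 - k) / (k - 1) + 0.5 * n_terms ** -k
    return head + tail

def zeta_pmf(n, k):
    """P(N = n) = n^{-k} / zeta(k), defined for n = 1, 2, 3, ..."""
    return n ** -k / zeta(k)

def neg_log_lik(k, counts):
    """Negative log-likelihood of the zeta distribution."""
    return k * sum(math.log(n) for n in counts) + len(counts) * math.log(zeta(k))

def fit_k(counts, lo=1.01, hi=6.0, tol=1e-6):
    """Golden-section search for the MLE of k.

    The negative log-likelihood is convex in k, so a bracketing
    search is enough. The bounds lo and hi are assumptions.
    """
    phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - phi * (b - a)
        d = a + phi * (b - a)
        if neg_log_lik(c, counts) < neg_log_lik(d, counts):
            b = d
        else:
            a = c
    return (a + b) / 2

# Hypothetical sample of visit counts (all >= 1), for illustration only.
counts = [1, 1, 1, 1, 2, 1, 3, 7, 1, 2, 19, 1, 53, 4, 1, 2, 1, 116, 1, 5]

k_hat = fit_k(counts)
print(f"estimated exponent k = {k_hat:.3f}")
print(f"fitted P(N = 1) = {zeta_pmf(1, k_hat):.3f}")
```

Because the support starts at 1, this also matches the fact that every user in the data visits at least one site, with no zero-truncation adjustment needed.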

1 Answer

I don't know any nice families of count models that might fit this. Have you tried to find any explanatory variables that you might put into a count regression model? Maybe a covariate can explain the extreme values and you can get a better fit. Another possibility is finding a mixture of count distributions.

  • Unfortunately I don't have any user characteristics data. I do have count data for total number of visits (mentioned in the post), pages, minutes, and for each website I know the number of visits from users of 1 of 5 types (but not what type each user is). As for the mixture idea, how do I go about implementing it? What mixture do you have in mind? (2012-07-29)
  • I was thinking of a situation where you might have a mixture of two Poissons with different rates or a Poisson with a geometric variable. Or combinations of several. The possibilities are endless. But maybe the shape of the distribution and characteristics that you know about the hits could limit the number of models to consider. (2012-07-31)
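
As for how one might implement the two-Poisson mixture mentioned above, a common approach is the EM algorithm. The following is a minimal sketch under stated assumptions, not a recommended final model: the data are synthetic (rates 2 and 20, mixing weight 0.7, all chosen arbitrarily), the function names are hypothetical, and the zero-truncation of the real data is ignored. With a tail as heavy as the one in the question, two Poisson components may well be too few.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson variate with Knuth's method (fine for small rates)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def log_poisson_pmf(n, lam):
    """Log of the Poisson probability mass function."""
    return n * math.log(lam) - lam - math.lgamma(n + 1)

def em_two_poisson(counts, n_iter=200):
    """Fit w * Poisson(lam1) + (1 - w) * Poisson(lam2) by EM."""
    mean = sum(counts) / len(counts)
    # Crude starting values on either side of the sample mean (an assumption).
    w, lam1, lam2 = 0.5, 0.5 * mean, 1.5 * mean
    for _ in range(n_iter):
        # E-step: posterior probability that each count came from component 1.
        resp = []
        for n in counts:
            a = math.log(w) + log_poisson_pmf(n, lam1)
            b = math.log(1 - w) + log_poisson_pmf(n, lam2)
            m = max(a, b)  # subtract the max for numerical stability
            ea, eb = math.exp(a - m), math.exp(b - m)
            resp.append(ea / (ea + eb))
        # M-step: weighted updates of the mixing weight and the two rates.
        r_sum = sum(resp)
        w = r_sum / len(counts)
        lam1 = sum(r * n for r, n in zip(resp, counts)) / r_sum
        lam2 = sum((1 - r) * n for r, n in zip(resp, counts)) / (len(counts) - r_sum)
    return w, lam1, lam2

# Synthetic data: 70% of draws from Poisson(2), 30% from Poisson(20).
rng = random.Random(0)
counts = [sample_poisson(2.0 if rng.random() < 0.7 else 20.0, rng)
          for _ in range(2000)]

w, lam1, lam2 = em_two_poisson(counts)
print(f"weight = {w:.3f}, rate 1 = {lam1:.2f}, rate 2 = {lam2:.2f}")
```

The same E-step/M-step structure carries over to more components or to a Poisson-plus-geometric mixture. Only the component log-pmf and its M-step update change.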