6
$\begingroup$

If shuffle-playing playlist ×100 resulted in [10 13 10 3 2 2] different songs being repeated [1 2 3 4 5 6] times, what is the estimate for the total number of songs? (assuming shuffle play was completely random)

Update: (R code)

k <- 50 # k number of songs on the disk indexed 1:k n <- 100 # n number of random song selections m <- 20 # m number of repeat experiments colnum <- 10 mat <- matrix(data=NA,nrow=m,ncol=colnum) df <- as.data.frame(mat) for(i in 1:m){   played <- 1+floor(k*runif(n)) # actual song indices (1:k) selected    freq <- sapply(1:k,function(x){sum(played==x)})   # = number of times song with index x is being played   histo <- sapply(1:colnum,function(x){sum(freq==x)});   for(j in 1:colnum){    df[i,j] <- histo[j]   } } df 

Resulting in: e.g. 20 distributions (V1=number of single plays, V2=number of double plays, etc):

   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 1  15 13 11  1  2  2  0  0  0   0 2  15 12 10  1  4  0  1  0  0   0 3  12 14  7  6  3  0  0  0  0   0 4  17 16  6  4  2  0  1  0  0   0 5  17 10 12  5  0  0  1  0  0   0 6  13 15 11  6  0  0  0  0  0   0 7  10 14  9  3  2  1  1  0  0   0 8  12 17  5  6  3  0  0  0  0   0 9   9 19  8  3  1  2  0  0  0   0 10 13  9 11  6  1  0  1  0  0   0 11 16  9 12  5  2  0  0  0  0   0 12 15  9 11  6  2  0  0  0  0   0 13 19  9  7  4  4  1  0  0  0   0 14 17 11  4  7  3  1  0  0  0   0 15 11 20  8  1  3  1  0  0  0   0 16 14 12 10  5  0  2  0  0  0   0 17  9 12  8  7  3  0  0  0  0   0 18 10 15  9  4  2  0  1  0  0   0 19 14 11 12  7  0  0  0  0  0   0 20 16 14 11  3  1  1  0  0  0   0 

Now I need to get from here to the Poisson modelling--my R is a bit rusty (?lmer)...--Any help would be appreciated...

Attempted Poisson modelling: disappointing fit?!

plot(1:colnum,df[1,1:colnum],ylim=c(0,30),    type="l",xlab="repeats",ylab="count") for(i in 1:m){   clr <- rainbow(m)[i]   lines(1:colnum,df[i,1:colnum],type="l",col=clr)   points(1:colnum,df[i,1:colnum],col=clr) }  df.lambda=data.frame(lambda=seq(1,5,0.1),ssq=c(NA));df.lambda for(ii in 1:dim(df.lambda)[1]){   l <- df.lambda$lambda[ii]       ssq <- 0       for(i in 1:20){         for(j in 1:10){           ssq <- ssq + (df[i,j] - n*dpois(j,l))^2         }       }       print(ssq)       df.lambda$lambda[ii] <- l   df.lambda$ssq[ii] <- ssq     }     df.lambda     lambda.est <- df.lambda$lambda[which.min(df.lambda$ssq)] lambda.est # 2.4 points(x <- 1:10, n*dpois(1:10,lambda.est),type="l",lwd=2)  100*dpois(1:10,3) n/lambda.est 

the estimated lambda stays around 2.3, with an n estimate of around 43; the fitted curve seems very discrepant, and seems to worsen with rising n !? Doesn't this have to do with the fact, that our repeats are different from the 'classical' Poisson distributions: it's not just ONE event that repeats itself x number of times, but the sum of repeats of different items (songs)?!

  • 0
    In that case the song selection may not be random. If I have two versions of the same song title on my iPhone, they are always (?) played sequentially. Otherwise, maybe you are expecting too good a fit. The higher numbers probably have high variance, but I don't know how to calculate it.2012-06-17

3 Answers 3

2

Using the notation of leonbloy, also let $N_1=10$ be the number of songs that were played once. A simple Good-Turing estimator for $M$ is then given by $ \hat M = {{P} \over {1-{N_1 \over N} }}= {{40} \over {1-{10 \over 100}}}=44.4,$ which agrees nicely with the maximum likelihood approach.

4

You can approximate this by a Poisson distribution, which is more valid as the playlist gets longer. You just need to find the value of $\lambda$, the average number of times a song has been played that best fits your data. This is $\frac {100}{number of songs}$ Any numerical analysis book can help, or you can just use the goal seek in Excel on the sum squared error in each bin. As you played 40 different songs, the average song that got played was played $2.5$ times. If we use that for $\lambda$, the chance that a song is not played is about $8.2\%$ so there might be another $3$ songs or so. You even can make a start at checking whether the selection is really random by doing a chi-square test for the distribution.

  • 0
    the Poisson fitting attempt seems rather unsatisfactory (see code and comments above). It feels somehow that this is the wrong track. I think the 'repeats' are different from the usual Poisson repeats...2012-06-16
2

You can compute the likelihood: Let $M$ be the total (unknown) number of songs, $n_i$ the amount of songs that appeared $i$ times, so $ n_1 + 2 n_2 + 3 n_3 \cdots 6 n_6= N = 100$, and let $P=n_1 + n_2 +\cdots + n_6 = 40$

Then, the likelihood is

${\mathbb l}(M) = \left(\frac{1}{M}\right)^N {M \choose n_1} {M -n_1 \choose n_2} \cdots {M - P + n_6 \choose n_6} K = \left(\frac{1}{M}\right)^N \frac{M!}{n_1! n_2! \cdots (M-P)!} K$

where $K$ counts the permutations of $N$ elements with $n_2, n_3 \cdots$ repeated elements - which does not depend on $M$. Taking the loglikehood, ignoring elements that do not depend on $M$ and using the Stirling approximation, we get

$L(M)= -N \log(M) + M ( \log(M) -1) -(M-p) (\log(M-P)-1)$

The maximum of this log-likehood occurs between 44 and 45.