I'm trying to sample from a data set using a binomial distribution with parameters p and n.
Implementation-wise, I follow these steps:

1. I generate an array containing the values of the cumulative distribution function (cdf) of a binomial distribution with parameters p and n:

   cdf[0] = P(X <= 0); cdf[1] = P(X <= 1); ...; cdf[n] = P(X <= n)
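For reference, here is a minimal Python sketch of how I build that array (the helper name binomial_cdf is just for illustration; my actual code may differ):

```python
from math import comb

def binomial_cdf(n, p):
    """Return a list where cdf[i] = P(X <= i) for X ~ Binomial(n, p)."""
    cdf = []
    running_total = 0.0
    for k in range(n + 1):
        # P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
        running_total += comb(n, k) * p ** k * (1 - p) ** (n - k)
        cdf.append(running_total)
    return cdf
```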
2. I iterate through the data set and, for each record, do the following (see the sketch after this list):
   2.1. I use a random number generator to draw a number u.
   2.2. I search for the position where u would fit in the cdf array and take the index i of the cdf entry to the left of that position.
   2.3. I sample that record i times, because cdf[i] = P(X <= i).
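In code, the per-record steps look roughly like this (a Python sketch reusing binomial_cdf from above; bisect_left performs the search in step 2.2, and I apply the "entry to the left" rule literally; records, n, and p are placeholders):

```python
from bisect import bisect_left
import random

def sample_count(cdf, u):
    """Steps 2.2-2.3: map a uniform draw u to a count i via the cdf array."""
    pos = bisect_left(cdf, u)  # position where u would fit in the sorted cdf
    return pos - 1             # index of the cdf entry to the left of that position

records = ["a", "b", "c"]  # stand-in for my real data set
n, p = 10, 0.4
cdf = binomial_cdf(n, p)

sampled = []
for record in records:
    u = random.random()           # 2.1: draw a number in [0, 1)
    i = sample_count(cdf, u)      # 2.2: search the cdf for u
    sampled.extend([record] * i)  # 2.3: take the record i times
```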
The problem is that I would expect the average of the i's from step 2.3 above (the number of times a record is sampled) to be n*p, the mean of a binomial distribution, even in ONE run of this algorithm. Unfortunately, it isn't. Is this related to the fact that I run the algorithm only once? Should I have this kind of expectation only when I run the algorithm a sufficiently large number of times?
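For concreteness, this is the kind of check I run (reusing binomial_cdf and sample_count from the sketches above; the 100,000 draws are just to make the average stable):

```python
n, p = 20, 0.25
cdf = binomial_cdf(n, p)
draws = [sample_count(cdf, random.random()) for _ in range(100_000)]
print(sum(draws) / len(draws))  # I would expect this to be close to n * p = 5.0
```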
Can you suggest how I could determine the i's (the number of times each record should be sampled) so that they follow a binomial distribution?