12

Suppose I have independent random variables $X_i$ that are binomially distributed: $$X_i \sim \mathrm{Bin}(n_i, p_i).$$

Are there relatively simple formulae, or at least bounds, available for the distribution of $$S = \sum_i X_i$$?

4 Answers

9

See the paper "The Distribution of a Sum of Binomial Random Variables" by Ken Butler and Michael Stephens.
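
For what it's worth, $S$ is exactly a Poisson binomial random variable: each $X_i$ is a sum of $n_i$ independent Bernoulli$(p_i)$ trials, so $S$ is a sum of $\sum_i n_i$ independent Bernoulli trials with each $p_i$ repeated $n_i$ times. A minimal R sketch, assuming the CRAN package poibin (mentioned in the comments below) and its dpoibin/ppoibin interface:

# exact PMF/CDF of S via the Poisson binomial distribution
# (assumes the CRAN package 'poibin'; install.packages("poibin") if needed)
library(poibin)

n <- c(10, 20, 5)        # example n_i
p <- c(0.3, 0.5, 0.9)    # example p_i

# S is Poisson binomial with each p_i repeated n_i times
pp <- rep(p, times = n)

dpoibin(kk = 15, pp = pp)   # P(S = 15)
ppoibin(kk = 15, pp = pp)   # P(S <= 15)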

  • 0
    I have the same question and I read the paper (The Distribution of a Sum of Binomial Random Variables by Ken Butler and Michael Stephens). Unfortunately, the approximations are not clear to me (for example, how are the probabilities in Table 2 calculated?) (2012-12-06)
  • 0
    @May: I had the same problem recently and ended up using the explicit formula given in the linked paper. It should be fine if you don't have too many samples. (2012-12-07)
  • 1
    @MichaelKuhn: So you used the Poisson binomial distribution function (http://en.wikipedia.org/wiki/Poisson_binomial_distribution)? Unfortunately, I have many samples and need to use an approximation. (2012-12-11)
  • 0
    @May: There seems to be an R package for this distribution: http://cran.r-project.org/web/packages/poibin/poibin.pdf (2012-12-11)
  • 0
    @MichaelKuhn: Thank you. I also found this paper: http://www.sciencedirect.com/science/article/pii/S0167947312003568# (2012-12-11)
  • 0
    @MichaelKuhn: Unfortunately, I use MATLAB in my work, and I could not find a MATLAB package for this distribution. (2012-12-11)
  • 1
    `Page Not Found` ... (2015-09-13)
  • 0
    It is not beautiful, but I implemented the algorithm from this paper in R for another project a few years ago: https://github.com/jvbraun/sumbin (2016-09-18)
4

This answer provides an R implementation of the explicit formula from the paper linked in the accepted answer (The Distribution of a Sum of Binomial Random Variables by Ken Butler and Michael Stephens). The code can in fact be used to convolve any two independent discrete distributions supported on the non-negative integers:

# explicitly combine two probability distributions, expecting a vector of 
# probabilities (first element = count 0)
combine.distributions <- function(a, b) {

    # arrange the arguments so that the loop below iterates over the shorter
    # vector, i.e. the matrix built next has more columns than rows
    if (length(a) > length(b)) {
        tmp <- a
        a <- b
        b <- tmp
    }

    # explicitly multiply the probability distributions
    m <- a %*% t(b)

    # initialize the final result; element 1 = count 0
    result <- rep(0, length(a)+length(b)-1)

    # add the probabilities, shifting each successive row one slot further
    # along the result vector
    for (i in 1:nrow(m)) {
        result[i:(ncol(m)+i-1)] <- result[i:(ncol(m)+i-1)] + m[i,]
    }

    result
}

a <- dbinom(0:1000, 1000, 0.5)
b <- dbinom(0:2000, 2000, 0.9)

ab <- combine.distributions(a, b)
ab.df <- data.frame( N = 0:(length(ab)-1), p = ab)

plot(ab.df$N, ab.df$p, type="l")
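
As a quick sanity check (my addition, not part of the original answer): when both summands share the same success probability, their sum is again binomial, so the convolution above should reproduce dbinom up to floating-point error:

# Bin(100, 0.3) + Bin(50, 0.3) is exactly Bin(150, 0.3)
x <- combine.distributions(dbinom(0:100, 100, 0.3), dbinom(0:50, 50, 0.3))
max(abs(x - dbinom(0:150, 150, 0.3)))   # should be on the order of 1e-16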
3

One short answer is that a normal approximation still works well as long as the variance $\sigma^2 = \sum_i n_i p_i(1-p_i)$ is not too small. Compute the mean $\mu = \sum_i n_i p_i$ and the variance, and approximate $S$ by $N(\mu,\sigma^2)$.
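
A minimal R sketch of this approximation (my own illustration; the parameter values are arbitrary), using a continuity correction for the cumulative probability:

# normal approximation to S = Bin(n_1, p_1) + Bin(n_2, p_2) + ...
n <- c(1000, 2000)
p <- c(0.5, 0.9)

mu    <- sum(n * p)                   # mean of S
sigma <- sqrt(sum(n * p * (1 - p)))   # standard deviation of S

# P(S <= s), normal approximation with continuity correction
s <- 2300
pnorm(s + 0.5, mean = mu, sd = sigma)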

  • 0
    Unfortunately, I cannot say anything about the variance. In what direction would the normal approximation go? (2011-03-30)
  • 1
    Do you mean, "is the normal approximation an overestimate or an underestimate?" That depends on the range of values you are considering. Both distributions have total mass $1$. (2011-03-30)
  • 0
    I'd be interested in an estimate of the expected value. (2011-03-31)
  • 0
    To use a normal approximation, you have to know the mean and variance. (To use the more complicated approximations in the paper PEV cited, you need more information, such as the first four moments.) If you don't know the expected value, then what do you know about these binomial summands? (2011-04-03)
  • 0
    I know $n_i$ and $p_i$ for each of the summands, and hence I know the expected value and variance of each of them. (2011-04-03)
2

It is possible to get a Chernoff bound using the standard moment generating function method. Writing $\mu = \sum_i n_i p_i$ for the mean of $S$, $$ \begin{align} \Pr[S\ge s] &\le E\left[\exp\left(t \sum_i X_i\right)\right]\exp(-st) \\&= \prod_i \left(1 + (e^t-1) p_i\right)^{n_i} \exp(-st) \\&\le \exp\left(\mu(e^t-1)-st\right) \\&= \exp\left(s-\mu-s\log\frac{s}{\mu}\right), \end{align} $$ where we took $t=\log(s/\mu)$ (valid for $s>\mu$). This is basically the standard Chernoff bound for the equal-probability case, with $np$ replaced by the sum $\mu$ (or by the average if you set $s=ns'$).

Here we (perhaps surprisingly) only needed the inequality $1+x\le e^x$; a slightly stronger bound may be possible without it, but it would be much messier.

Another way to look at the bound is that it bounds the moment generating function of each variable by that of a Poisson distribution with the same mean.
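
A small numerical illustration of the bound (my own addition; the parameters are arbitrary), checked against a Monte Carlo estimate of the exact tail probability:

# Chernoff bound exp(s - mu - s*log(s/mu)) for P(S >= s)
n  <- c(30, 50)
p  <- c(0.2, 0.4)
mu <- sum(n * p)                       # mu = sum of n_i * p_i = 26
s  <- 40                               # tail threshold, s > mu

bound <- exp(s - mu - s * log(s / mu))

# Monte Carlo estimate of the exact tail probability
set.seed(1)
sims <- rbinom(1e5, n[1], p[1]) + rbinom(1e5, n[2], p[2])
c(bound = bound, monte_carlo = mean(sims >= s))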