
It is well known that the Beta distribution serves as a conjugate prior for the Bernoulli distribution, and that when you observe a Bernoulli random variable, you need only increment the appropriate hyperparameter of the Beta distribution.

However, when the Bernoulli distribution is "noisy," in the sense that you do not observe the Bernoulli random variable directly but instead observe a random variable that equals the Bernoulli random variable with probability 1-p and is flipped with probability p (where p is known and represents an error rate in observing the Bernoulli random variable), the posterior distribution obtained is a linear combination of two Beta distributions.

When p=0, p=.5, or p=1, the posterior distribution is again Beta, but for other values of p this is not the case.

In my particular application analytic tractability is important. Is there a conjugate prior that would be appropriate for this type of problem?

Failing that, it seems there might be a sensible way to update the hyperparameters of the Beta distribution approximately. Intuitively, when you observe a Bernoulli random variable with error rate p, the information you gain about the parameter theta of the Bernoulli distribution is zero when p=.5 (observations are completely uninformative), maximal when p=0 or p=1, and somewhere in between for other values of p. More specifically, when p=0 or p=1, the sum of the hyperparameters of the posterior Beta is 1 greater than the sum in the prior, while for p=.5 the sum of the hyperparameters remains the same. For other values of p, the change in the sum of the hyperparameters should be intermediate, but I'm not sure how best to choose it.
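To make the setup concrete, here is a small sketch (my own helper function, not part of any established API) of the exact single-observation update: the posterior after one noisy observation is a two-component mixture of Beta densities, and for p=0 it collapses back to the usual conjugate update.

```python
def noisy_posterior(alpha, beta, y, p):
    """Exact posterior after one noisy Bernoulli observation y.

    The true draw x ~ Bernoulli(theta) has prior theta ~ Beta(alpha, beta)
    and is observed correctly with probability 1-p, flipped with
    probability p.  Returns the mixture
    (weight_1, (a_1, b_1)), (weight_2, (a_2, b_2)) of Beta components.
    """
    if y == 1:
        # P(y=1 | theta) = (1-p)*theta + p*(1-theta)
        w1 = (1 - p) * alpha / (alpha + beta)  # weight of Beta(alpha+1, beta)
        w2 = p * beta / (alpha + beta)         # weight of Beta(alpha, beta+1)
    else:
        # P(y=0 | theta) = (1-p)*(1-theta) + p*theta
        w1 = p * alpha / (alpha + beta)
        w2 = (1 - p) * beta / (alpha + beta)
    z = w1 + w2
    return (w1 / z, (alpha + 1, beta)), (w2 / z, (alpha, beta + 1))

# With p = 0 and y = 1, the second component's weight vanishes and we
# recover the standard conjugate update Beta(alpha+1, beta).
mixture = noisy_posterior(2.0, 3.0, 1, 0.0)
```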

  • Why not just use the beta prior and do Gibbs sampling using data augmentation? (2012-12-18)

2 Answers


I'm not sure how much help this will be, but I would approach it like this - it's rather flaky and informal...

If $\alpha$ is the success probability of the original trial ($x \in \{0,1\}$), and $\beta$ is the probability of flipping (which defines $p(y|x)$), then

$p(y=0; \alpha, \beta) = 1-\alpha-\beta+2\alpha\beta = \gamma$

$p(y=1; \alpha, \beta) = \alpha+\beta-2\alpha\beta = 1-\gamma$
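A quick numerical sanity check of the first identity (illustrative values of $\alpha$ and $\beta$ chosen by me): $y=0$ occurs either when $x=0$ and there is no flip, or when $x=1$ and there is a flip.

```python
alpha, beta = 0.3, 0.1  # illustrative values

# P(y=0) by direct enumeration of the two disjoint cases
gamma_direct = (1 - alpha) * (1 - beta) + alpha * beta

# the closed form quoted in the answer
gamma_formula = 1 - alpha - beta + 2 * alpha * beta

assert abs(gamma_direct - gamma_formula) < 1e-12
```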

The Jeffreys prior is

$p_{Jeff}(\alpha) \propto \sqrt{ \frac{(1-2\beta)^2}{\gamma(1-\gamma)}} $

The $\beta$ term is a multiplicative constant and makes no difference except at $\beta = 1/2$, which we can ignore for reasons of continuity, so we just say

$p_{Jeff}(\alpha) \propto \frac{1}{\sqrt{\gamma(1-\gamma)}}$

Basically, this is just a beta distribution over $\gamma$ with parameters 1/2 and 1/2, which you update with your values of $y$. You can write this in terms of $\alpha$ if you wish.

$p(\gamma | y^N) = \frac{ \gamma^{Y - 1/2}(1-\gamma)^{N-Y - 1/2}}{B(Y+1/2,N-Y+1/2)}$

where $Y$ is the number of observations of $y=0$.
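Writing this in terms of $\alpha$ is straightforward because $\gamma$ is affine in $\alpha$, so the change of variables only introduces the constant Jacobian $|1-2\beta|$. A sketch (function name and the example numbers are my own, and the density over $\alpha \in (0,1)$ is left unnormalized since $\gamma(\alpha)$ only covers part of $(0,1)$):

```python
import numpy as np
from scipy import stats

def alpha_posterior_pdf(a, Y, N, b):
    """Unnormalized posterior density of alpha, obtained by evaluating
    the Beta(Y + 1/2, N - Y + 1/2) posterior over gamma at
    gamma(alpha) = 1 - alpha - b + 2*alpha*b and multiplying by the
    constant Jacobian |1 - 2*b|."""
    gamma = 1 - a - b + 2 * a * b
    return stats.beta.pdf(gamma, Y + 0.5, N - Y + 0.5) * abs(1 - 2 * b)

# e.g. N = 10 observations, Y = 3 zeros, flip probability b = 0.1
grid = np.linspace(0.01, 0.99, 99)
density = alpha_posterior_pdf(grid, 3, 10, 0.1)
```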

  • Probably not. While I don't know the error rate for each of the sources exactly, the betas are likely to be something I hard-code values for. That said, I can probably get an estimate of the betas after running this whole procedure, based on what I know about our business. For example, how many customers classed as female in our data have typical male buying patterns would give a rough indication of how good the source of gender data is. (2012-12-17)

What I ended up trying is approximating the linear combination of Beta distributions that represents the exact posterior with a single Beta distribution. Specifically, the approximation method I used was to minimize the cross entropy between the linear combination of Betas and the approximating Beta distribution.

The result of the minimization isn't closed form as far as I can tell, but the functions involved are all pretty tame, so the problem is easy for a gradient-based optimizer, such as the BFGS method available in Python's SciPy optimize package.

The cross entropy turns out to be $ \ln B(\alpha',\beta') - (\alpha'-1)[p \psi(\alpha+1)+(1-p)\psi(\alpha)] -(\beta'-1)[p\psi(\beta)+(1-p)\psi(\beta+1)]+(\alpha'+\beta'-2)\psi(\alpha+\beta+1) $

where $\alpha$, $\beta$ are the parameters of the prior distribution, $\alpha', \beta'$ are the parameters of the approximating beta distribution and $\psi$ is the digamma function.