I'm reading an introduction to Bayesian updating which includes the following example (by the way, I will add my reasoning at some points so you guys can tell me whether I'm correct or not):
Let $\mu_{m}$ be the model that asserts $P(\text{head}) = m$. Let $s$ be a particular sequence of observations yielding $i$ heads and $j$ tails. Then, for any $m$, $0\leq m \leq 1$:
$P(s|\mu_{m}) = m^{i}(1-m)^{j} \tag{*}$
Good enough. It's a situation with two possible outcomes (heads or tails), so a distribution like that is sensible. Basically, I'm asking for the probability of observing a given sequence of heads and tails ($s$) under a model with a particular $m$. The example goes on to describe the difference between a frequentist approach and a Bayesian approach and says:
But now suppose that one wants to quantify one's belief that the coin is probably a regular, fair one. One can do that by assuming a prior probability distribution over how likely it is that the different models $\mu_{m}$ are true.
Because of some nice properties, the author chooses the following distribution:
$P(\mu_{m})=6m(1-m)$
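One thing I tried to verify myself (please tell me if I'm misreading the "nice properties"): this prior actually integrates to one, so it is a genuine probability density over the models, and it is symmetric with its peak at $m=1/2$, which matches the "probably fair" belief:

$$\int_{0}^{1}6m(1-m)\,dm = 6\left(\frac{1}{2}-\frac{1}{3}\right) = 1.$$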
So, if I want to know the updated probability of the coin being fair or unfair (which is equivalent to asking the probability of a given model $\mu_{m}$ being true), then I have to use Bayes' theorem:
$P(\mu_{m}|s) = \frac{P(s|\mu_{m})P(\mu_{m})}{P(s)} \tag{1}$
Good, we plug our prior probability $P(\mu_{m})$ and our value for $P(s|\mu_{m})$ into $(1)$, which gives $\displaystyle \frac{6m^{i+1}(1-m)^{j+1}}{P(s)}$. We don't need the denominator to find the value of $m$ for which $(1)$ is maximized, i.e. the value of $m$ whose model is most likely given the data $s$. It turns out that, in this case, that value is $\frac{3}{4}$ (assuming 8 heads and 2 tails).
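My reasoning for the $\frac{3}{4}$, in case I'm doing it wrong: with $i=8$ and $j=2$ the numerator is $6m^{9}(1-m)^{3}$, and setting its derivative to zero gives

$$\frac{d}{dm}\,m^{9}(1-m)^{3} = m^{8}(1-m)^{2}\bigl(9(1-m)-3m\bigr) = 0 \;\Rightarrow\; m=\frac{9}{12}=\frac{3}{4},$$

or, in general, $m=\frac{i+1}{i+j+2}$.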
Now, we never calculated $P(s)$ but the author says that this is a marginal probability, so we need to do this:
$P(s) = \int_{0}^{1}P(s|\mu_{m})P(\mu_{m})dm \tag{2}$
I sort of understand why we do this. We want the prior probability of $s$, but we can't ask for the probability of getting $s$ (a given sequence of heads and tails) without any assumptions. However, assuming a given model as in $(*)$, we can ask that for some $m$, and of course we then have to consider every $m$, not just one. That it ends up being a weighted sum seems reasonable, but I don't have a good explanation for it. Obviously that's the way marginal probabilities work, but it's still a bit uncomfortable just to accept that's the way it is; I'd like to know why. Another question: is the denominator in Bayes' theorem always a marginal (prior) probability?
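(The closest I've come to an intuition: if only finitely many models $\mu_{m_{1}},\dots,\mu_{m_{K}}$ were possible, the law of total probability would give

$$P(s)=\sum_{k=1}^{K}P(s|\mu_{m_{k}})P(\mu_{m_{k}}),$$

and $(2)$ looks like the continuous version of that sum, with the prior supplying the weights. I'd appreciate confirmation that this is the right way to read it.)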
The solution to $(2)$ is $P(s)=\displaystyle \frac{6(i+1)!(j+1)!}{(i+j+3)!}$.
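To make sure I didn't botch the algebra, I also checked this numerically; here's my quick sketch (Python with scipy, using $i=8$, $j=2$ as above):

```python
from math import factorial
from scipy.integrate import quad

i, j = 8, 2  # 8 heads, 2 tails, as in the example

# integrand of (2): P(s|mu_m) * P(mu_m) = m^i (1-m)^j * 6m(1-m)
integrand = lambda m: m**i * (1 - m)**j * 6 * m * (1 - m)

numeric, _ = quad(integrand, 0, 1)  # equation (2) by numerical quadrature
closed_form = 6 * factorial(i + 1) * factorial(j + 1) / factorial(i + j + 3)

print(numeric, closed_form)  # these should agree (about 0.0021)
```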
Do I need to use that $P(s)$ in $(1)$ to get the probability of a particular model $\mu_{m}$ being true given a sequence of heads/tails $s$? And if I choose to use $m=1/2$, am I disregarding the maximizing value $m=3/4$ in favor of the assumption that the coin is fair?
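For concreteness, this is the comparison I have in mind (my own sketch, same $i=8$, $j=2$):

```python
from math import factorial

i, j = 8, 2  # 8 heads, 2 tails

def prior(m):
    return 6 * m * (1 - m)          # P(mu_m)

def likelihood(m):
    return m**i * (1 - m)**j        # P(s|mu_m), equation (*)

# marginal P(s) from the closed form of (2)
p_s = 6 * factorial(i + 1) * factorial(j + 1) / factorial(i + j + 3)

def posterior(m):
    return likelihood(m) * prior(m) / p_s   # equation (1)

print(posterior(3 / 4))  # posterior density at the maximizing value
print(posterior(1 / 2))  # posterior density at the "fair coin" value
```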
Thanks a lot