
Let $\mathcal{D}=\{ (y_1,x_1), ..., (y_N,x_N) \}$ be i.i.d. training data. Let $\widetilde P(Y,X)$ be the empirical distribution. The goal is to maximize the likelihood $P(\mathcal{D}\mid\Theta)$.

I am having difficulties with this line: $P(\mathcal{D}|\Theta) = \prod_{y,x} P(Y=y|\;X=x; \Theta)^{\widetilde P(Y=y,X=x)}$

Specifically, it is the exponent ${\widetilde P(Y=y,X=x)}$ which I do not understand why it is there.

Please help me understand why this is so or point me to some reference which could help me understand it.

One line of thought I had is that perhaps the product runs over all possible $(y,x)$ (not only those in the training sample); then I would understand if it were

$P(\mathcal{D}|\Theta) = \prod_{y,x} P(Y=y|\;X=x; \Theta)^{\widetilde P(Y=y,X=x)|\mathcal{D}|}$

That would work if each $(y,x)$ occurred in $\mathcal{D}$, possibly more than once. But if some $(y,x)$ does not occur in $\mathcal{D}$, its exponent would be $0$, and yet the product would still include a factor for that event despite its absence from the training data. Needless to say, I am lost; it is probably due to some fundamental misunderstanding on my side. I am thankful for any help.
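For concreteness, here is a minimal numeric sketch (the toy sample and model probabilities are assumptions, not from the question) checking that the likelihood written as a product over training instances equals the product over distinct $(y,x)$ pairs with exponent $\widetilde P(y,x)\,|\mathcal{D}|$:

```python
import math
from collections import Counter

# Hypothetical small training sample of (y, x) pairs, with a repeat.
data = [(1, 'a'), (1, 'a'), (0, 'b'), (1, 'b')]
N = len(data)

# Hypothetical conditional model P(Y=y | X=x); values are arbitrary,
# but for each x the probabilities over y sum to 1.
P = {(1, 'a'): 0.7, (0, 'a'): 0.3,
     (1, 'b'): 0.4, (0, 'b'): 0.6}

# Likelihood as a product over the N training instances.
lik_instances = math.prod(P[pair] for pair in data)

# Likelihood as a product over distinct (y, x) pairs, where the
# exponent is count(y, x) = empirical probability * N.
counts = Counter(data)
lik_pairs = math.prod(P[pair] ** ((c / N) * N) for pair, c in counts.items())

print(lik_instances, lik_pairs)  # the two products agree
```

Pairs absent from the sample contribute an exponent of $0$, i.e. a factor of $1$, so including them changes nothing.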


1 Answer


If the exponent is $0$, that factor in the product is $1$, so it makes no difference whether the product runs over all $(x,y)$ or only those in the training set. Regarding the factor of $|\mathcal D|$ in the exponent, I agree that it should be there. However, note that it doesn't change the maximization problem: you can take it outside the product, where it becomes a constant factor multiplying the log likelihood, which leaves the maximizer unchanged.
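To make the constant-factor point concrete, here is a small sketch (a toy Bernoulli likelihood, assumed for illustration): scaling the log likelihood by $|\mathcal D|$ does not move its maximizer.

```python
import math

# Toy log likelihood for a Bernoulli parameter theta,
# with 3 successes and 1 failure observed.
def loglik(theta):
    return 3 * math.log(theta) + math.log(1 - theta)

# Maximize over a grid, once plain and once scaled by |D| = 4.
grid = [i / 1000 for i in range(1, 1000)]
best_plain = max(grid, key=loglik)
best_scaled = max(grid, key=lambda t: 4 * loglik(t))

print(best_plain, best_scaled)  # both 0.75
```

The analytic maximizer is $\theta = 3/4$, and the scaled objective finds the same point.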