Let $\mathcal{D}=\{ (y_1,x_1), \ldots, (y_N,x_N) \}$ be i.i.d. training data, and let $\widetilde P(Y,X)$ be the empirical distribution of $\mathcal{D}$. The goal is to maximize the likelihood $P(\mathcal{D}|\;\Theta)$.
I am having difficulties with this line: $P(\mathcal{D}|\Theta) = \prod_{y,x} P(Y=y|\;X=x; \Theta)^{\widetilde P(Y=y,X=x)}$
Specifically, I do not understand why the exponent ${\widetilde P(Y=y,X=x)}$ is there.
Please help me understand why this is so or point me to some reference which could help me understand it.
One line of thought I had is that perhaps the product runs over all possible $(y,x)$ (not only those in the training sample); in that case I would understand it if it were
$P(\mathcal{D}|\Theta) = \prod_{y,x} P(Y=y|\;X=x; \Theta)^{\widetilde P(Y=y,X=x)|\mathcal{D}|}$
That would work if each $(y,x)$ occurred in $\mathcal{D}$ and some occurred more than once. But then again, if not every $(y,x)$ is in $\mathcal{D}$, then for those the exponent would be 0, and thus the product would include the probability of that event despite it not being in the training data. Needless to say, I am lost. It is probably due to some (fundamental) misunderstanding on my side. Thankful for any help.
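
To make the two candidate formulas concrete, here is a small numeric sketch. Everything in it is made up for illustration: the toy sample `data`, the lookup table `model` standing in for $P(Y=y|\;X=x;\Theta)$, and the variable names. It compares the plain product over the $N$ samples with the product over all possible $(y,x)$ using the exponent $\widetilde P(Y=y,X=x)\,|\mathcal{D}|$, and then with the exponent $\widetilde P(Y=y,X=x)$ alone.

```python
import math
from collections import Counter

# Hypothetical toy sample of (y, x) pairs; the pair (1, 0) is deliberately never observed.
data = [(0, 0), (0, 0), (1, 1), (0, 1), (1, 1), (0, 0)]
N = len(data)

# Hypothetical conditional model P(Y=y | X=x; Theta), stored as a lookup table.
# For each x the entries over y sum to 1.
model = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.4, (1, 1): 0.6}

# Likelihood as a product over the N i.i.d. samples.
lik_samples = math.prod(model[pair] for pair in data)

# Empirical distribution: relative frequency of each possible pair (0 if unseen).
counts = Counter(data)
p_tilde = {pair: counts[pair] / N for pair in model}

# Product over ALL possible pairs with exponent P~(y, x) * N, i.e. the count of the pair.
lik_pairs = math.prod(model[pair] ** (p_tilde[pair] * N) for pair in model)

# Product over all possible pairs with exponent P~(y, x) only.
lik_root = math.prod(model[pair] ** p_tilde[pair] for pair in model)

# Factor contributed by the unseen pair: anything to the power 0 is 1.
unseen_factor = model[(1, 0)] ** (p_tilde[(1, 0)] * N)

print(lik_samples, lik_pairs)          # the two products agree up to rounding
print(lik_root, lik_pairs ** (1 / N))  # exponent P~ alone gives the N-th root
print(unseen_factor)
```

In this toy case the first two numbers agree, and the factor for the unseen pair is exactly 1, so it does not change the product. The version with exponent $\widetilde P$ alone comes out as the $N$-th root of the likelihood; since taking the $N$-th root is monotone, it should have the same maximizer in $\Theta$, though that is my reading rather than something the source states.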
