In a density estimation problem, I fit a model, $\mathcal{M}$, based on my training data set $X_1$ (e.g. by using maximum likelihood), and I'd like to compare its performance on $X_1$ to its performance on a test set $X_2$, by comparing $p(X_1|\mathcal{M})$ and $p(X_2|\mathcal{M})$.
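However, I can't compare them directly, because the two sets don't have the same size. If the data points are treated as independent, the likelihood is a product of one term per point, so the "larger" set usually receives a smaller probability. How should I properly normalize by the number of data points?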
(The per-data-point log likelihood seems to be commonly reported; see the last section of https://papers.nips.cc/paper/1191-regression-with-input-dependent-noise-a-bayesian-treatment.pdf for an example.)
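
To make concrete what I mean by per-data-point log likelihood, here is a minimal sketch. It rests on assumptions that are not part of my actual problem: the points are i.i.d., $\mathcal{M}$ is a 1-D Gaussian fit by maximum likelihood, and the data are synthetic. The function names (`fit_gaussian_ml`, `mean_log_likelihood`, and so on) are made up for illustration.

```python
import math
import random


def gaussian_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def fit_gaussian_ml(xs):
    """Maximum-likelihood mean and standard deviation (divides by n, not n - 1)."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, math.sqrt(var)


def total_log_likelihood(xs, mu, sigma):
    """log p(X | M) under the i.i.d. assumption: a sum of per-point terms."""
    return sum(gaussian_logpdf(x, mu, sigma) for x in xs)


def mean_log_likelihood(xs, mu, sigma):
    """Total log likelihood divided by the number of points."""
    return total_log_likelihood(xs, mu, sigma) / len(xs)


# Synthetic stand-ins for the training set X1 and the (larger) test set X2.
random.seed(0)
X1 = [random.gauss(0.0, 1.0) for _ in range(200)]
X2 = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Fit the model on the training set only.
mu_hat, sigma_hat = fit_gaussian_ml(X1)

# The totals scale with set size, so they are not directly comparable.
print("total log-lik, train:", total_log_likelihood(X1, mu_hat, sigma_hat))
print("total log-lik, test: ", total_log_likelihood(X2, mu_hat, sigma_hat))

# The per-point averages are on the same scale regardless of set size.
print("mean log-lik, train: ", mean_log_likelihood(X1, mu_hat, sigma_hat))
print("mean log-lik, test:  ", mean_log_likelihood(X2, mu_hat, sigma_hat))
```

Exponentiating the mean log likelihood gives the geometric mean of the per-point densities, $p(X|\mathcal{M})^{1/N}$, so this normalization amounts to taking the $N$-th root of the likelihood.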