In a density estimation problem, I fit a model, $\mathcal{M}$, based on my training data set $X_1$ (e.g. by using maximum likelihood), and I'd like to compare its performance on $X_1$ to its performance on a test set $X_2$, by comparing $p(X_1|\mathcal{M})$ and $p(X_2|\mathcal{M})$.

However, I can't compare them directly, since the two sets differ in size and the larger set usually receives a smaller probability. How should I properly normalize by the number of data points?

(Per-data-point log likelihood seems to be commonly reported; see the last section of https://papers.nips.cc/paper/1191-regression-with-input-dependent-noise-a-bayesian-treatment.pdf for an example.)
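
To make the per-data-point idea concrete, here is a minimal sketch of what I have in mind. It assumes the points are i.i.d. under the model, so that $\log p(X|\mathcal{M}) = \sum_i \log p(x_i|\mathcal{M})$ and dividing by the number of points $N$ gives an average log likelihood. The 1-D Gaussian model, the synthetic data, and the helper names (`fit_gaussian_ml`, `log_likelihood`, `mean_log_likelihood`) are illustrative assumptions only, not my actual model.

```python
import math
import random


def fit_gaussian_ml(xs):
    """Maximum-likelihood mean and variance of a 1-D Gaussian (illustrative model)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # ML estimate divides by n, not n - 1
    return mu, var


def log_likelihood(xs, mu, var):
    """Total log likelihood, log p(X | M), assuming i.i.d. points."""
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        for x in xs
    )


def mean_log_likelihood(xs, mu, var):
    """Per-data-point log likelihood: (1 / N) * log p(X | M)."""
    return log_likelihood(xs, mu, var) / len(xs)


# Synthetic stand-ins for X_1 (training) and X_2 (test); sizes differ on purpose.
rng = random.Random(0)
X1 = [rng.gauss(0.0, 1.0) for _ in range(200)]
X2 = [rng.gauss(0.0, 1.0) for _ in range(1000)]

mu, var = fit_gaussian_ml(X1)  # the model is fit on X_1 only

total_1 = log_likelihood(X1, mu, var)
total_2 = log_likelihood(X2, mu, var)
mean_1 = mean_log_likelihood(X1, mu, var)
mean_2 = mean_log_likelihood(X2, mu, var)

print("total log likelihood:", total_1, total_2)  # scales with the set size
print("per-point log likelihood:", mean_1, mean_2)  # comparable across sizes
```

The totals differ mostly because of the set sizes, while the per-point averages are on the same scale. Whether this averaging is the "proper" normalization is exactly what I am asking.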

  • 0
    A week with no answer. I wonder whether this would get better attention at our sister site 'stat.stackexchange', especially because the question has a cross-validation, machine-learning flavor. // But I'm curious whether you re-estimate the parameters of $\mathcal{M}$ with the new data $X_2$ before the attempted comparison. (2017-01-28)
  • 0
    Thanks for the suggestion; I'll try my luck on the stats site. No, I use the same model $\mathcal{M}$ for $X_2$, which I assume is generated by the same underlying distribution as $X_1$. (2017-01-28)

0 Answers 0