Here's a problem that I have pondered over many times without ever coming to a satisfactory solution:
Let's say that we have a series of random events: V(i) for i = 1 to n. Each of these events has a result VR(i) of either 0 or 1, and a probability VP(i) (between 0 and 1) that VR(i) = 1.
We also have a group of k estimators, E(j) for j = 1 to k, each of which generates EP(j,i), its estimates of VP(i).
What I want is an evaluation function F(j) that, given EP(j,i) for i = 1 to n (that is, E(j)'s estimates of VP(i)) and VR(i) for i = 1 to n (i.e., the actual results), returns a number that is a "good" valuation of E(j)'s ability to estimate VP(i).
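To pin down the notation, here is a minimal sketch of the setup as toy data. The function name `simulate`, the uniform distribution for VP(i), and the "truth plus noise" estimators are all assumptions made purely for illustration; nothing in the problem says the real events or estimators look like this. Indices run from 0 here rather than from 1.

```python
import random

def simulate(n, k, seed=0):
    """Build a toy instance of the setup (hypothetical helper, illustration only)."""
    rng = random.Random(seed)
    # VP[i]: true probability that event i results in 1 (hidden from F).
    # Drawing it uniformly is an arbitrary choice for this sketch.
    VP = [rng.random() for _ in range(n)]
    # VR[i]: the observed result of event i, equal to 1 with probability VP[i].
    VR = [1 if rng.random() < VP[i] else 0 for i in range(n)]
    # EP[j][i]: estimator j's estimate of VP[i]. Here: the truth plus Gaussian
    # noise that grows with j, clipped to [0, 1]. Estimator 0 is exact.
    EP = [
        [min(1.0, max(0.0, VP[i] + rng.gauss(0, 0.05 * j))) for i in range(n)]
        for j in range(k)
    ]
    return VP, VR, EP

VP, VR, EP = simulate(n=10, k=3)
# The evaluation function F would be given only EP[j] and VR, never VP.
```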
Some Notes:
1. The results, VR(i), are known to the evaluation function but (obviously) not to the estimators, either before or after each result (so the estimators cannot use the VR(i)'s to adjust their subsequent predictions, if that matters).
2. The probabilities, VP(i), are not known to the evaluation function.
3. The distribution of the probabilities VP(i) is not known to the evaluation function.
4. EP(j,i) is supposed to be an estimate of the event probabilities, VP(i), and not an estimate of the results, VR(i). Virtually every valuation system that I have seen tends to weight each EP(j,i) solely on its closeness to VR(i), which invariably rewards "polarized" estimators: those that always return 1 if VP(i) > 0.5 and 0 if VP(i) < 0.5.
5. One hallmark of the problem in #4 is that if VP(i) = 0.5, then these types of valuation systems will tend to reward all EP(j,i)'s the same, or else give the highest reward to EP's of 0 or 1 and the worst valuation to the actual correct estimate, 0.5. What I would like, of course, is just the opposite: for estimates of 0 or 1 to receive the worst valuation when the actual probability is 0.5, and for the correct estimate of 0.5 to be rated the highest.
6. This, I think, is also the essence of the difficulty in this problem: how do I correctly value the probability estimates (and estimators) if I only know the event results, but not the actual probabilities being estimated?
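To make the polarization problem from the notes concrete, here is a small sketch of one "closeness to VR(i)" score: the absolute error |EP - VR|. This particular score is my choice for illustration, and `expected_abs_error` is a hypothetical helper; other closeness measures may behave differently. It computes the expected error exactly for a single event with true probability p, rather than simulating.

```python
def expected_abs_error(p, e):
    """Expected |e - VR| when VR = 1 with probability p and the estimate is e."""
    # With probability p the result is 1 (error |e - 1|);
    # with probability 1 - p the result is 0 (error |e - 0|).
    return p * abs(e - 1) + (1 - p) * abs(e - 0)

for p in (0.5, 0.7):
    for e in (0.0, 0.5, 0.7, 1.0):
        print(f"VP = {p}, EP = {e}: expected error = {expected_abs_error(p, e):.3f}")
```

Under this score, every estimate gets the same expected error of 0.5 when VP = 0.5, and when VP = 0.7 the polarized estimate of 1 scores better (0.3) than the correct estimate of 0.7 (0.42), which is exactly the behavior I want to avoid.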