
Here is a problem I'm working on, for which I have trouble finding the correct formulation of some information measures.

Problem: Suppose we have $N$ urns labelled $1,\ldots,N$. $C$ is a discrete probability distribution (the Choice distribution) such that $C(i)$ is the probability of putting a ball in urn $i$; likewise, $O$ is a discrete probability distribution (the Observer distribution) such that $O(i)$ is the probability of putting a ball in urn $i$. $C$ is the real distribution actually used to place the ball, whereas $O$ is what the observer believes was used.

A ball is secretly put in an urn. One then reveals successively the contents of urn $1$, then urn $2$, and so on.

Question: I'd like to define two information measures: the "surprisingness" for the observer of finding a ball in urn $i$, and the "predictive information gain" for the observer about the content of urn $i+1$, before and after the content of urn $i$ is revealed, knowing the contents of urns $i-1$, $i-2$, etc. How should such quantities be defined?

Where I am so far: I define $o_i$ as the probability of finding a ball in urn $i$, knowing that no ball was found in urns $1,\ldots,i-1$. Then,

$o_i=\frac{O(i)}{1-\sum_{k=1}^{i-1}O(k)}$

The surprisingness for the observer of finding the ball in urn $i$ would then just be:

$-\log(o_i)$
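
To make this concrete, here is a minimal sketch of these two quantities in Python (the distribution `O` is just a made-up example):

```python
import math

def conditional_probs(O):
    """o_i = O(i) / (1 - sum_{k<i} O(k)): the probability of finding the ball
    in urn i, given that urns 1..i-1 were revealed empty."""
    o, cum = [], 0.0
    for p in O:
        o.append(p / (1.0 - cum))  # assumes cum < 1, i.e. the ball may still lie ahead
        cum += p
    return o

def surprisal(p):
    """Self-information -log2(p) in bits; infinite for an impossible event."""
    return float('inf') if p == 0 else -math.log2(p)

O = [0.5, 0.3, 0.2]                # toy observer distribution over N = 3 urns
o = conditional_probs(O)           # [0.5, 0.6, 1.0]
print([surprisal(p) for p in o])   # surprise of finding the ball at stage i
```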

I define the predictive information gain as a Kullback-Leibler divergence:

$I(i)=A-B$

where

$A=(1-o_{i+1})\log[1-o_{i+1}]+o_{i+1}\log[o_{i+1}]$

and

$B=(1-o_{i+1})\log[(1-o_{i+1})(1-o_{i})+o_{i}]+o_{i+1}\log[o_{i+1}(1-o_{i})]$
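
Again as a rough numerical sketch (reusing `conditional_probs` from above; `xlogx` handles the $0\log 0=0$ convention, and I assume $0 < o_i < 1$ so the logarithms are defined):

```python
import math

def xlogx(p):
    """p * log2(p), with the convention 0 * log(0) = 0."""
    return 0.0 if p == 0 else p * math.log2(p)

def predictive_gain(o, i):
    """My candidate I(i) = A - B, for 1-indexed i and the list o of o_i values."""
    oi, oj = o[i - 1], o[i]            # o_i and o_{i+1}
    A = xlogx(1 - oj) + xlogx(oj)
    B = (1 - oj) * math.log2((1 - oj) * (1 - oi) + oi) \
        + oj * math.log2(oj * (1 - oi))
    return A - B

print(predictive_gain([0.5, 0.6, 1.0], 1))  # I(1) for the toy O above
```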

Is this correct?

  • I'm sorry if I still don't follow, but in this case isn't the comparison given by the $o_i$? Isn't it Bayes in disguise? (2012-02-25)

1 Answer


I have to admit I am still a little confused by your formulation of the problem. Let me make sure I am understanding you correctly. I will try to use a standard notation, so that others will be more likely to participate.

We have a random variable $X$, representing in which of $N$ urns a single ball is secretly placed. It has some discrete probability distribution, corresponding to your $C(i)$: $ P(X=i)=x_i\quad\text{for}\quad1\le i\le N. $

Then we have another random variable $Y$, representing where a ball might be found. It has another discrete probability distribution, corresponding to your $O(i)$:
$$P(Y=i)=y_i\quad\text{for}\quad 1\le i\le N.$$
Now you are calling this where an observer expects or (mentally?) places the ball. But this is a distribution, representing probabilities for the ball being in each possible, but just one particular, place. Philosophically, then, this is identical to the concept of a prior distribution in Bayesian inference.

It should be pretty obvious that the best prior to choose would be $P(X=i)$ itself, but I am assuming that the observer is not privy to this information: he or she does not know the values of the $x_i$, or how many distinct or nonzero values there are, or even the amount of information contained/encoded in $X$, called the Shannon or information entropy:
$$H(X) = -\sum x_i \log x_i = \sum x_i I(x_i).$$
Without this information, it should be pretty obvious that the best prior to choose is the uniform distribution, i.e. $y_i=\frac1N$, but the observer is free to choose any distribution. The other quantity referenced above,
$$I(x_i) = -\log x_i,$$
is called the self-information or surprisal/surprisingness associated with each particular outcome $X=i$ of the random variable $X$, and the information entropy $H(X)$ is the expected value of $I(X)$ (with respect to $X$, since the expectation is calculated using the $x_i$). Similarly, the observer's surprise upon learning the true location of the ball, relative to their prior expectations $y_i$, would be $I(y_i)=-\log y_i$.

But since we don't know with what strategy the $y_i$ have been selected, we don't know how sensible the observer's surprise really is. In particular, if the observer is dead sure that the ball is in the first (or the last) place, i.e. if $y_1=1$ (or $y_N=1$), then they are going to have infinite surprise should they discover otherwise. Likewise, we don't know whether the Kullback-Leibler divergences
$$D_{KL}(X\|Y)=\sum x_i\log\frac{x_i}{y_i} \quad\ne\quad D_{KL}(Y\|X)=\sum y_i\log\frac{y_i}{x_i}$$
even exist, because we don't know whether $X$ and $Y$ are "compatible" in the sense that the supports
$$S(X)=\{i \mid x_i > 0\},\quad S(Y)=\{i \mid y_i > 0\}$$
of their distributions are equal. In particular, $D_{KL}(X\|Y)$ exists iff $S(X) \subseteq S(Y)$: in words, the observer must at least admit the possibility (by assigning it a probability greater than zero) of each possible outcome of $X$. Let us assume $\mathbf{\forall i:\ x_i,y_i > 0}$.
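
To fix these definitions, here is a minimal numerical sketch in plain Python (the two distributions are arbitrary examples):

```python
import math

def entropy(p):
    """Shannon entropy H = -sum p_i log2 p_i in bits, skipping zero terms."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def kl_divergence(p, q):
    """D_KL(P||Q), which exists only if the support of P lies inside that of Q."""
    if any(pi > 0 and qi == 0 for pi, qi in zip(p, q)):
        return float('inf')            # incompatible supports: divergence is infinite
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

x = [0.7, 0.2, 0.1, 0.0]               # a "true" distribution with x_4 = 0
y = [0.25, 0.25, 0.25, 0.25]           # the observer's uniform prior
print(entropy(x), kl_divergence(x, y)) # finite, since S(X) is a subset of S(Y)
print(kl_divergence(y, x))             # infinite, since y_4 > 0 but x_4 = 0
```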

Now we turn our attention to the situation where the contents of the urns (all but one of which are empty) are revealed iteratively, so that at each stage $j$, for $j$ increasing from $1$ to $N$, the observer learns new information. Each time new information is made available, either one urn is ruled out or all the others are; once the ball is revealed, there is no more information left to learn. These new bits of information (literally bits, for base-$2$ logarithms, at least for independent observations) can be modeled with additional random variables $X_i$ taking the values $0$ or $1$ and indicating whether the ball was found in urn $i$, so that $P(X_i=1) = x_i$. We can also model the cumulative observation of having found the ball in some urn $i\le j$ by $Y_j=\sum_{i=1}^{j}X_i$, where $P(Y_j=1)=\sum_{i=1}^{j}P(X_i=1)$ since the events $X_i=1$ are mutually exclusive.
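
As a quick sanity check on this modelling, a few lines of Python can simulate one run of the reveal process (the distribution is again a toy example):

```python
import random

def reveal_process(x, rng=random):
    """Draw the ball's urn from the distribution x, then yield at each stage j
    the indicator X_j (ball found in urn j) and the cumulative Y_j."""
    ball = rng.choices(range(1, len(x) + 1), weights=x)[0]
    found = 0
    for j in range(1, len(x) + 1):
        X_j = 1 if j == ball else 0
        found |= X_j                   # Y_j = X_1 + ... + X_j (mutually exclusive)
        yield j, X_j, found

for j, X_j, Y_j in reveal_process([0.7, 0.2, 0.1]):
    print(f"stage {j}: X={X_j}, Y={Y_j}")
```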

In A First Course in Bayesian Statistical Methods, Peter Hoff (2009) states that

Mathematical results of Cox (1946, 1961) and Savage (1954, 1972) prove that if $p(\theta)$ and $p(y|\theta)$ represent a rational person's beliefs, then Bayes' rule is an optimal method of updating this person's beliefs about $\theta$ given new information $y$.

In the context of Bayesian inference, we would say that $X$ is a parameter (Hoff's $\theta$) with prior distribution $P(X=i)=p_X(i)=x_i$, and that the $X_i$ and $Y_i$ are samples. These samples are random variables, but exhibit what is called conditional dependence; that is, they are not i.i.d. (independent, identically distributed), because they reflect sampling without replacement, and as the experiment progresses the sampling eventually exhausts the full sample space. Within this framework, we can write
$$\eqalign{ & P(X_j=1|X=i) = \delta_{ij} = \left\{ \matrix{ 1 & i=j \cr 0 & i \ne j } \right. \cr & P(Y_j=1|X=i) = \sum_{k \le j} P(X_k=1|X=i) = \sum_{k \le j} \delta_{ik} = \chi(i \le j) = \left\{ \matrix{ 1 & i \le j \cr 0 & i \gt j } \right. \cr & P(Y_j=y|X=i) = y\oplus\chi(i \gt j) = \left\{ \matrix{ 1 &\quad i\le j \wedge y=1 ~\vee~ i\gt j \wedge y=0 \cr 0 &\quad i\gt j \wedge y=1 ~\vee~ i\le j \wedge y=0 } \right. }$$
where, for convenience, $\delta_{ij}$ is the Kronecker delta, $\chi$ denotes the characteristic function of a set or condition, and $\oplus=\text{XOR}$ is addition modulo $2$. These conditional probabilities of making certain observations, conditioned on the parameter $X=i$, are called sampling models.

We turn next to the posterior distribution of $X$ after having observed $Y_j=y$. By Bayes' rule, the posteriors are
$$P(X=i|Y_j=y) = \frac{P(Y_j=y|X=i)\cdot P(X=i)}{\sum_{k=1}^{N} P(Y_j=y|X=k)\cdot P(X=k)} = \frac{\left(\chi(i \gt j)\oplus y\right)\,x_i}{\sum_{k=1}^{N} \left(\chi(k \gt j)\oplus y\right)\,x_k} = x_i \psi_{ij}$$
for "update factors"
$$\psi_{ij}(y) = \frac{\chi(i \gt j)\oplus y}{(1-y)\,P(X\gt j)+y\,P(X\le j)} = \left\{ \matrix{ \frac{\chi(i \le j)}{P(X \le j)} & y=1 \cr \frac{\chi(i \gt j)}{P(X \gt j)} & y=0 } \right.$$
which, case by case, represent
$$\psi_{ij}(y) = \left\{ \matrix{ 0 & i\gt j \wedge y=1 & \text{tail cancellation (game over);}\cr 0 & i\le j \wedge y=0 & \text{iterative cancellation;} \cr P(X \gt j)^{-1} & i\gt j \wedge y=0 & \text{prospective renormalization;} \cr P(X \le j)^{-1} & i\le j \wedge y=1 & \text{retrospective renormalization.} } \right.$$
At each stage $j$, we iteratively replace the prior probabilities $x_i=P(X=i)$ with the posteriors $x_i\psi_{ij}$ given by the rule above. Note in particular that when we find the ball, the "retrospective renormalization" sets the posterior $P(X=j|Y_j=1)$ to $1$ but has no effect on $P(X=i|Y_j=1)$ for $i < j$, since these have already been "iteratively cancelled".
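
A compact Python sketch of this single-step update may make the four cases of $\psi_{ij}$ concrete (the prior is an arbitrary example):

```python
def posterior(x, j, y):
    """One application of Bayes' rule: P(X=i | Y_j = y) from the current prior x.
    y = 1 means the ball has been found in some urn <= j; y = 0 means not yet."""
    keep = (lambda i: i <= j) if y == 1 else (lambda i: i > j)
    mass = sum(p for i, p in enumerate(x, start=1) if keep(i))  # P(X<=j) or P(X>j)
    return [p / mass if keep(i) else 0.0                        # renormalize or cancel
            for i, p in enumerate(x, start=1)]

x = [0.7, 0.2, 0.1]
print(posterior(x, 1, 0))  # urn 1 empty:           [0.0, 2/3, 1/3]
print(posterior(x, 2, 1))  # ball found by stage 2: [7/9, 2/9, 0.0]
```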

Let us denote the posterior probabilities $P(X=i|Y_j)$ at stage $j$ by $x_{ij}$, where $j=0$ represents the priors. Then our rule becomes an algorithm whose iterative step is to calculate new posteriors $x_{ij}=x_{i,\,j-1}\cdot\psi_{ij}$ using the last formula above for $\psi_{ij}$, but with the reciprocal probabilities coming from the $(j-1)^\text{th}$ posteriors (equivalently, the $j^\text{th}$ priors) rather than the original priors. With this notation, and with $y_j$ indicating whether the ball has been found by stage $j$ (within an urn $i\le j$), we then have
$$x_{ij}(y_j) = \left\{ \begin{array}{cl} x_i & j=0 \cr 0 & j>0 \wedge \left(i\le j\wedge y_j=0~\vee~i\gt j\wedge y_j=1 \right)\cr {x_{i,\,j-1}\over\sum_{k\gt j}x_{k,\,j-1}} & j>0 \wedge i\gt j \wedge y_j=0\cr {x_{i,\,j-1}\over\sum_{k\le j}x_{k,\,j-1}} & j>0 \wedge i\le j \wedge y_j=1 \end{array} \right.$$
as an explicit stepwise/evolutionary algorithm, or as a product of the prior and the cumulative (stagewise) update factors,
$$x_{ij} = x_i \cdot \prod_{k\le j} \psi_{ik}(y_k).$$
Alternatively, conditioning $Y_j$ on its predecessor, in terms of the original priors, we can write
$$\eqalign{ P(Y_j=y_j|Y_{j-1}=1) &= y_j \cr P(Y_j= 1 |Y_{j-1}=0) &= \frac{ x_j}{\sum_{k\ge j}x_k} \cr P(Y_j= 0 |Y_{j-1}=0) &= \frac{\sum_{k\gt j}x_k}{\sum_{k\ge j}x_k} }$$
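
The stepwise algorithm then just iterates the single-step update, reusing the `posterior` function from the previous sketch (again only a sketch, with the ball's true location passed in explicitly):

```python
def posterior_trace(x, ball):
    """Column j of the returned list holds the posteriors x_{ij}; column 0 is
    the prior. At stage j we condition on y_j = 1 iff the ball is in urn <= j."""
    trace = [list(x)]
    for j in range(1, len(x) + 1):
        y_j = 1 if ball <= j else 0
        trace.append(posterior(trace[-1], j, y_j))
    return trace

for j, col in enumerate(posterior_trace([0.7, 0.2, 0.1], ball=3)):
    print(j, [round(p, 3) for p in col])
```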

In particular, please note that the middle line above corresponds to your $o_i$, which reflects incorporation of the cumulative knowledge that $Y_{i-1}=0$ (i.e. $X_1$ through $X_{i-1}$ were all zero, not just the last!). This is therefore also consistent with Bayesian updating as modeled above by the sequence $\{Y_j\}$ of random variables, and it extends naturally from the single probability $x_{i,i-1}$ to the full $j^\text{th}$ posterior distribution $x_{ij}$, for $j=i-1$ and all $i$, by cancelling for $i\le j$ and renormalizing for $i\gt j$.

Now, we can turn back to an information-theoretic perspective and evaluate the surprise and information gain.

I should perhaps show that if we start with a uniform prior, it has the maximum possible entropy of $\log_2 N$ bits (with base-two logarithms), relative to which each update is informative: revealing urn $j$ empty decreases the remaining posterior entropy from $\log(N-j+1)$ to $\log(N-j)$ bits, while revealing the ball removes all of the remaining $\log(N-j+1)$ bits at once. It would be helpful to trace the evolution of the posteriors $x_{ij}$ as a matrix, below, with the first column ($j=0$) being the priors, and the first "identity/unit vector" column, at $j=x$, representing the ball being found in urn $x$ at stage $j=x$. Up until the ball is found, we truncate the sample space and renormalize the uniform distribution over the remaining possible locations. The entropy/uncertainty/uniformity of column/posterior $j$ is thus $\log(N-j)$ for $j\lt x$ and $0$ for $j\ge x$, depending on the value of the parameter $x$, the ball's "true" location. For example, with $x=4$:
$$(x_{ij})= \left[ \begin{matrix} \tfrac1N & 0 & 0 & 0 & 0 &\cdots\cr \tfrac1N & \tfrac1{N-1} & 0 & 0 & 0 &\cdots\cr \tfrac1N & \tfrac1{N-1} & \tfrac1{N-2} & 0 & 0 &\cdots\cr \tfrac1N & \tfrac1{N-1} & \tfrac1{N-2} & \tfrac1{N-3} & 1 &\cdots\cr \vdots & \vdots & \vdots & \vdots & \vdots &\cdots\cr \tfrac1N & \tfrac1{N-1} & \tfrac1{N-2} & \tfrac1{N-3} & 0 &\cdots\cr \end{matrix} \right]$$
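
A minimal check of this entropy bookkeeping, reusing `posterior_trace` and `entropy` from the sketches above (the values $N=8$ and $x=5$ are arbitrary):

```python
N, ball = 8, 5
uniform = [1.0 / N] * N                # prior entropy log2(8) = 3 bits
for j, col in enumerate(posterior_trace(uniform, ball)):
    print(j, round(entropy(col), 4))   # log2(N - j) while j < ball, then 0.0
```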

I see now what you are doing, I think. Your surprise, $I(o_i)=-\log o_i$, is the self-information of the single probability $o_i=P(X_i=1|Y_{i-1}=0)$ from the $j=(i-1)^\text{th}$ posterior. Your $A$ is the information content (negative entropy) of a "collapsed" probability distribution obtained by "projecting" the (arbitrary) $(j+1)^\text{th}$ posterior $x_{i,j+1}$ onto a Bernoulli distribution with the two outcomes $X=i+1$ and $X\ne i+1$. That is, define a Bernoulli random variable $Z_i=X_i|Y_{i-1}=0$, conditioned on not yet having discovered the ball after stage $i-1$, set $o_i=P(Z_i=1)$, and calculate this variable's information at the next(!) stage: $A=-H[Z_{i+1}]$. Finally, define another "autocorrelated" random variable $W_i=Z_i|Z_{i-1}$, i.e. $Z_i$ conditioned on its predecessor, and take its entropy relative to the corresponding time evolution of $Z$: $B=-H_{Z_{i+1}}[W_{i+1}]$. I'm not sure how useful this would be, or whether it is really what you want; I'll try to check that and elaborate next time. My gut feeling is that it is not "informative" enough to be of much use.
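
For what it's worth, this reading can be checked numerically: with the toy values below, $A$ computed as the negative Bernoulli entropy $-H[Z_{i+1}]$ agrees with your formula for $A$, and $B$ is evaluated exactly as you wrote it (a sketch only):

```python
import math

def xlogx(p):                          # p * log2(p), with 0 * log(0) = 0
    return 0.0 if p == 0 else p * math.log2(p)

o_i, o_next = 0.5, 0.6                 # toy values of o_i and o_{i+1}
A = xlogx(o_next) + xlogx(1 - o_next)  # = -H[Z_{i+1}], Z_{i+1} ~ Bernoulli(o_{i+1})
B = (1 - o_next) * math.log2((1 - o_next) * (1 - o_i) + o_i) \
    + o_next * math.log2(o_next * (1 - o_i))
print(A, B, A - B)                     # your I(i) = A - B
```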

Until then, please read sections 1.1 & 1.2 of Hoff from the link above, and pages 2-4 of Ebrahimi, Maasoumi & Soofi (1999), "Measuring Informativeness by Entropy and Variance", if in need of a bit more intuition.

  • I sure will... I will take some time to prepare a nice PDF. (2012-03-02)