
I understand that pointwise mutual information (pMI) compares the probability of a sequence of dependent events with the probability that sequence would have if its elements were independent events.

The definition of mutual information given in Church & Hanks (1990) and based on Fano (1961) is the following:

"[...] if two points (words), x and y, have probabilities P(x) and P(y), then their mutual information, $I(x,y)$, is defined to be:

$$I(x,y) = \log_2\frac{P(x,y)}{P(x)P(y)}$$

Informally, mutual information compares the probability of observing $x$ and $y$ together (the joint probability) with the probabilities of observing $x$ and $y$ independently (chance). If there is a genuine association between $x$ and $y$, then the joint probability $P(x,y)$ will be much larger than chance $P(x)P(y)$, and consequently $I(x,y) \gg 0$. If there is no interesting relationship between $x$ and $y$, then $P(x,y) \approx P(x)P(y)$, and thus, $I(x,y) \approx 0$. If $x$ and $y$ are in complementary distribution, then $P(x,y)$ will be much less than $P(x)P(y)$, forcing $I(x,y) \ll 0$."
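To make the ratio concrete, here is a minimal sketch in Python computing pMI from corpus counts; all counts below are invented for illustration only:

```python
from math import log2

# Hypothetical counts, assumed for illustration only.
N = 1000          # total number of observation windows in the corpus
count_x = 50      # occurrences of word x
count_y = 40      # occurrences of word y
count_xy = 20     # co-occurrences of x and y

p_x, p_y, p_xy = count_x / N, count_y / N, count_xy / N

# I(x, y) = log2( P(x, y) / (P(x) P(y)) )
pmi = log2(p_xy / (p_x * p_y))
print(pmi)  # log2(0.02 / 0.002) = log2(10), a strong positive association
```

Here $P(x,y) = 0.02$ is ten times the chance value $P(x)P(y) = 0.002$, so $I(x,y) = \log_2 10 \approx 3.32 \gg 0$, as the quoted passage describes.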

Pointwise mutual information is typically applied to pairs of events, but it has sometimes been extended to longer sequences. For example, the pointwise mutual information of a sequence ABCD would be:

$$I(ABCD) = \log_2\frac{P(ABCD)}{P(A)P(B)P(C)P(D)}$$

I would like to condition the probability of a sequence not only on its atomic elements but also on all of its contiguous subsequences. In the context of the above example, this would be:

$$I(ABCD) = \log_2\frac{P(ABCD)}{P(A)P(B)P(C)P(D)P(AB)P(BC)P(CD)P(ABC)P(BCD)}$$
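As a sketch of how this quantity generalizes to sequences of arbitrary length, the following Python function divides $P(\text{seq})$ by the product of $P(s)$ over every contiguous proper subsequence $s$ (lengths $1$ through $N-1$). The `prob` callable is a placeholder assumption standing in for whatever probability estimates one has:

```python
from math import log2

def extended_pmi(seq, prob):
    """log2( P(seq) / product of P(s) over all contiguous proper
    subsequences s of seq, of lengths 1 .. len(seq)-1 ).
    `prob` maps a tuple of elements to its probability (assumed given)."""
    n = len(seq)
    denom = 1.0
    for length in range(1, n):                 # one pass per "row": lengths 1 .. N-1
        for start in range(n - length + 1):    # all start positions for this length
            denom *= prob(tuple(seq[start:start + length]))
    return log2(prob(tuple(seq)) / denom)

# For "ABCD" the denominator has 4 + 3 + 2 = 9 factors, matching
# P(A)P(B)P(C)P(D) P(AB)P(BC)P(CD) P(ABC)P(BCD) above.
print(extended_pmi("ABCD", lambda s: 0.5))  # toy probability: 0.5 for everything → 8.0
```

For a sequence of length 2 the inner loop contributes only $P(A)P(B)$, so the function reduces to ordinary pMI in that case.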

Does this also fall under mutual information? If not, does this way of conditioning the probability of a set of dependent events have an explicit mathematical treatment in probability theory and/or information theory? Is there a way to express this way of conditioning events formally, in a general fashion applicable to sequences of events of arbitrary length? I tried formalizing it this way (see image below), but it seems very clunky and, more importantly, I don't know where to encode the information that there will be $N-1$ rows in the denominator:

[image: mutual-info-extended]
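For what it is worth, one compact way to write a denominator with exactly $N-1$ "rows" (one per subsequence length) is a double product over lengths and start positions; this is only a guess at a possible formalization, not a reconstruction of the image:

$$I(w_1 \dots w_N) = \log_2 \frac{P(w_1 \dots w_N)}{\prod_{\ell=1}^{N-1} \prod_{i=1}^{N-\ell+1} P(w_i \dots w_{i+\ell-1})}$$

The outer product over $\ell = 1, \dots, N-1$ encodes the $N-1$ rows explicitly, and the inner product ranges over the $N-\ell+1$ contiguous subsequences of length $\ell$.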

References:

Church, K. W. and P. Hanks (1990). Word association norms, mutual information and lexicography. Computational Linguistics 16(1), 22-29.

Fano, R. (1961). Transmission of Information: A Statistical Theory of Communications. Cambridge, MA: MIT Press.

  • I don't understand your first definition of "pointwise mutual information" (could you give a reference or link)? For one thing, it seems to give 1 if the events are independent. Your "extended" definition makes even less sense to me - consider what happens when the events are independent... (2017-01-15)
  • Thank you for pointing this out. I included an intuitive description of what mutual information is supposed to do and added two important references below. (2017-01-15)
