
I need to calculate the cluster entropy to represent the distinctness of a phrase in a set of documents. I understand the idea behind it, but I don't understand the equation itself, in particular the order in which it should be calculated. Could someone please explain what it means? Here is the definition:

For a given phrase $w$, the corresponding document set $D(w)$ might overlap with other sets $D(w_i)$, where $w_i \neq w$. At one extreme, if $D(w)$ is evenly distributed across the $D(w_i)$, then $w$ may be too general a phrase to be a good salient phrase. At the other extreme, if $D(w)$ seldom overlaps with any $D(w_i)$, then $w$ may have some distinct meaning. Take the query "jaguar" as an example: "big cats" seldom co-occurs with other salient keywords such as "car", "mac os", etc., so the corresponding documents may constitute a distinct topic. However, "clubs" is a more general keyword that may co-occur with both "car" and "mac os"; thus it should have a lower salience score. We use Cluster Entropy (CE) to represent the distinctness of a phrase:

$$CE = - \sum_t\frac{|D(w)\cap D(t)|}{|D(w)|} \log \frac{|D(w)\cap D(t)|}{|D(w)|}$$

1 Answer

It is very simple. Here

$$CE = - \sum_t\frac{|D(w)\cap D(t)|}{|D(w)|} \log \frac{|D(w)\cap D(t)|}{|D(w)|}$$

$|D(w)\cap D(t)|$ is the number of documents that contain both the phrase $w$ and another phrase $t$. If you divide it by $|D(w)|$, the total number of documents containing $w$, you get an estimated overlap probability $p_t$ for each $t$.

Plugging these into the definition of entropy then gives $$ H = -\sum_t p_t \log p_t $$
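To make the computation order concrete, here is a minimal sketch in Python. The phrases and document sets are hypothetical toy data chosen to mirror the "jaguar" example in the question; the natural logarithm is assumed since the definition does not specify a base.

```python
import math

def cluster_entropy(doc_sets, w):
    """Cluster Entropy of phrase w.

    doc_sets maps each phrase to the set of document ids containing it.
    For every other phrase t, p = |D(w) ∩ D(t)| / |D(w)| is the overlap
    probability; CE is the entropy -sum p log p over those probabilities.
    """
    Dw = doc_sets[w]
    ce = 0.0
    for t, Dt in doc_sets.items():
        if t == w:
            continue  # t = w gives p = 1, which contributes 0 anyway
        p = len(Dw & Dt) / len(Dw)
        if p > 0:  # treat 0 * log 0 as 0
            ce -= p * math.log(p)
    return ce

# Hypothetical toy data for the "jaguar" example:
docs = {
    "big cats": {1, 2, 3},
    "car":      {4, 5, 6},
    "mac os":   {7, 8},
    "clubs":    {4, 5, 7, 8},
}
print(cluster_entropy(docs, "big cats"))  # 0.0: no overlap, distinct topic
print(cluster_entropy(docs, "clubs"))     # ≈ 0.693: overlaps "car" and "mac os"
```

Note that each $p_t$ is only one term; the sum runs over all phrases $t$ before the result means anything.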

  • I understand what the equation does, but what I don't understand is the actual notation. Say $\frac{|D(w)\cap D(t)|}{|D(w)|} = 0.12$; is the equation then CE = -(0.12*(log(0.12))), thereby giving an answer of +0.9208...? 2012-07-27
  • No. You will have a set of probabilities parametrized by $t$: per $t$ you get one $p_t$. The entropy you finally compute will be between $0$ and $\log N$, where $N$ is the cardinality of your set of probabilities. 2012-07-27
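To make the bound in that last comment concrete, here is a small numeric sketch. The three probabilities are hypothetical (they would have to come from $|D(w)\cap D(t)|/|D(w)|$ for three phrases $t$), and the natural logarithm is assumed; the point is that 0.12 alone is only one term of the sum.

```python
import math

# Hypothetical overlap probabilities p_t for three phrases t.
p = [0.12, 0.50, 0.38]

# One term alone: -0.12 * log(0.12) ≈ 0.2544, not 0.9208.
single_term = -p[0] * math.log(p[0])

# The full CE sums every term, and is bounded above by log N (N = 3 here).
ce = -sum(pt * math.log(pt) for pt in p if pt > 0)
print(single_term, ce, math.log(3))
```

Running this shows the full sum (about 0.969) staying below $\log 3 \approx 1.099$, as the comment states.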