
I need to calculate the cluster entropy to represent the distinctness of a phrase in a set of documents. I understand the meaning behind it, but I don't understand the notation of the equation itself, in particular the order in which it should be calculated. Could someone please explain what it means? Here is the definition:

For a given phrase $w$, the corresponding document set $D(w)$ might overlap with other sets $D(w_i)$ where $w_i \neq w$. At one extreme, if $D(w)$ is evenly distributed across the $D(w_i)$, $w$ might be too general a phrase to be a good salient phrase. At the other extreme, if $D(w)$ seldom overlaps with the $D(w_i)$, $w$ may have some distinct meaning. Take the query "jaguar" as an example: "big cats" seldom co-occurs with other salient keywords such as "car", "mac os", etc., so the corresponding documents may constitute a distinct topic. However, "clubs" is a more general keyword which may co-occur with both "car" and "mac os", and thus should have a lower salience score. We use Cluster Entropy (CE) to represent the distinctness of a phrase:

$CE = - \sum_t \frac{|D(w)\cap D(t)|}{|D(w)|} \log \frac{|D(w)\cap D(t)|}{|D(w)|}$
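To make the order of operations concrete, here is a small worked example with made-up counts (the numbers are purely illustrative): suppose $|D(w)| = 10$ and there are two other phrases $t_1, t_2$ with $|D(w)\cap D(t_1)| = 5$ and $|D(w)\cap D(t_2)| = 2$. For each $t$ you form the overlap fraction, take its logarithm, multiply the two, sum over all $t$, and negate the total:

$CE = -\left(\frac{5}{10}\log\frac{5}{10} + \frac{2}{10}\log\frac{2}{10}\right) \approx 0.347 + 0.322 \approx 0.669$

(using the natural log here; the paper may intend base 2, which only rescales the value).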

1 Answer


It is very simple. Here is the definition again:

$CE = - \sum_t \frac{|D(w)\cap D(t)|}{|D(w)|} \log \frac{|D(w)\cap D(t)|}{|D(w)|}$

$|D(w)\cap D(t)|$ is the number of documents that contain both the phrase $w$ and another phrase $t$, i.e. the size of the intersection of their document sets. Dividing it by $|D(w)|$, the total number of documents containing $w$, gives you an estimated overlap probability $p_t$ for each $t$.

Plugging these into the definition of entropy then gives $H = -\sum_t p_t \log p_t$, which is exactly the CE above.
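A minimal sketch in Python may make the two steps explicit. The phrase names and document sets below are invented for illustration, and it uses the natural log (the paper may intend base 2, which only changes the scale):

    import math

    # Invented example data: each phrase maps to the set of ids of the
    # documents it occurs in (this plays the role of D(w) above).
    D = {
        "big cats": {1, 2, 3},
        "car": {4, 5, 6, 7},
        "mac os": {8, 9, 10},
        "clubs": {1, 4, 8, 9},
    }

    def cluster_entropy(w):
        """CE(w) = -sum over t != w of p_t * log(p_t),
        with p_t = |D(w) & D(t)| / |D(w)|."""
        dw = D[w]
        ce = 0.0
        for t, dt in D.items():
            if t == w:
                continue
            p = len(dw & dt) / len(dw)  # step 1: estimated overlap probability p_t
            if p > 0:                   # convention: 0 * log 0 = 0
                ce -= p * math.log(p)   # step 2: accumulate -p_t * log(p_t)
        return ce

    for w in D:
        print(w, round(cluster_entropy(w), 3))

With these made-up sets, "clubs" gets a noticeably higher CE than "big cats", matching the intuition in the question that general phrases overlap with many others.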

  • To be precise: you get one probability $p_t$ per phrase $t$, so you have a set of probabilities parametrized by $t$. The entropy you finally obtain will lie between $0$ and $\log N$, where $N$ is the cardinality of your set of probabilities. (2012-07-27)