I need to calculate the cluster entropy, which represents the distinctness of a phrase in a set of documents. I understand the idea behind it, but I don't understand the notation of the equation itself, in particular the order in which it should be evaluated. Could someone please explain what it means? Here is the definition:
For given phrase w, the corresponding document set D(w) might
overlaps with other D(wi) where wi≠w. At one extreme, if D(w) is
evenly distributed in D(wi), w might be a too general phrase to be
a good salient phrase. At the other extreme, if D(w) seldom
overlaps with D(wi), w may have some distinct meaning. Take
query "jaguar" as an example, "big cats" seldom co-occur with
other salient keywords such as "car", "mac os", etc. Therefore the
corresponding documents may constitute a distinct topic.
However, "clubs" is a more general keyword which may co-occur
with both "car" and "mac os", thus it should have less salience
score.
We use Cluster Entropy (CE) to represent the distinctness of a
phrase
$$CE = -\sum_t \frac{|D(w)\cap D(t)|}{|D(w)|} \log \frac{|D(w)\cap D(t)|}{|D(w)|}$$
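
To make the order of evaluation concrete, here is a minimal Python sketch of how I read the formula: for each phrase t, compute the fraction of D(w) that also lies in D(t), multiply that fraction by its own logarithm, add up the products over all t, and negate the total. Several things here are my assumptions rather than part of the quoted definition: the function name `cluster_entropy`, the dictionary `doc_sets` mapping each phrase to a set of document ids, the use of the natural logarithm, the convention that a term with zero overlap contributes 0, and the made-up document ids in the example.

```python
import math


def cluster_entropy(w, doc_sets):
    """Cluster entropy of phrase w (a sketch of the quoted formula).

    doc_sets is a hypothetical mapping: phrase -> set of document ids.
    """
    d_w = doc_sets[w]
    if not d_w:
        return 0.0  # assumption: a phrase with no documents gets entropy 0

    total = 0.0
    for t, d_t in doc_sets.items():
        # Step 1: fraction of D(w) that overlaps with D(t).
        p = len(d_w & d_t) / len(d_w)
        # Step 2: add p * log(p); skip p == 0 (assumed convention 0 * log 0 = 0).
        # When t == w, p is 1 and log(1) is 0, so that term adds nothing.
        if p > 0:
            total += p * math.log(p)
    # Step 3: negate the sum.
    return -total


# Made-up document ids, loosely following the "jaguar" example.
docs = {
    "big cats": {1, 2, 3, 4},
    "car": {5, 6, 7},
    "mac os": {8, 9},
    "clubs": {4, 5, 6, 8},
}

for phrase in docs:
    print(phrase, cluster_entropy(phrase, docs))
```

With these toy sets, "clubs" overlaps with every other phrase and comes out with a higher entropy than "big cats", which matches the quoted description of a more general, less distinct phrase.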