I need to calculate the cluster entropy to represent the distinctness of a phrase in a set of documents. I understand the idea behind it, but I don't understand the notation of the equation itself, in particular the order in which it should be evaluated. Could someone please explain what it means? Here is the definition:
For a given phrase w, the corresponding document set D(w) might overlap with other D(wi) where wi≠w. At one extreme, if D(w) is evenly distributed in D(wi), w might be too general a phrase to be a good salient phrase. At the other extreme, if D(w) seldom overlaps with D(wi), w may have some distinct meaning. Take the query "jaguar" as an example: "big cats" seldom co-occurs with other salient keywords such as "car", "mac os", etc. Therefore the corresponding documents may constitute a distinct topic. However, "clubs" is a more general keyword which may co-occur with both "car" and "mac os", thus it should have a lower salience score. We use Cluster Entropy (CE) to represent the distinctness of a phrase:
$CE = - \sum_t \frac{|D(w)\cap D(t)|}{|D(w)|} \log \frac{|D(w)\cap D(t)|}{|D(w)|}$
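
One possible reading of the order of evaluation, as a minimal sketch rather than the paper's implementation: for each other phrase t, compute the fraction p_t = |D(w) ∩ D(t)| / |D(w)|, multiply it by log p_t, sum these products over all t, and negate the total at the end. The sketch below rests on assumptions the quoted definition does not state: the sum skips t = w, the logarithm is natural, and a term with an empty intersection counts as 0 (the usual 0·log 0 = 0 convention). The function name `cluster_entropy` and the `docs_by_phrase` mapping are hypothetical names used only for illustration.

```python
import math


def cluster_entropy(w, docs_by_phrase):
    """Cluster entropy of phrase w, given a dict mapping phrase -> set of doc ids.

    Assumptions (not stated in the quoted definition): the sum runs over
    phrases t != w, the log is natural, and empty intersections contribute 0.
    """
    d_w = docs_by_phrase[w]
    if not d_w:
        return 0.0  # no documents contain w, so there is nothing to measure

    total = 0.0
    for t, d_t in docs_by_phrase.items():
        if t == w:
            continue  # assumption: skip the phrase itself
        # Fraction of w's documents that also contain t.
        p = len(d_w & d_t) / len(d_w)
        if p > 0:
            total += p * math.log(p)  # each term is <= 0 because p <= 1
    return -total  # the leading minus sign makes the result non-negative


if __name__ == "__main__":
    # Hypothetical toy data: document ids are arbitrary integers.
    docs = {
        "big cats": {1, 2, 3, 4},
        "car": {4, 5, 6},
        "mac os": {7, 8},
        "clubs": {3, 4, 5, 6, 7},
    }
    for phrase in docs:
        print(phrase, cluster_entropy(phrase, docs))
```

Note that the fractions p_t are taken relative to |D(w)| only, so they need not sum to 1 across t; the formula is entropy-like rather than the entropy of a proper probability distribution.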