I'm computing the cosine similarity between two vectors.

These vectors represent an information retrieval query and a document, respectively.

They have been computed using tf/idf weights.

Since my documents have different lengths, the tf/idf weights are theoretically unbounded.

The question is: is cosine similarity still a valid measure? Can I compare the cosine similarities computed for each document?

thanks

  • Try asking at http://metaoptimize.com/qa. It's the Q&A forum for machine-learning-related topics, including information retrieval. It's just a hunch, but if your vectors are defined over the entire vocabulary, and the elements corresponding to words that don't appear in the document are set to zero, then I don't see why you'd have trouble using cosine similarity. (2011-05-28)

1 Answer

If I read Wikipedia right, tf/idf is not unbounded. With tf defined as a term's count divided by the document length, $\text{tf} \le 1$ (it equals 1 only if every word in the document is the same), and $\text{idf} \le \log N$, where $N$ is the number of documents, with equality when only one document contains the term. Despite the slash in "tf/idf", the two factors are multiplied, so each weight is at most $\log N$.
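To make the bound concrete, here is a minimal sketch in Python. It assumes the length-normalized definition of tf (count divided by document length) and $\text{idf} = \log(N / \text{df})$; other tf/idf variants exist and may not satisfy the same bound. The function names `tf_idf_vectors` and `cosine` and the toy corpus are made up for illustration. It also shows that cosine similarity ignores the overall scale of a vector, which is why differing document lengths are less of a concern for it.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build tf-idf vectors over the whole vocabulary.

    Assumes tf = count / document length and idf = log(N / df),
    and that every document contains at least one word.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    # Document frequency: number of documents containing each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Words absent from the document get weight 0 (their count is 0).
        vectors.append(
            [(counts[w] / len(tokens)) * math.log(n_docs / df[w]) for w in vocab]
        )
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus with documents of different lengths.
docs = ["apple banana apple", "banana cherry", "cherry cherry cherry date"]
vocab, vectors = tf_idf_vectors(docs)

# Upper bound on any single weight under the assumed definitions.
bound = math.log(len(docs))

# Rescaling a vector leaves its cosine similarity to another vector unchanged.
scaled = [10 * w for w in vectors[0]]
same = math.isclose(cosine(scaled, vectors[1]), cosine(vectors[0], vectors[1]))
```

Because all the weights here are non-negative, every cosine value falls in $[0, 1]$, so the similarities of one query against several documents are on the same scale and can be ranked against each other.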