I am working on my bachelor thesis where I attemp to create a focused web crawler. My program is finished and now I am a bit stuck with computing ranking (or rating) of single pages. I am not trying to build an index of pages or anything. All I want is to rate all pages independently (rating of one page does not depend on the others) according to term frequencies of searched words.
At the moment I am using very simple formula:
$rating = tfidf + a_1 tf_1 + ... + a_n tf_n$
tfidf... tf-idf cosine distance of this page from other pages (this compoment actually depends on other pages but it is a very small number and you can ignore it in this question), a_i... coefficients (some big number, like 100-1000), tf_i... term frequency of word_i I am searching on current page
Lets assume that n is usually 1-4 (I rarely search more than 4 words at once). Right now, all coefficients a_i are the same, but that does not satisfy my needs.
I am looking for a math formula, that would meet my requirements
1) If page A contains all words I am searching it should have better rating than page B which contains only one searched word (which happens to have very high tf). 2) I dont want the coefficients to reach 0 even if the page contains only one of all the words I am searching - page A with one word_i should have better rating than page B with no searched words at all.
In other words- I would like for a coefficient $a_i$ to depend on $tf_j$, where $i \ne j$.
EXAMPLE
searched words: orange, banana, apple (Its a dumb example, I know - sorry)
page A: 5 x orange, 1 x banana, 0 x apple (lets assume that all pages have the same amount of words in them)
page B: 1x orange, 1 x banana, 1 x apple
page C: 100 x orange, 0 x banana, 0 x apple
page D: none of those
I want to use my formula rating = tfidf + a_1*tf_1 + ... + a_n*tf_n so that B>A>C>D
I would be glad for any ideas. I hope that this whole question is not too confusing (if so, I will try to re-edit). Thanks
