2
$\begingroup$

I am working on my bachelor thesis where I attemp to create a focused web crawler. My program is finished and now I am a bit stuck with computing ranking (or rating) of single pages. I am not trying to build an index of pages or anything. All I want is to rate all pages independently (rating of one page does not depend on the others) according to term frequencies of searched words.

At the moment I am using very simple formula:

$rating = tfidf + a_1 tf_1 + ... + a_n tf_n$

tfidf... tf-idf cosine distance of this page from other pages (this compoment actually depends on other pages but it is a very small number and you can ignore it in this question), a_i... coefficients (some big number, like 100-1000), tf_i... term frequency of word_i I am searching on current page

Lets assume that n is usually 1-4 (I rarely search more than 4 words at once). Right now, all coefficients a_i are the same, but that does not satisfy my needs.

I am looking for a math formula, that would meet my requirements

1) If page A contains all words I am searching it should have better rating than page B which contains only one searched word (which happens to have very high tf). 2) I dont want the coefficients to reach 0 even if the page contains only one of all the words I am searching - page A with one word_i should have better rating than page B with no searched words at all.

In other words- I would like for a coefficient $a_i$ to depend on $tf_j$, where $i \ne j$.

EXAMPLE

searched words: orange, banana, apple (Its a dumb example, I know - sorry)

page A: 5 x orange, 1 x banana, 0 x apple (lets assume that all pages have the same amount of words in them)

page B: 1x orange, 1 x banana, 1 x apple

page C: 100 x orange, 0 x banana, 0 x apple

page D: none of those

I want to use my formula rating = tfidf + a_1*tf_1 + ... + a_n*tf_n so that B>A>C>D

I would be glad for any ideas. I hope that this whole question is not too confusing (if so, I will try to re-edit). Thanks

1 Answers 1

1

So having all words once should be more important than having all words but one even though the other words occur much more frequent?

What about introducing coefficients $b_i$ and add $b_i\cdot c_i$ where $c_i$ is a boolean that checks whether a page contains $i$ words. Then $b_n \cdot c_n$ checks if a page contains all words and you assign a large value to $b_n$. The value $c_n$ would simply be 1 or 0. So your rating becomes something like \begin{align} \text{rating}= \text{tfidf} + \sum_{i=1}^n a_i \cdot \text{tf}_i+ \sum_{j=1}^n b_j \cdot \text{c}_j \end{align} Applying this to your example. Lets take $a_i\approx 100$ for all $i$ and let $\text{tfidf}$ be small. Then C only contains 1 matched word, but B has 1 matched, 2 matched and even 3 matched words. $\text{rating(C)} \approx \text{tfidf}+ 100\cdot 100 +b_1\approx 10000+b_1$ $\text{rating(B)} \approx \text{tfidf}+ 3\cdot 100 + b_1+b_2+b_3\approx 300+b_1+b_2+b_3$ Now chose $b_n>>b_{n-1}$ and you get $\text{rating(B)} >\text{rating(C)}$.