1
$\begingroup$

I was playing around with using pagerank on graphs to make inferences and the results seemed pretty good but I thought the results would improve if I deleted the lowest scoring nodes(which tend to be not good results generally).

I thought this would improve the results but it seems like the opposite is happening, my inferences are very inaccurate now.

So my question, generally is it better to keep the universe as large as possible when making inferences or I am most likely doing something wrong here?

Thanks,

Note: I'm a bit new to math so I'm asking this not only specific to pagerank but to other inference/ranking type algo's. I always thought cleaning up the data is good but not sure now.

  • 0
    okay, I'm a bit slow, are you implying as I remove the pennies the relatitive spread between the high and the low gets less and less and thus messes my inferences?2012-02-01

1 Answers 1

2

Removing nodes may or may not improve the quality of the ranking.

  1. Consider a graph with vertices a, a', b_1\cdots b_{1000}, b'_1\cdots b'_{50}, c_1\cdots c_{1000}. With edges a' \to a, $b_i\to a$. b'_i \to a' and $c_i \to b_i$. If you prune the graph to remove all vertices with no incoming links (that would be $c_i$ and $b'_i$), then the weight on a' would drop compare to $a$. And you may say that this gives "better results".
  2. On the flip side, consider your initial graph as the a, a', b_1\cdots b_{1000}, b'_1\cdots b'_{50}, c', d' with a' \to a, $b_i \to a$, b'_i \to a', c' \to a' and d' \to c'. PageRank should clearly show that $a$ is the best. But if you remove all vertices with no incoming links, you are left with a,a',c',d' and the ranking of $a$ and a' would get closer.

Basically, the problem is that PageRank is computed recursively. So depending on the exact nature of your graph, how the effect of removing some low scoring nodes propagate can differ. To give another contrived example:

Let your graph be given by $a_1,\ldots, a_{1000}, b_1, \ldots b_{1000}, c_1, \ldots, c_5, d_1, \ldots d_{100}$. Let the arrows go $a_{i+1} \to a_i$ and $b_{i+1}\to b_i$. And let $c_k \to a_{1000}$ and $d_k\to b_{1000}$. Then in this scheme $b_1$ is a better ranked object than $a_1$. But if you remove just 5% of the vertices by removing all of $c_i$ and $d_i$, then PageRank can't tell $a_1$ and $b_1$ apart.

  • 0
    Thanks Willie. I get the factors I need to consider now. Thanks so much.2012-02-02