2
$\begingroup$

I am a computer programmer and I am building a search engine for a client. Right now I am puzzling myself about the order in which I should return search results. There are two obvious orderings:

  • Relevancy - Documents that contain most overlap with the query text first.
  • Popularity - Documents that are more popular first.

Both of these orderings are too extreme, I would rather have something that is "right in the middle of both of them". Is there any mathematically significant middle ground between these orderings?

Let's take a slightly more mathematically rigorous stab at it.

  • Let $D$ be the set of all documents returned in a given search
  • Define a relevancy function: $rel:D\rightarrow\mathbb{R}$ that imposes an ordering upon $D$
  • Define a popularity function: $pop:D\rightarrow\mathbb{R}$ that imposes an ordering upon $D$
  • Define a mixture function: $mix(d;\alpha)=\alpha\cdot rel(d)+(1-\alpha)\cdot pop(d)$

The goal, then, is to determine $\alpha$ such that the ordering imposed upon $D$ by $mix$ is at the "special, magically, just right point" between the ordering imposed by $rel$ and the ordering imposed by $pop$. (Admittedly, my mathematical rigor has broken down here just a bit.)

Thinking naïvely about this, $\alpha=\frac{1}{2}$ seemed to be a good choice, but then I realized that this doesn't take scaling into account at all. For instance, if $pop$ maps $D$ to a much broader range of values than $rel$, then it's possible that the $mix$ ordering will be identical to the $pop$ ordering!

So is there any "sweet-spot" for $\alpha$? Is there a way to determine it? The one I'm visualizing right now has to do with the permutation matrices that map the $rel$ ordering to the $mix$ ordering, (denoted $P_r$), and the permutation matrices that map the $pop$ ordering to the $mix$ ordering, (denoted $P_p$). If $\alpha=1$ then $P_r$ is the identity matrix and $P_p$ is "as scattered as possible". If $\alpha=0$ then $P_r$ is "as scattered as possible" and $P_p$ is the identity matrix. Is there some value for $\alpha$ that balances these two extremes, so that $P_r$ and $P_p$ reside as close as possible to the identity matrix without hurting the other? Is there any other notable points between the $rel$ and $pop$ ordering that I haven't considered?

  • 0
    The reason I ask about how _pop_ is calculated is that I'm wondering if it's a global popularity (like simple ratings), or if it's more localized (like based on how people click on search results). I believe that the ideal mixture would be different in each case. I have a hard time seeing why a linear mixture is even correct with a global popularity. For example, a perfect match that isn't popular should probably be higher in the results than a weak match that's extremely popular. You need a function that when plotted out gives the desired results.2011-08-15

1 Answers 1

3

One crude but possibly effective solution might be to rescale $rel$ and $pop$ so that, say, the most relevant/popular document gets the value $1$, the next on each scale gets the value $2$, and so on. Then do your averaging thing.

A slightly less brutal adjustment would be to linearly scale $rel$ and $pop$ so that they each have the same mean and variance. (Basically, you subtract the mean and divide by the standard deviation.) This can still fail if the higher moments of the distributions are very different, but it might work well if both $rel$ and $pop$ can be expected to be more or less similarly distributed after scaling.

  • 0
    @John Berryman: Even if you can't control the original relevancy/popularity functions, surely you can define your own functions based on them?2011-08-14