2
$\begingroup$

I am a computer programmer and I am building a search engine for a client. Right now I am puzzling myself about the order in which I should return search results. There are two obvious orderings:

  • Relevancy - Documents that contain most overlap with the query text first.
  • Popularity - Documents that are more popular first.

Both of these orderings are too extreme, I would rather have something that is "right in the middle of both of them". Is there any mathematically significant middle ground between these orderings?

Let's take a slightly more mathematically rigorous stab at it.

  • Let $D$ be the set of all documents returned in a given search
  • Define a relevancy function: $rel:D\rightarrow\mathbb{R}$ that imposes an ordering upon $D$
  • Define a popularity function: $pop:D\rightarrow\mathbb{R}$ that imposes an ordering upon $D$
  • Define a mixture function: $mix(d;\alpha)=\alpha\cdot rel(d)+(1-\alpha)\cdot pop(d)$

The goal, then, is to determine $\alpha$ such that the ordering imposed upon $D$ by $mix$ is at the "special, magically, just right point" between the ordering imposed by $rel$ and the ordering imposed by $pop$. (Admittedly, my mathematical rigor has broken down here just a bit.)

Thinking naïvely about this, $\alpha=\frac{1}{2}$ seemed to be a good choice, but then I realized that this doesn't take scaling into account at all. For instance, if $pop$ maps $D$ to a much broader range of values than $rel$, then it's possible that the $mix$ ordering will be identical to the $pop$ ordering!

So is there any "sweet-spot" for $\alpha$? Is there a way to determine it? The one I'm visualizing right now has to do with the permutation matrices that map the $rel$ ordering to the $mix$ ordering, (denoted $P_r$), and the permutation matrices that map the $pop$ ordering to the $mix$ ordering, (denoted $P_p$). If $\alpha=1$ then $P_r$ is the identity matrix and $P_p$ is "as scattered as possible". If $\alpha=0$ then $P_r$ is "as scattered as possible" and $P_p$ is the identity matrix. Is there some value for $\alpha$ that balances these two extremes, so that $P_r$ and $P_p$ reside as close as possible to the identity matrix without hurting the other? Is there any other notable points between the $rel$ and $pop$ ordering that I haven't considered?

  • 0
    "For instance, if pop maps D to a much broader range of values than rel, then it's possible that the mix ordering will be identical to the pop ordering!" This is perhaps desirable. If all your pages have basically the same relevance, then you ought to be basing your decision primarily on popularity. If, in general, you value popularity and relevance equally in deciding whether a page is "good", then $\alpha = \frac{1}{2}$ seems to be the only choice.2011-08-13
  • 0
    -@Austin Mohr - The problem, though, is that the scaling of the $pop$ function is arbitrary. You might be right, but I think that in general my "sweet-spot" described in the last paragraph is the best-bet. Of course, all rigor has gone out the window at this point, so perhaps what I want is an enumeration of all the "interesting" points between $\alpha=0$ and $\alpha=1$ - all the interesting points between these two orderings.2011-08-14
  • 0
    Perhaps you can first normalize your data so that all values from $pop$ and $rel$ are between 0 and 1, then average these two normalized scores.2011-08-14
  • 0
    This comment has very little to do with the math. Presumably your client is interested in certain kinds of searches. Pick a few of them. Try different ordering functions. Ask the client which ordering seems to be close to what the client wants. Probably the result will need fine tuning.2011-08-14
  • 1
    You may need to provide an override so that the "top $6$" (or whatever) in either category appear quite early on.2011-08-14
  • 0
    How is _pop_ calculated? If it's based on the clickthroughs of search results, then you should add some randomness to this function so that you're not having a runaway function. Are you using solr for the search, btw, and have you checked out http://wiki.apache.org/solr/FunctionQuery for customizing the relevancy values based on _pop_?2011-08-14
  • 0
    @Ben - the way that $pop$ is calculated is unimportant, because on the next job I'll calculate $pop$ a different way, but be stuck the same issue as we see here: how to best mix the two functions together. I've got an image in my mind of what I think an ideal mixture would be - it has to do with the closeness of $P_p$ and $P_r$ to the identity matrix - but I don't believe that I can explain the concept well without taking time to create images.2011-08-14
  • 0
    The reason I ask about how _pop_ is calculated is that I'm wondering if it's a global popularity (like simple ratings), or if it's more localized (like based on how people click on search results). I believe that the ideal mixture would be different in each case. I have a hard time seeing why a linear mixture is even correct with a global popularity. For example, a perfect match that isn't popular should probably be higher in the results than a weak match that's extremely popular. You need a function that when plotted out gives the desired results.2011-08-15

1 Answers 1