I am a computer programmer and I am building a search engine for a client. Right now I am puzzling myself about the order in which I should return search results. There are two obvious orderings:
- Relevancy - Documents that contain most overlap with the query text first.
- Popularity - Documents that are more popular first.
Both of these orderings are too extreme, I would rather have something that is "right in the middle of both of them". Is there any mathematically significant middle ground between these orderings?
Let's take a slightly more mathematically rigorous stab at it.
- Let $D$ be the set of all documents returned in a given search
- Define a relevancy function: $rel:D\rightarrow\mathbb{R}$ that imposes an ordering upon $D$
- Define a popularity function: $pop:D\rightarrow\mathbb{R}$ that imposes an ordering upon $D$
- Define a mixture function: $mix(d;\alpha)=\alpha\cdot rel(d)+(1-\alpha)\cdot pop(d)$
The goal, then, is to determine $\alpha$ such that the ordering imposed upon $D$ by $mix$ is at the "special, magically, just right point" between the ordering imposed by $rel$ and the ordering imposed by $pop$. (Admittedly, my mathematical rigor has broken down here just a bit.)
Thinking naïvely about this, $\alpha=\frac{1}{2}$ seemed to be a good choice, but then I realized that this doesn't take scaling into account at all. For instance, if $pop$ maps $D$ to a much broader range of values than $rel$, then it's possible that the $mix$ ordering will be identical to the $pop$ ordering!
So is there any "sweet-spot" for $\alpha$? Is there a way to determine it? The one I'm visualizing right now has to do with the permutation matrices that map the $rel$ ordering to the $mix$ ordering, (denoted $P_r$), and the permutation matrices that map the $pop$ ordering to the $mix$ ordering, (denoted $P_p$). If $\alpha=1$ then $P_r$ is the identity matrix and $P_p$ is "as scattered as possible". If $\alpha=0$ then $P_r$ is "as scattered as possible" and $P_p$ is the identity matrix. Is there some value for $\alpha$ that balances these two extremes, so that $P_r$ and $P_p$ reside as close as possible to the identity matrix without hurting the other? Is there any other notable points between the $rel$ and $pop$ ordering that I haven't considered?