I have a collection of "systems", indexed $1,\ldots,k$, whose performance I want to compare.
I have a small set of samples on which I can test, for each pair of systems, which one does better, so I get a binary result $I(i,j)$ for each $1 \le i,j \le k$ indicating whether $i$ or $j$ does better on this small sample.
$I(i,j)$ is calculated as follows. We have a collection of samples $X_1,\ldots,X_n$, and for each system $i$ we measure $C(i) = \frac{1}{n} \sum_{m=1}^n l(X_m,i)$, where $l(X_m,i)$ is a measure of how well system $i$ does on sample $X_m$. Then we just check whether $C(i) > C(j)$ or vice versa to get $I(i,j)$.
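To make the setup concrete, here is a minimal sketch in Python with NumPy. The score matrix, the function names, and the simulated data are all hypothetical and only illustrate how $C(i)$ and $I(i,j)$ are formed; systems are indexed from 0 here rather than from 1.

```python
import numpy as np

def empirical_scores(scores: np.ndarray) -> np.ndarray:
    """scores[m, i] holds l(X_m, i); return C(i), the mean over the n samples."""
    return scores.mean(axis=0)

def pairwise_indicators(c: np.ndarray) -> np.ndarray:
    """Return a k-by-k boolean matrix with I[i, j] = True when C(i) > C(j)."""
    return c[:, None] > c[None, :]

# Hypothetical data: n = 20 samples, k = 4 systems (assumed, not from real systems).
rng = np.random.default_rng(0)
n, k = 20, 4
scores = rng.normal(loc=[0.0, 0.1, 0.2, 0.3], scale=1.0, size=(n, k))

C = empirical_scores(scores)   # shape (k,)
I = pairwise_indicators(C)     # shape (k, k), diagonal is always False
```

Note that because every $I(i,j)$ is derived from the same vector of averages, the indicators are automatically consistent with a single ordering of the systems.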
Can I pick, from this collection of binary indicators, the system which is most likely to do best on the expected value over the whole distribution? Are there any theoretical guarantees for that, and do they depend on $k$?
By "do best on the expected value" I am referring to $D(i) = \mathbb{E}[l(X,i)]$, where $X$ is drawn from the underlying distribution: this should be highest for the system I pick.
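One natural candidate rule, stated only to make the question concrete and not as something with a known guarantee, is to pick the system with the most pairwise wins. Since each $I(i,j)$ comes from comparing $C(i)$ and $C(j)$, this coincides with picking the largest $C(i)$ when there are no ties. A sketch with hypothetical values (the function name and numbers are assumptions, and indices start at 0):

```python
import numpy as np

def pick_by_wins(indicators: np.ndarray) -> int:
    """Return the (0-based) index of the system with the most pairwise wins."""
    wins = indicators.sum(axis=1)  # wins[i] = number of j with I[i, j] = True
    return int(np.argmax(wins))

# Hypothetical empirical averages C(i) for k = 3 systems.
c = np.array([0.31, 0.47, 0.12])
I = c[:, None] > c[None, :]        # I[i, j] = True when C(i) > C(j)

best = pick_by_wins(I)             # agrees with np.argmax(c) when there are no ties
```

Whether the system chosen this way also maximizes $D(i)$ with high probability is exactly what I am asking.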