3
$\begingroup$

I have trouble in understanding a simple example following No Free Lunch theorem in James Spall's Introduction to stochastic search and optimization:

My understanding is that a cost function is a mapping from a finite set $\{\theta_i\}$ to a finite set $\{L_j\}$. A search algorithm, when applied to a cost function, is to search for a $\theta_i$ such that the value of the cost function at it is the least.

If there are $N_\theta$ possible values in $\{\theta_i\}$ and $N_L$ possible values in $\{L_j\}$, then by direct enumeration there are $N_L^{N_\theta}$ possible mappings from $\{\theta_i\}$ to $\{L_j\}$.

The NFL theorems indicate that: When averaging over all $N_L^{N_\theta}$ possible mappings from from $\{\theta_i\}$ to $\{L_j\}$, all algorithms work the same (i.e., none can work better than a blind random search).

Example 1.7—NFL implications in a small problem. Suppose that $\Theta = \{\theta_i, \theta_2, \theta_3\}$ and that there are two possible outcomes for the noise-free loss measurements, $\{L_1, L_2\}$. Hence, $N_L^{N_\theta} =2^3 =8$. Table 1.1 summarizes the eight possible mappings.

Note that all rows in Table 1.1 have the same number of $L_1$ and $L_2$ values. Hence, the mean loss value across all eight possible problems is the same regardless of which $\theta_i$ is chosen. Because the algorithm chooses the $\theta_i$, all algorithms produce the same mean loss value when averaging across all possible problems. As a specific instance of the implication of the NFL theorems, suppose that $L_1 < L_2$. Then, an algorithm that puts priority on picking $\theta$ will work well on problems 1, 2, 3, and 7, but poorly on problems 4, 5, 6, and 8. The average performance is the same as an algorithm that puts priority on picking $\theta_2$ or $\theta_3$.

enter image description here

My questions are:

  1. Does the example assume that an algorithm will always output the same element in $\Theta$, regardless of which of the eight loss function it is applied to? (This is the only way that I can make sense out of the example. Otherwise, how shall I understand it?)

    If yes, isn't it contrary to what an algorithm is in reality? For example, for minimizing different cost functions, the gradient descent algorithm outputs different solutions instead of a fixed solution.

    If no, I can construct an algorithm that outputs $\theta_1$ for cost function 1, 2, 3, 7 and 8, $\theta_2$ for cost function 4 and 5, and $\theta_3$ for cost function 6. This algorithm outputs the best solution for each cost function, so when considering its average performance over all cost functions, it is obviously better than many other algorithms. This violates NFL theorem. What is going wrong?

  2. What exactly is a search algorithm? Is it a mapping from some set to another set? .

Thanks and regards!

2 Answers 2

3

I think the answer to 2 will guide you a little better on 1. First off, you should probably read the Wolpert and McReady 1997 paper for a more precise definition than what I'm stating here.

[1] There are only 8 possible mappings between the inputs and the outcomes, each of which is enumerated as a column of the table. Your constructed example assumes that you have more information about the problem (in this case, that you know the cost functions). NFL is restricted to cases where you have no a priori knowledge of the domain, which is not strictly true in many practical applications.

[2] A search algorithm (using the terms defined in your question), is an ordered sequence of the inputs (a tuple in computer science terms).

  • 0
    +1. Thanks! (1) But I don't quite understand [1]. The NFL theorem states that **all algorithms** work the same when averaging over all cost functions, so I think the algorithm constructed by me should work the same as other algorithms, but obviously it is not the case because the constructed algorithm is the best actually. This has nothing to do with knowing the domain with prior knowledge or not. (2) Do you mean the output of a search algorithm is a sequence of elements in $\Theta$? What is its domain? Is it the set of all possible cost functions?2012-02-11
  • 0
    Your algorithm must emit the same search sequence over all possible "cost functions", or it does not qualify under NFL.2012-02-11
  • 0
    Thanks! (1) If that is the case, isn't it contrary to what an algorithm is in reality? For example, the gradient descent algorithm outputs different solutions instead of a fixed one for minimizing different cost functions. (2) If NFL is really that way, how meaningful can it be, since it rules out so many algorithms?2012-02-11
  • 0
    (3) If an algorithm outputs the same regardless of what the cost function might be, I wonder when viewing an algorithm as a mapping, what its domain is? If its domain is exactly the set of all cost functions, it is a constant mapping, isn't it?2012-02-12
1

Question 2, first. In NFL analysis, optimization (search) is decomposed into sampling of the objective function (mapping), and use of the sample to generate an output. It is tacitly assumed that all optimizers under comparison process samples identically. The NFL literature uses the terms search algorithm and optimization algorithm to refer to the sampler alone. When the codomain of the mapping (set of loss values) is finite, the most intuitive representation of a deterministic (non-random) sampler is as a decision tree. I provide one that is a fairly good match for Spall's example in Figure 1 of my 2004 paper on NFL.

Back to Question 1. You can know quite a bit about the mapping, and still not know how to go about optimizing it. For instance, I can tell you that two elements of the domain map to $L_1,$ and that the other maps to $L_2.$ This how many information takes you from eight possible mappings to just three: those in columns 2, 4, and 7. But you are still totally lacking in information as to which inputs are associated with which outputs. That is because the set of remaining functions $F^\prime = \{f_2, f_4, f_7\}$ is closed under permutation. For each $f$ in $F^\prime$,

$$F^\prime = \{f \circ \sigma \mid \sigma \mathrm{~is~a~permutation~of~}\Theta\}.$$

(A permutation is a 1-to-1 correspondence on a set.) The condition under which you cannot justify a choice of sampler is that the set of mappings is, as far as you know, closed under permutation. A good way to understand why things work this way is to scrutinize the decision tree.