
Given five sets:

[A, B, C, D] [M, N, O, P, Q, R] [W, X, Y, Z] [1, 2, 3, 4, 5] [7, 8, 9]

Five combinations are determined at random, e.g. A N Y 1 8 and C Q W 4 9. Each answer is distinct (A is only in one answer, X is only in one answer, etc).

Given only yes or no questions, what is the most efficient way to find all answers?

A simple way of solving this is to ask questions that halve the possible answers, e.g. "Does the first answer contain A or B?" If yes, you know the first set contributes A or B; if no, C or D. It seems like there should be a way to use OR and AND to combine questions (e.g. "does the first answer contain A or B, and does it contain M, N, or O?") to resolve a total solution faster, but my brain can't figure it out.

In my actual situation, the number of sets can change and the number of elements in each set is variable.

PS: I didn't know what tag to put on this question, so please feel free to change it.

  • Comment (2011-09-26): Correct. Once an element from a set has been used in an answer, it cannot be used in another answer. By "combination," I meant five elements, one from each set.

3 Answers


You can view this problem information-theoretically. In essence, you have a message set consisting of sequences of letters/numbers chosen from the five sets. Each element within a set is assumed to be equiprobable. What you are looking for is the ideal "source compression algorithm," which takes a message (or a long sequence of messages) and outputs a string of 0s and 1s (the 0s and 1s represent the answers to your yes/no questions and therefore uniquely identify the message sequence).

Therefore, you can calculate the entropy of the message, and that will give you a lower bound on any possible strategy you can come up with (you should check the information theory literature for a proof; it is basically Shannon's noiseless channel coding theorem). If you are given only a single message and must find out what it is using yes/no questions, your best bet is simply to halve the set of messages at each stage, so you would ask "does the combination contain (A OR B)?" and so on.

The entropy of the messages is $\log_2(4) + \log_2(6) + \log_2(4) + \log_2(5) + \log_2(3) \approx 10.49$ bits.
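For concreteness, a quick computation of this lower bound (a minimal Python sketch, using the set sizes from the question):

```python
import math

# Sizes of the five sets: [A..D], [M..R], [W..Z], [1..5], [7..9]
set_sizes = [4, 6, 4, 5, 3]

# Entropy of one uniformly random combination, in bits:
# log2(4) + log2(6) + log2(4) + log2(5) + log2(3) = log2(1440)
entropy = sum(math.log2(n) for n in set_sizes)
print(entropy)             # about 10.49 bits
print(math.ceil(entropy))  # so at least 11 yes/no questions in the worst case
```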


The fastest strategy is to ask questions which halve the number of possible combinations.

  • Comment (2011-09-27): @Thijs: Indeed, but I did want to point out how to reduce the number of options which had to be listed.

Using only yes or no questions, if there are $n$ distinct possible answers, then you will need at least $k$ questions, where $k$ is the smallest integer such that $2^k \geq n$.

One way to view this is to see the sequence of answers to your $k$ questions as a string of $k$ bits. Since these strings must distinguish at least $n$ possibilities (if you want to be able to identify every answer), you need $2^k \geq n$, i.e. $k$ at least as big as $\log_2(n)$. Of course, this lower bound is exactly the entropy mentioned by svenkatr.

In your example: \begin{align*} k&=\lceil \log_2(4) + \log_2(6) + \log_2(4) + \log_2(5) + \log_2(3)\rceil\\ &=\lceil \log_2(4\cdot6\cdot4\cdot5\cdot3)\rceil\\ &= 11. \end{align*}

So you will need at least 11 questions, no matter how cleverly designed your questions are.

One way to achieve this limit is to use a dichotomy. This is not very practical, since you would need to devise 11 questions that independently separate all the answers into two groups. As mentioned by svenkatr, you could always enumerate all the possible answers and use their position written in binary: your $i$-th question would then be something like "is the $i$-th binary digit of the position of the answer a zero?".
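The binary-position scheme can be sketched as follows (a Python illustration; the `identify` function and the simulated oracle are illustrative names, not from the answer):

```python
import itertools

sets = [list("ABCD"), list("MNOPQR"), list("WXYZ"),
        [1, 2, 3, 4, 5], [7, 8, 9]]

# Enumerate all possible combinations in a fixed order.
combos = list(itertools.product(*sets))   # 4*6*4*5*3 = 1440 combinations
k = (len(combos) - 1).bit_length()        # 11 bits index all of them

def identify(answer_yes_no):
    """Reconstruct the secret from k yes/no answers, where
    answer_yes_no(i) answers: 'is bit i of the secret's index 1?'"""
    pos = 0
    for i in range(k):
        if answer_yes_no(i):
            pos |= 1 << i
    return combos[pos]

# Simulated oracle for a hidden combination:
secret = ('A', 'N', 'Y', 1, 8)
oracle = lambda i: (combos.index(secret) >> i) & 1 == 1
print(identify(oracle))   # ('A', 'N', 'Y', 1, 8), after exactly 11 questions
```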

That provides an optimal worst-case algorithm. But you could surely do better on average by keeping track of the remaining possible answers at each step and adapting your questions to them.
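A minimal sketch of such an adaptive strategy (hypothetical Python; the `adaptive_guess` helper and the list-membership form of the question are illustrative choices, not from the answer):

```python
import itertools

sets = [list("ABCD"), list("MNOPQR"), list("WXYZ"),
        [1, 2, 3, 4, 5], [7, 8, 9]]
combos = list(itertools.product(*sets))

def adaptive_guess(oracle_yes_no, candidates):
    """Repeatedly ask whether the secret lies in the first half of the
    remaining candidates; each answer halves the search space."""
    questions = 0
    remaining = list(candidates)
    while len(remaining) > 1:
        half = remaining[:len(remaining) // 2]
        questions += 1
        if oracle_yes_no(half):        # "is your combination in this list?"
            remaining = half
        else:
            remaining = remaining[len(remaining) // 2:]
    return remaining[0], questions

secret = ('C', 'Q', 'W', 4, 9)
found, asked = adaptive_guess(lambda half: secret in half, combos)
print(found, asked)   # worst case: ceil(log2(1440)) = 11 questions
```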