4
$\begingroup$

An Internet search engine looks for a keyword in 9 databases, searching them in random order. Only 5 of these databases contain the given keyword. What is the probability that it will be found in at least 2 of the first 4 searched databases?

What I tried is: Let X: the event that the keyword is to be found in at least 2 databases, and Y: the event that we search the first 4 databases, then we need to find a conditional probability $P(X\geq 2 | Y)=P(X= 2,3,\text{ or }4 | Y)=\frac{P(\{X\geq 2\} \cap Y)}{P(Y)}$ but then I don't know how to continue?

Any help...

3 Answers 3

3

Added: Henry's and my answers, while the approaches are different, actually give the same result! To see why, let's do the general case with $N$ databases, $M$ of which contain the keyword, and us choosing $K$ databases. Let $X$ be the number of the $K$ databases chosen that contain the keyword. For $P(X = x)$, Henry's and my logic give the following answers.

$$\text{ Henry: } P(X = x) = \frac{\binom{M}{x}\binom{N-M}{K-x}}{\binom{N}{K}}; \text{ me: } P(X = x) = \frac{\binom{K}{x} \binom{N-K}{M-x}}{\binom{N}{M}}.$$

But

$$\frac{\binom{M}{x}\binom{N-M}{K-x}}{\binom{N}{K}} = \frac{M!}{x! (M-x)!} \frac{(N-M)!}{(K-x)! (N-M-K+x)!} \frac{K! (N-K)!}{N!}$$ $$= \frac{K!}{x! (K-x)!} \frac{(N-K)!}{(M-x)! (N-K-M+x)!} \frac{M! (N-M)!}{N!} = \frac{\binom{K}{x} \binom{N-K}{M-x}}{\binom{N}{M}}.$$

Thus the two answers give the same result.


(Original answer.)

I would approach it differently. Let $X$ be the number of the first four databases in which the keyword is found. You want $P(X \geq 2)$.

$$P(X = 2) = \frac{\binom{4}{2} \binom{5}{3}}{\binom{9}{5}}$$ because there are $\binom{4}{2}$ ways that 2 of the first 4 databases can contain the keyword, $\binom{5}{3}$ ways that 3 of the last 5 databases can contain the keyword, and $\binom{9}{5}$ ways that 5 of the 9 total databases to contain the keyword.

I'll let you calculate $P(X = 3)$ and $P(X = 4)$.

(FYI, $X$ has what's called a hypergeometric distribution.)

  • 0
    So, in general: $P(X=k)=\frac{\binom{4}{k} \binom{5}{5-k}}{\binom{9}{5}} $2011-09-10
  • 0
    @Jess: Yep. That's the form of the probability mass function for a hypergeometric distribution.2011-09-10
  • 0
    I feel that there is something wrong above! Are we choosing 4 items and not 5?!2011-09-10
  • 0
    @Jess: I was looking at the problem from the point of view of placing 5 keywords into 9 databases. So we are choosing 5 items (databases) in which to place the keywords.2011-09-10
3

You are going to have to work out the probabilities of getting the keyword exactly $n$ times, and either add up the probabilities for $n=2$, $3$ or $4$, or subtract from $1$ the probabilities for $n=1$ or $0$.

If $N$ is the number of dictionaries where the keyword is found then $$\Pr(N=n) = \frac{{5 \choose n}{4 \choose 4-n}}{9 \choose 4}$$ since you have to choose $n$ dictionaries from $5$ with the keyword and $4-n$ from $4$ without, choosing $4$ from $9$ overall.

So for $n=0,1,2,3,4$ this gives probabilities $\frac{1}{126},\frac{20}{126},\frac{60}{126},\frac{40}{126},\frac{5}{126}$.

Add up the last three (and change to lowest terms) and you have your solution.

  • 0
    @Mike: I don't know: the value is the same but the words are not. You have ${5 \choose 3}$ where I have ${5 \choose 2}$, and you say "ways that 3 of the last 5 databases can contain the keyword" while I say "you have to choose 2 dictionaries from 5 with the keyword". Similarly you have ${9 \choose 5}$ where I have ${9 \choose 4}$, and you say "ways that 5 of the 9 total databases to contain the keyword" while I say "choosing 4 from 9 overall".2011-09-10
  • 0
    Suppose there were only 2 databases (rather than 5) that contain the keyword. If I'm following your logic correctly, your answer to that modified problem would be $P(X = 2) = \binom{2}{2} \binom{7}{2}/\binom{9}{4}$ (2 from 2 with keyword, 2 from 7 without keyword, and 4 from 9 overall). The answer, though, is $\binom{4}{2}/\binom{9}{2}$ (the number of ways to distribute two keywords among a set of 4 divided by the number of ways to distribute two keywords among a set of 9). So I'm not sure that your logic works in general. (Or am I missing something?)2011-09-10
  • 0
    So I guess it's not the same as my solution after all. (Deleting earlier comment to that effect.)2011-09-10
  • 0
    In general, if there are $d$ dictionaries and $k$ of them have keywords, and you choose $m$ of them then $\Pr(N=n) = \dfrac{{k \choose n}{d-k \choose m-n}}{d \choose n}$.2011-09-10
  • 0
    Huh. Even in the example I sort of picked at random, our logic still gives the same numerical answer! Perhaps our answers are equivalent after all, even when they don't look like it. Unfortunately, I don't have time right now to pursue this any further. +1 from me.2011-09-10
  • 0
    I've added a proof that our answers give the same result.2011-09-10
0

Let $t=9$ be total number of databases, of which $n=4$ have no and $k=5$ have the keyword.

Let $c=4$ databases are chosen and $s$ databases have the keyword.

The question is asking to find $s\ge 2$, i.e. $s=\color{red}2,\color{green}3,\color{blue}4$.

There are ${t\choose c}={9\choose 4}$ ways to choose $c$ databases from total $t$ databases.

Case 1: There are ${k\choose 2}={5\choose \color{red}2}$ ways to choose from $k$ and ${n\choose 2}={4\choose 2}$ ways from $n$, hence: ${5\choose 2}{4\choose 2}$.

Case 2: There are ${k\choose 3}={5\choose \color{green}3}$ ways to choose from $k$ and ${n\choose 1}={4\choose 1}$ ways from $n$, hence: ${5\choose 3}{4\choose 1}$.

Case 3: There are ${k\choose 4}={5\choose \color{blue}4}$ ways to choose from $k$ and ${n\choose 0}={4\choose 0}$ ways from $n$, hence: ${5\choose 4}{4\choose 0}$.

Hence, the required probability is: $$P(s\ge 2)=P(s=2)+P(s=3)+P(s=4)=\\ \frac{{5\choose 2}{4\choose 2}}{9\choose 4}+\frac{{5\choose 3}{4\choose 1}}{9\choose 4}+\frac{{5\choose 4}{4\choose 0}}{9\choose 4}=\\ \frac{60+40+5}{126}=\frac{35}{42}\approx 0.83.$$ Note that you can also use the complement, i.e.: $$P(s\ge 2)=1-P(s=0)-P(s=1).$$

Thus, in general, the formula is: $$P(s\ge r)=\sum_{i=r}^c P(s=i)=\sum_{i=r}^c \frac{{k\choose i}{n\choose c-i}}{{t\choose c}}.$$

Keys: $t$-total, $n$-no keyword, $k$-keyword, $c$-chosen, $s$-success, $r$-minimum required success.