The following question comes from lecture notes. It is not assigned homework, but I am trying to solve it.
Assume that we have a collection $C$ of $N$ documents and a query $q$. There are $R_{q}$ relevant documents in the collection for this query, with $R_{q} < N$. If we select $R_{q}$ documents randomly, what is the probability that we have at least 40% recall and 30% precision?
Note: Does it matter if the documents are picked one after another or all at once? The question does not specify it, but in information retrieval the answer to the query is returned as a set to the user (like in search engines).
Background: Given an answer $A$, which is a set of documents returned for a query $q$ that has $R_{q}$ relevant documents in a collection $C$ of $N$ documents, precision and recall are defined as follows:
$recall(A,C,q) = \frac{|A_{r}|}{R_{q}}$, where $A_{r} \subseteq A$ is the set of documents in the answer that are relevant to the query.
$precision(A,C,q) = \frac{|A_{r}|}{|A|}$
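To make the two definitions concrete, here is a minimal Python sketch. The function names `recall` and `precision` and the example document IDs are my own illustration, not part of the lecture notes; documents are represented as integers and the answer and the relevant documents as sets.

```python
def recall(answer, relevant):
    # |A_r| / R_q: fraction of all relevant documents that were retrieved
    return len(answer & relevant) / len(relevant)


def precision(answer, relevant):
    # |A_r| / |A|: fraction of retrieved documents that are relevant
    return len(answer & relevant) / len(answer)


# Illustrative data: 5 relevant documents, and an answer of the same size
relevant = {1, 2, 3, 4, 5}
answer = {1, 2, 6, 7, 8}

print(recall(answer, relevant))     # 0.4
print(precision(answer, relevant))  # 0.4
```

Because the answer here has exactly as many documents as there are relevant ones, the two values coincide, which is the situation in the question.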
Approach: I have tried an approach, but I think it is incorrect. Given the parameters of the problem, we have $|A| = R_{q}$, therefore
$recall(A,C,q) \geq 0.4 \Rightarrow |A_{r}| \geq 0.4R_{q}$
$precision(A,C,q) \geq 0.3 \Rightarrow |A_{r}| \geq 0.3|A| \Rightarrow |A_{r}| \geq 0.3R_{q}$
In this case, the second condition is redundant. Given that $P(\textrm{document is relevant}) = \frac{R_{q}}{N}$, I deduced that I need the probability that $0.4R_{q}$ of the selected documents are relevant, so the requested probability is
$P(recall(A,C,q) \geq 0.4) = \frac{R_{q}(R_{q}-1)\ldots(R_{q}-0.4R_{q})}{N(N-1)\ldots(N-0.4R_{q})}$
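Any candidate formula can be checked numerically against a simulation. The following is a rough Monte Carlo sketch, not a derivation. It assumes the $R_{q}$ documents are drawn all at once, uniformly and without replacement, and that documents $0, \ldots, R_{q}-1$ are the relevant ones (an arbitrary labeling). The function name `estimate_probability` and the values $N = 20$, $R_{q} = 10$ are hypothetical choices for illustration only.

```python
import random


def estimate_probability(n_docs, n_relevant, min_recall=0.4,
                         min_precision=0.3, trials=50_000, seed=0):
    """Estimate P(recall >= min_recall and precision >= min_precision)
    when n_relevant documents are drawn at random without replacement."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Assumption: the answer is a uniformly random set of size R_q
        answer = rng.sample(range(n_docs), n_relevant)
        # Assumption: documents 0 .. n_relevant-1 are the relevant ones
        relevant_in_answer = sum(1 for doc in answer if doc < n_relevant)
        recall = relevant_in_answer / n_relevant
        precision = relevant_in_answer / len(answer)
        if recall >= min_recall and precision >= min_precision:
            hits += 1
    return hits / trials


# Illustrative parameters only: N = 20 documents, R_q = 10 relevant
print(estimate_probability(20, 10))
```

Evaluating a closed-form expression for the same $N$ and $R_{q}$ and comparing it with this estimate gives a quick indication of whether the expression is plausible.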
- Is my approach correct? If not, can you please pinpoint my error?
- If not, what is a correct approach?