
I will be receiving a stream of items, and I know the sample size I need. When an item arrives, I have to decide immediately whether it goes into the sample or not. I will not get a second chance to add or remove that item. At the end, the sample must contain exactly the required number of items.

I looked at reservoir sampling, which can be distributed. But the sample is only final once the whole stream has arrived, and an item already in the sample may be removed when a new item comes.

Is there an algorithm which works for my case?

1 Answer


You have said you know the sample size you want (let's call it $n$), but you have not said whether you know the total size of the stream.

  • If you do know the total size of the stream (let's call it $m$), then take the $i$th item with probability $\frac{n-j}{m-i+1}$, where $j \le n$ is the number of items you have previously taken. This will give you exactly $n$ items, each with marginal probability $\frac{n}{m}$. Every decision is final, so you will have taken an item exactly $n$ times and never discarded one.

  • If you do not know the total size of the stream, then take the first $n$ items in the stream. After that, for each $i > n$, take the $i$th item with probability $\frac{n}{i}$ and use it to replace one of the $n$ items you are holding (choosing which of the $n$ to replace with equal probability $\frac1n$). This will give you exactly $n$ items in the end, each with marginal probability $\frac{n}{m}$, though you should expect to have taken about $n+n\log_e(\frac{m}{n})$ items over the course of the algorithm, since earlier picks get discarded along the way.
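Both cases above can be sketched in a few lines of Python. This is only a minimal illustration, not a tuned implementation: the function names `selection_sample` and `reservoir_sample`, and the choice to pass in a `random.Random` instance, are my own assumptions. The first function also assumes the stream really does contain exactly $m$ items and that $m \ge n$.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def selection_sample(stream: Iterable[T], n: int, m: int,
                     rng: random.Random) -> Iterator[T]:
    """Known stream length m: decide each item once, never revise.

    Hypothetical helper; assumes the stream has exactly m items and m >= n.
    """
    taken = 0  # j in the text: number of items accepted so far
    for i, item in enumerate(stream, start=1):
        if taken == n:
            break  # sample is complete, remaining items are skipped
        # Accept with probability (n - j) / (m - i + 1).
        # When the remaining items equal the remaining slots this is 1,
        # so exactly n items are always accepted.
        if rng.random() * (m - i + 1) < n - taken:
            taken += 1
            yield item


def reservoir_sample(stream: Iterable[T], n: int,
                     rng: random.Random) -> List[T]:
    """Unknown stream length: keep n items, replacing held ones as you go."""
    reservoir: List[T] = []
    for i, item in enumerate(stream, start=1):
        if i <= n:
            reservoir.append(item)  # the first n items are always taken
        elif rng.random() * i < n:  # take item i with probability n / i
            # Replace one held item, chosen uniformly (probability 1/n each).
            reservoir[rng.randrange(n)] = item
    return reservoir


if __name__ == "__main__":
    rng = random.Random()
    print(list(selection_sample(range(1000), 10, 1000, rng)))
    print(reservoir_sample(range(1000), 10, rng))
```

Note that only `selection_sample` meets the "no second chance" requirement from the question, because an item it yields is never taken back. `reservoir_sample` has to overwrite earlier picks, which is the price of not knowing $m$.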