1
$\begingroup$

I am reading this algorithm in these notes for counting the number of distinct items in a stream.

From my understanding, the basic idea is that if such number is big enough, the distance between the numbers generated by the hash function will be, on average, the same and equal to $\frac{1}{k+1}$, where $k$ is the number of distinct element. Hence, one can derive $k$ from that. However, I don't follow the reasoning in slide $6$. Why would $2^R$ be "around" $m$?

1 Answers 1

1

In words, I'd say the two preceding lines in the slide say "for a value $v$ much larger than $m$, it is very unlikely that $2^R$ will exceed $v$" respectively "for a value $v$ much smaller than $m$, it is very likely that $2^R$ will exceed $v$" (take $v=2^r$ both times). It does not seem an enormous strecth to conclude that most of the time $2^R$ will not be much larger than $m$ (first clause) but not much smaller than $m$ either (second clause), in other words, fairly close to $m$.

  • 0
    The authors, in the PCSA paper, mention in passing in the Conclusion section: "the binary logarithm of the minimal hashed value encountered (hashed values being considered are real [O; 11 numbers) provides an approximation to log, lln, but the resulting algorithm appears to be slightly less accurate than PCSA". Cannot copy correctly from PDF, sorry about formatting. Sounds like they already knew about what you're onto, as is usual with the elders.2016-03-25