Reservoir Sampling

Getting started

Reservoir sampling concerns with choosing \(k\) items (i.e., a random sample of size \(k\) or \(k\)-sample for short) from a population of size \(n\) (\(n > k\)) without replacement such that each chosen item is selected randomly with equal chance from the population.

Q1. What is the probability of an item being selected in the final k-sample?

Consider an item \(x\) from the population, let \(P_i(x)\) denote the probability that \(x\) is selected at the i-th round.

$$P(x~\text{is selected}) = \sum_{i=1}^kP_i(x)$$

As we sample without replacement, \(x\) is selected at the i-th round means \(x\) isn't selected in previous (i-1) rounds.

Thus, \(P(x~\text{is selected}) = \frac{k}{n}\).

Stream setting

What if the population size is not known in advance as items coming in a stream (which may be infinite) and/or all items can't be stored in memory? Still the requirement is for a new coming item, the chance it is selected in the current sample should be equal to the chances of any item coming before, i.e., if we have seen \(n\) items so far then the chance of a seen item being selected in the \(k\)-sample is \(k/n\). The first \(k\) items are always selected. Here comes the idea of sampling with adaptive probability, which I found very elegant!

Sampling process

How does it work?

We can prove by induction that at the \(i^{th}\) round (\(i \geq k+1\)), the probability that a seen item being selected is \(k/i\). Here is the sketch:

We just showed that claim is also true at the round \(i+1\). Hence the claim is true for all rounds thanks to the induction principle.

Implementation