Getting started
Reservoir sampling concerns with choosing \(k\) items (i.e., a random sample of size \(k\) or \(k\)-sample for short) from a population of size \(n\) (\(n > k\)) without replacement such that each chosen item is selected randomly with equal chance from the population.
Q1. What is the probability of an item being selected in the final k-sample?
Consider an item \(x\) from the population, let \(P_i(x)\) denote the probability that \(x\) is selected at the i-th round.
As we sample without replacement, \(x\) is selected at the i-th round means \(x\) isn't selected in previous (i-1) rounds.
- \(P_1(x) = \frac{1}{n}\)
- \(P_2(x) = (1-P_1(x)) \times \frac{1}{n-1} = \frac{n-1}{n} \times \frac{1}{n-1} = \frac{1}{n}\)
- \(P_3(x) = (1-P_1(x)-P_2(x)) \times \frac{1}{n-2} = \frac{1}{n}\)
- \(\cdots\)
- \(P_k(x) = \frac{1}{n}\)
Thus, \(P(x~\text{is selected}) = \frac{k}{n}\).
Stream setting
What if the population size is not known in advance as items coming in a stream (which may be infinite) and/or all items can't be stored in memory? Still the requirement is for a new coming item, the chance it is selected in the current sample should be equal to the chances of any item coming before, i.e., if we have seen \(n\) items so far then the chance of a seen item being selected in the \(k\)-sample is \(k/n\). The first \(k\) items are always selected. Here comes the idea of sampling with adaptive probability, which I found very elegant!
Sampling process
- Keep the first \(k\) items in our sample
- For the \(i^{th}\) item (\(i \geq k+1\)), select the item and thus replace an existing item in the current sample with it with probability \(p = k/i\).
How does it work?
We can prove by induction that at the \(i^{th}\) round (\(i \geq k+1\)), the probability that a seen item being selected is \(k/i\). Here is the sketch:
- assume the claim is true at the round \(i\)
-
The new item \(x_{i+1}\) comes at the round \(i+1\), we select it into our current sample with \(p = k/(i+1)\). What is the probability \(p(y)\) of an existing item \(y\) (came in previous rounds) still being selected at this round?
-
before this round, \(p_{before}(y) = k/i\) per the induction assumption
-
for \(y\) still being selected, it needs to survive the round \(i+1\), which is one of the 2 following exclusive cases:
- \(x_{i+1}\) is not selected: \(p_1=(1-\frac{k}{i+1}) = \frac{i+1-k}{i+1}\)
- \(x_{i+1}\) is selected but it doesn't replace \(y\): \(p_2=\frac{k}{i+1} \times \frac{k-1}{k} = \frac{k-1}{i+1}\)
- thus, \(p(y) = p_{before}(y) \times (p_1+p_2) = \frac{k}{i} \times \frac{i}{i+1} = k/(i+1)\)
We just showed that claim is also true at the round \(i+1\). Hence the claim is true for all rounds thanks to the induction principle.
Implementation
- Add the first \(k\) items to the sample list \(L\) in coming order (1-based indexing)
-
Round \(i\): \(x_i\) comes, draw a random integer \(j \in [1,i]\), e.g.,
random.randint(1, i)in Python. -
if \(j <= k\), replace L[j] with the current item \(x_i\), i.e., \(L[j] := x_i\)
- else skip the current item