Reservoir Sampling

SUMMARY

Reservoir sampling is a family of randomized algorithms for selecting a fixed-size random sample from a data stream of unknown or unbounded length. It ensures each element has an equal probability of being selected while maintaining constant memory usage.

Understanding reservoir sampling

Reservoir sampling solves the fundamental problem of selecting a uniform random sample of k items from a data stream without knowing its total size in advance. The algorithm maintains a "reservoir" of k items and probabilistically decides whether to replace existing samples as new items arrive.

The basic algorithm works as follows:

1. Fill the reservoir with the first k items
2. For each subsequent item i (where i > k):
   - Generate a random number j uniformly between 1 and i
   - If j ≤ k, replace the item at position j in the reservoir with the new item

This ensures that at any point, each item seen so far has an equal probability of being in the reservoir.
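The steps above can be sketched in Python. This is a minimal illustration of the classic "Algorithm R" formulation rather than a production implementation, and `reservoir_sample` is a hypothetical helper name:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Uniform random sample of k items from an iterable of unknown length.

    Classic Algorithm R: fill the reservoir with the first k items, then
    let each later item i displace a random slot with probability k/i.
    """
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # Step 1: the first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Step 2: draw j uniformly from 1..i; the new item survives
            # with probability k/i, evicting the item in slot j.
            j = rng.randint(1, i)
            if j <= k:
                reservoir[j - 1] = item
    return reservoir
```

For example, `reservoir_sample(range(1_000_000), 5)` returns five items, and each of the million items had the same 5/1,000,000 chance of ending up in the sample.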

Next generation time-series database

QuestDB is an open-source time-series database optimized for market and heavy industry data. Built from scratch in Java and C++, it offers high-throughput ingestion and fast SQL queries with time-series extensions.

Mathematical foundation

The key property of reservoir sampling is that after processing n items, each item has exactly probability k/n of being in the final sample:

P(\text{item}_i \text{ in reservoir}) = \frac{k}{n}

For the simple case of k=1 (selecting a single random item):

P(\text{selecting item}_i) = \frac{1}{i} \prod_{j=i+1}^{n} \left(1 - \frac{1}{j}\right) = \frac{1}{n}
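The 1/n result for k = 1 can be checked empirically. The sketch below (a hypothetical simulation, not from the original text) keeps a single item, replacing the current choice with item i with probability 1/i, and tallies which item survives over many trials:

```python
import random
from collections import Counter

def sample_one(stream, rng=random):
    """k = 1 reservoir sampling: item i replaces the choice with probability 1/i."""
    choice = None
    for i, item in enumerate(stream, start=1):
        if rng.random() < 1.0 / i:
            choice = item
    return choice

# Each of the 10 items should survive in roughly 1/10 of the trials.
counts = Counter(sample_one(range(10)) for _ in range(100_000))
```

Each count comes out near 10,000, matching the uniform 1/n probability derived above.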


Applications in time-series databases

In time-series databases, reservoir sampling is particularly useful for:

  1. Downsampling - Creating representative samples of high-frequency data while preserving statistical properties
  2. Anomaly detection - Maintaining reference distributions for comparison
  3. Performance optimization - Reducing data volume while maintaining analytical validity

Variants and optimizations

Algorithm L

For cases where k is large, Algorithm L provides better performance by reducing the number of random numbers generated:

  1. Skip S items according to geometric distribution
  2. Replace randomly chosen reservoir item
  3. Repeat

The skip formula is:

S = \left\lfloor \frac{\log U}{\log\left(1 - \frac{k}{n}\right)} \right\rfloor

where U is a uniform random number in (0, 1).
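The skip-based loop can be sketched as follows. This follows Li's original formulation, which drives the geometric skips with a running weight `w` rather than plugging in k/n directly; `algorithm_l` is a hypothetical name for the helper:

```python
import math
import random

def algorithm_l(stream, k, rng=random):
    """Reservoir sampling with geometric skips (sketch of Li's Algorithm L).

    Rather than drawing one random number per item, it computes how many
    items to skip before the next replacement, which is far cheaper when
    n is large relative to k.
    """
    it = iter(stream)
    reservoir = [x for _, x in zip(range(k), it)]
    if len(reservoir) < k:
        return reservoir  # stream had fewer than k items
    sentinel = object()
    w = math.exp(math.log(rng.random()) / k)
    while True:
        # Geometric skip: number of items to pass over before the next swap.
        skip = math.floor(math.log(rng.random()) / math.log(1.0 - w))
        for _ in range(skip):
            if next(it, sentinel) is sentinel:
                return reservoir
        item = next(it, sentinel)
        if item is sentinel:
            return reservoir
        # Replace a uniformly chosen reservoir slot and update the weight.
        reservoir[rng.randrange(k)] = item
        w *= math.exp(math.log(rng.random()) / k)
```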

Weighted reservoir sampling

For cases where items have different importance weights:

P(\text{selection}_i) \propto w_i

This variant ensures items with higher weights have proportionally higher chances of being selected.
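One common realization is the Efraimidis-Spirakis "A-Res" scheme: each item receives the key u^(1/w) for a uniform random u, and the k items with the largest keys form the sample. A sketch, with `weighted_reservoir` as a hypothetical helper name:

```python
import heapq
import random

def weighted_reservoir(stream, k, rng=random):
    """Weighted reservoir sampling (Efraimidis-Spirakis A-Res sketch).

    Each (item, weight) pair receives the key u**(1/weight) with u uniform
    in (0, 1); keeping the k largest keys biases selection toward heavier
    items, so selection probability increases with weight.
    """
    heap = []  # min-heap of (key, item); the smallest key is evicted first
    for item, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```

With k = 1 this reduces to selecting item i with probability w_i / Σ w_j.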


Implementation considerations

When implementing reservoir sampling in production systems:

  1. Thread safety - Ensure atomic updates when sampling from multiple streams
  2. Memory efficiency - Use appropriate data structures for the reservoir
  3. Random number generation - Choose high-quality generators for statistical validity
  4. Deterministic seeding - Consider reproducibility requirements

Best practices

  • Use reservoir sampling when dealing with streaming data of unknown size
  • Consider weighted variants when data points have varying importance
  • Implement efficient random number generation
  • Document sampling parameters for reproducibility
  • Validate statistical properties of samples periodically