Reservoir Sampling
Reservoir sampling is a family of randomized algorithms for selecting a fixed-size random sample from a data stream of unknown or unbounded length. It ensures each element has an equal probability of being selected while maintaining constant memory usage.
Understanding reservoir sampling
Reservoir sampling solves the fundamental problem of selecting a uniform random sample of k items from a data stream without knowing its total size in advance. The algorithm maintains a "reservoir" of k items and probabilistically decides whether to replace existing samples as new items arrive.
The basic algorithm works as follows:
1. Fill the reservoir with the first k items.
2. For each subsequent item i (where i > k):
   - Generate a random number j between 1 and i.
   - If j ≤ k, replace the item at position j in the reservoir with the new item.
This ensures that at any point, each item seen so far has an equal probability of being in the reservoir.
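The steps above can be sketched in Python (this is the classic Algorithm R; the function name is illustrative):

```python
import random

def reservoir_sample(stream, k):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)      # fill phase: keep the first k items
        else:
            j = random.randrange(i)     # uniform in [0, i)
            if j < k:                   # replacement happens with probability k/i
                reservoir[j] = item
    return reservoir
```

If the stream has fewer than k items, the function simply returns all of them.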
Next generation time-series database
QuestDB is an open-source time-series database optimized for market and heavy industry data. Built from scratch in Java and C++, it offers high-throughput ingestion and fast SQL queries with time-series extensions.
Mathematical foundation
The key property of reservoir sampling is that after processing n items, each item has probability exactly k/n of being in the final sample. This can be proven by induction on the number of items processed.

For the simple case of k = 1 (selecting a single random item), the i-th item enters the reservoir with probability 1/i and survives each later step j with probability (j − 1)/j, so its overall selection probability is (1/i) × (i/(i+1)) × ((i+1)/(i+2)) × … × ((n−1)/n) = 1/n.
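A minimal Python sketch of the k = 1 case (the function name is illustrative):

```python
import random

def sample_one(stream):
    """Select one item uniformly at random from an iterable of unknown length."""
    chosen = None
    for i, item in enumerate(stream, start=1):
        # Replace the current choice with probability 1/i.
        if random.randrange(i) == 0:
            chosen = item
    return chosen
```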
Applications in time-series databases
In time-series databases, reservoir sampling is particularly useful for:
- Downsampling - Creating representative samples of high-frequency data while preserving statistical properties
- Anomaly detection - Maintaining reference distributions for comparison
- Performance optimization - Reducing data volume while maintaining analytical validity
Variants and optimizations
Algorithm L
For cases where k is large, Algorithm L provides better performance by reducing the number of random numbers generated:
- Skip S items according to geometric distribution
- Replace randomly chosen reservoir item
- Repeat
The skip formula is:

S = ⌊ log(U) / log(1 − W) ⌋

where U is uniform random on (0, 1) and W = exp(log(U′)/k) for another uniform draw U′; W is updated multiplicatively (W ← W · exp(log(U″)/k)) after each replacement.
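A Python sketch of Algorithm L under these definitions (function name illustrative; skipped items are consumed but never stored):

```python
import math
import random

def reservoir_sample_l(stream, k):
    """Algorithm L: reservoir sampling with geometric skips (fewer RNG calls)."""
    it = iter(stream)
    reservoir = []
    for _ in range(k):                      # fill phase
        try:
            reservoir.append(next(it))
        except StopIteration:
            return reservoir                # stream shorter than k
    w = math.exp(math.log(random.random()) / k)
    while True:
        # Jump ahead: the next replacement happens after a geometric gap.
        skip = math.floor(math.log(random.random()) / math.log(1.0 - w))
        for _ in range(skip):
            try:
                next(it)
            except StopIteration:
                return reservoir
        try:
            item = next(it)
        except StopIteration:
            return reservoir
        reservoir[random.randrange(k)] = item
        w *= math.exp(math.log(random.random()) / k)
```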
Weighted reservoir sampling
For cases where items have different importance weights, the A-Res algorithm of Efraimidis and Spirakis assigns each item i a key kᵢ = uᵢ^(1/wᵢ), where uᵢ is uniform random on (0, 1) and wᵢ is the item's weight, and keeps the k items with the largest keys.

This variant ensures items with higher weights have proportionally higher chances of being selected.
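A Python sketch of the weighted variant (A-Res), keeping the k largest keys in a min-heap (function name illustrative):

```python
import heapq
import random

def weighted_reservoir_sample(stream, k):
    """A-Res: keep the k items with the largest keys u ** (1 / w).

    `stream` yields (item, weight) pairs with weight > 0.
    """
    heap = []                                # min-heap of (key, item)
    for item, weight in stream:
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:               # evict the smallest key
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```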
Implementation considerations
When implementing reservoir sampling in production systems:
- Thread safety - Ensure atomic updates when sampling from multiple streams
- Memory efficiency - Use appropriate data structures for the reservoir
- Random number generation - Choose high-quality generators for statistical validity
- Deterministic seeding - Consider reproducibility requirements
Best practices
- Use reservoir sampling when dealing with streaming data of unknown size
- Consider weighted variants when data points have varying importance
- Implement efficient random number generation
- Document sampling parameters for reproducibility
- Validate statistical properties of samples periodically
Related terms
- Downsampling - Common application of reservoir sampling
- Data Streaming - Context where reservoir sampling is often used
- Statistical Analysis - Framework for understanding sampling properties