Reservoir Sampling

SUMMARY

Reservoir sampling is a family of randomized algorithms for selecting a fixed-size random sample from a data stream of unknown or unbounded length. It ensures each element has an equal probability of being selected while maintaining constant memory usage.

Understanding reservoir sampling

Reservoir sampling solves the fundamental problem of selecting a uniform random sample of k items from a data stream without knowing its total size in advance. The algorithm maintains a "reservoir" of k items and probabilistically decides whether to replace existing samples as new items arrive.

The basic algorithm works as follows:

1. Fill the reservoir with the first k items
2. For each subsequent item i (where i > k):
   - Generate a random number j uniformly between 1 and i
   - If j ≤ k, replace the item at position j in the reservoir with the new item

This ensures that at any point, each item seen so far has an equal probability of being in the reservoir.
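The steps above can be sketched in Python. This is a minimal illustration of the classic "Algorithm R" formulation rather than a production implementation, and `reservoir_sample` is a hypothetical helper name:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Uniform random sample of k items from an iterable of unknown length.

    Classic Algorithm R: fill the reservoir with the first k items, then
    let each later item i displace a random slot with probability k/i.
    """
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # Step 1: the first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Step 2: draw j uniformly from 1..i; the new item survives
            # with probability k/i, evicting the item in slot j.
            j = rng.randint(1, i)
            if j <= k:
                reservoir[j - 1] = item
    return reservoir
```

For example, `reservoir_sample(range(1_000_000), 5)` returns five items, and each of the million items had the same 5/1,000,000 chance of ending up in the sample.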

Next generation time-series database

QuestDB is an open-source time-series database optimized for market and heavy industry data. Built from scratch in Java and C++, it offers high-throughput ingestion and fast SQL queries with time-series extensions.

Mathematical foundation

The key property of reservoir sampling is that after processing n items, each item has exactly probability k/n of being in the final sample:

P(\text{item}_i \text{ in reservoir}) = \frac{k}{n}

For the simple case of k=1 (selecting a single random item):

P(\text{selecting item}_i) = \frac{1}{i} \prod_{j=i+1}^{n} \left(1 - \frac{1}{j}\right) = \frac{1}{n}
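The 1/n result for k = 1 can be checked empirically. The sketch below (a hypothetical simulation, not from the original text) keeps a single item, replacing the current choice with item i with probability 1/i, and tallies which item survives over many trials:

```python
import random
from collections import Counter

def sample_one(stream, rng=random):
    """k = 1 reservoir sampling: item i replaces the choice with probability 1/i."""
    choice = None
    for i, item in enumerate(stream, start=1):
        if rng.random() < 1.0 / i:
            choice = item
    return choice

# Each of the 10 items should survive in roughly 1/10 of the trials.
counts = Counter(sample_one(range(10)) for _ in range(100_000))
```

Each count comes out near 10,000, matching the uniform 1/n probability derived above.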


Applications in time-series databases

In time-series databases, reservoir sampling is particularly useful for:

  1. Downsampling - Creating representative samples of high-frequency data while preserving statistical properties
  2. Anomaly detection - Maintaining reference distributions for comparison
  3. Performance optimization - Reducing data volume while maintaining analytical validity

Variants and optimizations

Algorithm L

For cases where k is large, Algorithm L provides better performance by reducing the number of random numbers generated:

  1. Skip S items according to geometric distribution
  2. Replace randomly chosen reservoir item
  3. Repeat

The skip formula is:

S = \left\lfloor \frac{\log U}{\log\left(1 - \frac{k}{n}\right)} \right\rfloor

where U is a uniform random number in (0, 1).
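The skip-based loop can be sketched as follows. This follows Li's original formulation, which drives the geometric skips with a running weight `w` rather than plugging in k/n directly; `algorithm_l` is a hypothetical name for the helper:

```python
import math
import random

def algorithm_l(stream, k, rng=random):
    """Reservoir sampling with geometric skips (sketch of Li's Algorithm L).

    Rather than drawing one random number per item, it computes how many
    items to skip before the next replacement, which is far cheaper when
    n is large relative to k.
    """
    it = iter(stream)
    reservoir = [x for _, x in zip(range(k), it)]
    if len(reservoir) < k:
        return reservoir  # stream had fewer than k items
    sentinel = object()
    w = math.exp(math.log(rng.random()) / k)
    while True:
        # Geometric skip: number of items to pass over before the next swap.
        skip = math.floor(math.log(rng.random()) / math.log(1.0 - w))
        for _ in range(skip):
            if next(it, sentinel) is sentinel:
                return reservoir
        item = next(it, sentinel)
        if item is sentinel:
            return reservoir
        # Replace a uniformly chosen reservoir slot and update the weight.
        reservoir[rng.randrange(k)] = item
        w *= math.exp(math.log(rng.random()) / k)
```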

Weighted reservoir sampling

For cases where items have different importance weights:

P(\text{selection}_i) \propto w_i

This variant ensures items with higher weights have proportionally higher chances of being selected.
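One common realization is the Efraimidis-Spirakis "A-Res" scheme: each item receives the key u^(1/w) for a uniform random u, and the k items with the largest keys form the sample. A sketch, with `weighted_reservoir` as a hypothetical helper name:

```python
import heapq
import random

def weighted_reservoir(stream, k, rng=random):
    """Weighted reservoir sampling (Efraimidis-Spirakis A-Res sketch).

    Each (item, weight) pair receives the key u**(1/weight) with u uniform
    in (0, 1); keeping the k largest keys biases selection toward heavier
    items, so selection probability increases with weight.
    """
    heap = []  # min-heap of (key, item); the smallest key is evicted first
    for item, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```

With k = 1 this reduces to selecting item i with probability w_i / Σ w_j.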


Implementation considerations

When implementing reservoir sampling in production systems:

  1. Thread safety - Ensure atomic updates when sampling from multiple streams
  2. Memory efficiency - Use appropriate data structures for the reservoir
  3. Random number generation - Choose high-quality generators for statistical validity
  4. Deterministic seeding - Consider reproducibility requirements

Best practices

  • Use reservoir sampling when dealing with streaming data of unknown size
  • Consider weighted variants when data points have varying importance
  • Implement efficient random number generation
  • Document sampling parameters for reproducibility
  • Validate statistical properties of samples periodically