Cardinality Estimation

Summary

Cardinality estimation is a technique used to approximate the number of distinct values in a dataset without storing every unique value in memory. In time-series databases, accurate cardinality estimates are crucial for query optimization, resource allocation, and understanding data patterns while maintaining system performance.

Understanding cardinality estimation

Cardinality estimation addresses a fundamental challenge in database systems: determining how many unique values exist in a dataset without exhaustively counting them. This is particularly important in time-series databases where datasets can be massive and continuous.

For example, in a financial trading system monitoring stock transactions, you might need to estimate:

  • Number of unique traders per day
  • Distinct symbols traded in a time window
  • Unique price levels observed

Exact counting would require storing every value in memory, which becomes impractical at scale. Cardinality estimation algorithms provide efficient approximations with controlled error rates.
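
To make the cost of exact counting concrete, here is a minimal Python sketch that counts distinct trader IDs with a plain hash set; the IDs are synthetic stand-ins for a real event stream. The set must retain every unique value, so its memory grows linearly with cardinality:

import sys

# Exact distinct counting: every unique value must be retained in memory.
seen = set()
for i in range(1_000_000):
    seen.add(f"trader-{i}")      # synthetic trader IDs

print(len(seen))                 # 1000000 -- exact answer
print(sys.getsizeof(seen))       # tens of MB for the set's table alone,
                                 # not counting the string objects themselves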

How cardinality estimation works

Most modern cardinality estimation techniques use probabilistic data structures that maintain a compact representation of the observed values. Popular algorithms include:

  1. HyperLogLog: Estimates cardinality from the longest runs of leading zeros observed in hashed values
  2. Linear Counting: Hashes values into a bitmap and estimates cardinality from the fraction of unset bits
  3. LogLog Counting: An earlier relative of HyperLogLog that also uses logarithmic space, but with lower accuracy for the same memory

These algorithms trade perfect accuracy for dramatic improvements in memory efficiency.
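
As a concrete illustration, below is a minimal Python sketch of HyperLogLog. It follows the structure of the original algorithm (hash each value, use the first p bits to pick a register, record the longest run of leading zeros per register, then combine registers with a harmonic mean), but it omits the small-range and large-range corrections from the published version, and the class and parameter names are illustrative rather than taken from any library:

import hashlib

class HyperLogLog:
    """Minimal HyperLogLog sketch (bias corrections omitted)."""

    def __init__(self, p: int = 12):
        self.p = p                       # precision: 2^p registers
        self.m = 1 << p
        self.registers = [0] * self.m
        # Constant from Flajolet et al., valid for m >= 128
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value: str) -> None:
        # Derive a 64-bit hash of the value
        h = int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self) -> float:
        # Harmonic mean of 2^register values, scaled by alpha * m^2
        z = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / z

hll = HyperLogLog(p=12)                  # 4096 registers, ~1.6% typical error
for i in range(100_000):
    hll.add(f"trader-{i}")
print(round(hll.estimate()))             # close to 100,000 from a few KB of state

Production implementations add the range corrections and bit-pack the registers, but the core idea is the same.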

Applications in time-series databases

Cardinality estimation serves several critical functions in time-series databases:

Query optimization

  • Helps the query planner choose efficient execution strategies
  • Estimates result set sizes for join operations
  • Optimizes memory allocation for aggregations

Resource management

  • Prevents unexpected memory spikes from high-cardinality operations
  • Guides partitioning decisions for better data distribution
  • Supports capacity planning and scaling decisions

Performance monitoring

Monitoring cardinality directly helps catch performance risks before they become problems. For example, the following query reports how many distinct price levels each symbol has traded over the last hour:

SELECT
    symbol,
    count(DISTINCT price) AS unique_prices,
    count(*) AS total_trades
FROM trades
WHERE timestamp > dateadd('h', -1, now())
GROUP BY symbol;

Considerations and tradeoffs

When implementing cardinality estimation, several factors need consideration:

Accuracy vs. memory usage

  • Higher accuracy requires more memory
  • Error bounds are probabilistic and typically scale inversely with the square root of the sketch size
  • Most algorithms offer configurable precision/memory tradeoffs, as illustrated in the sketch below
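
For HyperLogLog-style sketches the relationship is explicit: with m = 2^p registers, the typical relative error is about 1.04 / sqrt(m). A quick back-of-the-envelope sketch, assuming roughly one byte per register (packed implementations use less):

import math

# Typical HyperLogLog relative error is ~1.04 / sqrt(m), with m = 2^p registers.
for p in (10, 12, 14, 16):
    m = 1 << p
    error = 1.04 / math.sqrt(m)
    print(f"p={p:2d}  registers={m:6d}  memory~{m // 1024:3d} KB  error~{error:.2%}")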

Update performance

  • Some algorithms support incremental updates
  • Others require periodic rebuilding
  • Write-heavy workloads may need specialized approaches

Scale requirements

  • Small datasets might not benefit from estimation
  • Very large datasets require careful algorithm selection
  • Distributed systems need consistent estimation across nodes (see the merge sketch below)
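
For the distributed case in particular, HyperLogLog-style sketches have a useful property: two sketches built with the same precision merge losslessly by taking the element-wise maximum of their registers. A minimal sketch of that idea (the function name is illustrative, not from any particular library):

def merge_registers(a: list[int], b: list[int]) -> list[int]:
    # The merged registers are identical to those of a sketch built from
    # the union of both input streams, so each node can sketch locally
    # and ship only its registers to an aggregator.
    if len(a) != len(b):
        raise ValueError("sketches must use the same precision")
    return [max(x, y) for x, y in zip(a, b)]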

Best practices

  1. Choose appropriate error bounds for your use case
  2. Monitor estimation accuracy over time
  3. Consider using different algorithms for different cardinality ranges
  4. Test with realistic data volumes and patterns
  5. Account for estimation errors in dependent systems

Remember that cardinality estimation is about finding the right balance between accuracy and resource usage for your specific requirements.
