Merge-on-read

SUMMARY

Merge-on-read is a data storage optimization strategy that defers the merging of base data and change data until read time, prioritizing write performance over read performance. This approach is particularly valuable in time-series databases and data lake architectures where write-heavy workloads are common.

How merge-on-read works

Merge-on-read maintains two data structures:

  1. A base data layer containing the original data
  2. A delta layer containing subsequent modifications

When a query is executed, the system merges these layers on the fly, applying the modifications in the delta layer on top of the base data to produce the current view.
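The two-layer idea can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how any particular database implements it: the class name `MergeOnReadTable`, its methods, and the use of `None` as a delete marker are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, float]


class MergeOnReadTable:
    """Toy table with a base layer and an append-only delta layer (hypothetical)."""

    def __init__(self, base: Dict[int, Row]) -> None:
        self.base: Dict[int, Row] = dict(base)               # original data, keyed by row id
        self.delta: List[Tuple[int, Optional[Row]]] = []     # modifications, in arrival order

    def upsert(self, key: int, row: Row) -> None:
        # Writes only append to the delta layer; the base is never touched.
        self.delta.append((key, row))

    def delete(self, key: int) -> None:
        # A delete is recorded as a marker (None) rather than applied to the base.
        self.delta.append((key, None))

    def read(self) -> Dict[int, Row]:
        # The merge happens here, at read time: replay the delta over a copy of the base.
        merged = dict(self.base)
        for key, row in self.delta:
            if row is None:
                merged.pop(key, None)
            else:
                merged[key] = row
        return merged


table = MergeOnReadTable({1: {"price": 10.0}, 2: {"price": 20.0}})
table.upsert(2, {"price": 21.0})   # update an existing row
table.upsert(3, {"price": 30.0})   # insert a new row
table.delete(1)                    # remove a row
current_view = table.read()
```

Every call to `read()` repeats the replay, which is the read-time cost the strategy accepts in exchange for cheap writes.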

Comparison with copy-on-write

While copy-on-write performs merging during write operations, merge-on-read shifts this cost to read time:

  • Write performance: Faster writes as changes are only recorded in the delta layer
  • Read performance: Higher latency as merging occurs during query execution
  • Storage efficiency: Updates write only the changed records instead of rewriting whole files, though delta files accumulate until they are compacted
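The trade-off above can be made concrete with a rough sketch that counts how many rows each strategy writes for the same updates. The classes `CowTable` and `MorTable` are hypothetical, and the copy-on-write side is simplified to treat the whole table as one file that is rewritten on every update; real systems typically rewrite only the affected files.

```python
from typing import Dict, List, Tuple


class CowTable:
    """Copy-on-write (simplified): every update rewrites a full copy of the data."""

    def __init__(self, rows: Dict[int, float]) -> None:
        self.rows = dict(rows)
        self.rows_written = 0

    def upsert(self, key: int, value: float) -> None:
        new_rows = dict(self.rows)            # merge happens now, at write time
        new_rows[key] = value
        self.rows = new_rows
        self.rows_written += len(new_rows)    # the whole copy is written out

    def read(self) -> Dict[int, float]:
        return self.rows                      # nothing to merge at read time


class MorTable:
    """Merge-on-read: updates are appended to a delta and merged when queried."""

    def __init__(self, rows: Dict[int, float]) -> None:
        self.base = dict(rows)
        self.delta: List[Tuple[int, float]] = []
        self.rows_written = 0

    def upsert(self, key: int, value: float) -> None:
        self.delta.append((key, value))
        self.rows_written += 1                # only the change is written

    def read(self) -> Dict[int, float]:
        merged = dict(self.base)
        merged.update(self.delta)             # later delta entries win
        return merged


initial = {i: float(i) for i in range(1000)}
cow, mor = CowTable(initial), MorTable(initial)
for table in (cow, mor):
    table.upsert(5, 99.0)      # update an existing row
    table.upsert(1000, 1.5)    # insert a new row

print(cow.rows_written, mor.rows_written)   # 2001 2
print(cow.read() == mor.read())             # True
```

Both tables return the same view; they differ only in when the merge work is paid for.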

Next generation time-series database

QuestDB is an open-source time-series database optimized for market and heavy industry data. Built from scratch in Java and C++, it offers high-throughput ingestion and fast SQL queries with time-series extensions.

Applications in time-series data

In time-series databases, merge-on-read is particularly useful for:

  1. High-frequency data ingestion where write performance is critical
  2. Late-arriving data that needs to be merged with historical records
  3. Systems with append-heavy workloads

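For example, in financial market data, trades that arrive late or out of order can be appended to the delta layer and merged with historical records at query time.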
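As a minimal sketch of the late-arriving case, the snippet below keeps a time-sorted base of trades and appends incoming ticks to a delta without reordering anything; the time ordering is only restored when the data is read. The `Trade` type, the symbol `XYZ`, and all timestamps and prices are made-up values for illustration.

```python
import heapq
from typing import List, NamedTuple


class Trade(NamedTuple):
    ts: int        # event time in epoch milliseconds (hypothetical values)
    symbol: str
    price: float


def read_in_time_order(base: List[Trade], delta: List[Trade]) -> List[Trade]:
    """Merge the time-sorted base with the unsorted delta at query time."""
    late_sorted = sorted(delta, key=lambda t: t.ts)
    return list(heapq.merge(base, late_sorted, key=lambda t: t.ts))


# Base layer: already sorted by event time.
base = [
    Trade(1000, "XYZ", 100.0),
    Trade(2000, "XYZ", 100.5),
    Trade(4000, "XYZ", 101.0),
]

# Delta layer: ingestion only appends, even when a tick arrives out of order.
delta: List[Trade] = []
delta.append(Trade(5000, "XYZ", 101.5))
delta.append(Trade(3000, "XYZ", 100.8))   # late arrival, belongs between 2000 and 4000

merged = read_in_time_order(base, delta)
print([t.ts for t in merged])   # [1000, 2000, 3000, 4000, 5000]
```

The late tick never forces a rewrite of the base; the cost of placing it correctly is paid by the reader.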

Optimization techniques

Several strategies can optimize merge-on-read performance:

  1. Compaction thresholds: Automatically merging delta files when they exceed size limits
  2. Caching: Maintaining frequently accessed merged results
  3. Parallel merging: Distributing merge operations across multiple threads
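The first two techniques can be sketched together. In this hypothetical `CompactingTable`, the delta is folded into the base once it reaches `max_delta_entries` (an entry count standing in for a file-size limit), and the merged view is cached until the next write invalidates it. Parallel merging is left out to keep the sketch short.

```python
from typing import Dict, List, Optional, Tuple


class CompactingTable:
    """Toy merge-on-read table with threshold compaction and a cached view."""

    def __init__(self, base: Dict[int, str], max_delta_entries: int = 3) -> None:
        self.base = dict(base)
        self.delta: List[Tuple[int, str]] = []
        self.max_delta_entries = max_delta_entries   # assumed threshold, in entries
        self.compactions = 0
        self._cache: Optional[Dict[int, str]] = None

    def upsert(self, key: int, value: str) -> None:
        self.delta.append((key, value))
        self._cache = None                           # any write invalidates the cached view
        if len(self.delta) >= self.max_delta_entries:
            self.compact()

    def compact(self) -> None:
        # Fold the delta into the base so later reads have nothing left to merge.
        self.base.update(self.delta)
        self.delta.clear()
        self.compactions += 1

    def read(self) -> Dict[int, str]:
        # Reuse the merged view if nothing has been written since it was built.
        # Callers are expected to treat the returned dict as read-only.
        if self._cache is None:
            merged = dict(self.base)
            merged.update(self.delta)
            self._cache = merged
        return self._cache


table = CompactingTable({1: "a"}, max_delta_entries=3)
table.upsert(1, "b")
table.upsert(2, "c")
view_before = table.read()     # merged from base plus two delta entries
table.upsert(3, "d")           # third entry reaches the threshold and triggers compaction
view_after = table.read()
```

A lower threshold keeps reads cheap at the price of more frequent compaction work; a higher one does the opposite.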

Use cases and considerations

Merge-on-read is ideal for:

  • Real-time analytics platforms
  • Event sourcing systems
  • Applications with high write volumes
  • Scenarios where read latency is less critical than write performance

Consider these factors when implementing merge-on-read:

  1. Query patterns and frequency
  2. Write-to-read ratio
  3. Storage capacity
  4. Acceptable read latency
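One rough way to reason about the write-to-read ratio is a back-of-the-envelope cost comparison. The unit costs below are invented placeholders, not measurements of any system; the point is only that the cheaper strategy flips as the workload shifts from write-heavy to read-heavy, so real numbers should come from benchmarking your own workload.

```python
def total_cost(writes: int, reads: int, write_cost: float, read_cost: float) -> float:
    """Total work for a workload, in arbitrary cost units."""
    return writes * write_cost + reads * read_cost


def cheaper_strategy(
    writes: int,
    reads: int,
    *,
    mor_write: float = 1.0,    # assumed: append one change to the delta
    mor_read: float = 5.0,     # assumed: read base and merge the delta
    cow_write: float = 20.0,   # assumed: rewrite the affected data
    cow_read: float = 1.0,     # assumed: read already-merged data
) -> str:
    mor = total_cost(writes, reads, mor_write, mor_read)
    cow = total_cost(writes, reads, cow_write, cow_read)
    return "merge-on-read" if mor < cow else "copy-on-write"


print(cheaper_strategy(writes=1000, reads=100))     # merge-on-read
print(cheaper_strategy(writes=100, reads=10000))    # copy-on-write
```

With these placeholder costs, merge-on-read wins whenever there are more than roughly four writes for every nineteen reads.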

The strategy works well with data lake architectures and modern table formats that support versioning and time travel capabilities.
