Posting index and covering index

The posting index is an advanced index type for symbol columns that provides better compression, faster reads, and covering index support compared to the default bitmap index.

A covering index stores additional column values alongside the index entries, so queries that only need those columns can be answered entirely from the index without reading the main column files.

When to use the posting index

Use the posting index when:

  • You frequently filter on a symbol column (WHERE symbol = 'X')
  • Your queries select a small set of columns alongside the symbol filter
  • You want to reduce I/O by reading from compact sidecar files instead of full column files
  • You need efficient DISTINCT queries on a symbol column
  • You need efficient LATEST ON queries partitioned by a symbol column

The posting index is especially effective for high-cardinality symbol columns (hundreds to thousands of distinct values) and wide tables where reading full column files is expensive.

Creating a posting index

At table creation

Inline syntax (index defined alongside the column):

CREATE TABLE trades (
timestamp TIMESTAMP,
symbol SYMBOL INDEX TYPE POSTING,
exchange SYMBOL,
price DOUBLE,
quantity DOUBLE
) TIMESTAMP(timestamp) PARTITION BY DAY WAL;

Out-of-line syntax (index defined separately):

CREATE TABLE trades (
timestamp TIMESTAMP,
symbol SYMBOL,
exchange SYMBOL,
price DOUBLE,
quantity DOUBLE
), INDEX(symbol TYPE POSTING)
TIMESTAMP(timestamp) PARTITION BY DAY WAL;

With covering columns (INCLUDE)

CREATE TABLE trades (
timestamp TIMESTAMP,
symbol SYMBOL INDEX TYPE POSTING INCLUDE (exchange, price),
exchange SYMBOL,
price DOUBLE,
quantity DOUBLE
) TIMESTAMP(timestamp) PARTITION BY DAY WAL;

The INCLUDE clause specifies which columns are stored in the index sidecar files. Queries that only read these columns plus the indexed symbol column can be served entirely from the index.

tip

When you supply an INCLUDE clause, the designated timestamp is automatically appended to it if you did not already list it — you do not need to type it explicitly. INCLUDE (exchange, price) renders back as INCLUDE (exchange, price, timestamp) in SHOW CREATE TABLE after creation. Controlled by the cairo.posting.index.auto.include.timestamp server property (default true).

This auto-append only applies when an INCLUDE clause is given and the table has a designated timestamp. A bare INDEX TYPE POSTING (no INCLUDE) has no covering layer at all — SELECT timestamp, sym FROM t WHERE sym = 'X' reads the timestamp from the column file in that case. Tables without a designated timestamp (typically BYPASS WAL tables that omit the TIMESTAMP(...) clause) still work normally with posting indexes and INCLUDE; the auto-append simply has nothing to add.

note

The INCLUDE clause is only supported with inline column syntax in CREATE TABLE and with ALTER TABLE. The out-of-line INDEX(col TYPE POSTING) syntax and ALTER MATERIALIZED VIEW both reject INCLUDE.

Writing INDEX INCLUDE (...) (no explicit TYPE) is also accepted and implicitly creates a posting index — INCLUDE is only valid with POSTING, so the parser promotes the type for you.

On an existing table

ALTER TABLE trades
ALTER COLUMN symbol ADD INDEX TYPE POSTING INCLUDE (exchange, price);

Encoding options

The posting index supports three row ID encoding options with different compression and query performance characteristics:

SyntaxEncodingNotes
INDEX TYPE POSTINGAdaptive (default)Trials delta + Frame-of-Reference and Elias-Fano per key per stride and keeps the smaller output
INDEX TYPE POSTING EFElias-Fano onlyForces Elias-Fano even when delta + FoR would be smaller — useful for benchmarking
INDEX TYPE POSTING DELTADelta + Frame-of-Reference onlyForces delta + FoR even when Elias-Fano would be smaller — useful for benchmarking

Delta + Frame-of-Reference encoding stores each key's row IDs as per-key deltas, split into blocks of 64 with per-block Frame-of-Reference bitpacking. Round-robin or periodic distributions produce constant deltas (bitwidth 0), so this mode compresses them to near-zero. The trade-off is a per-key block-header overhead that hurts low-cardinality keys.

Elias-Fano (EF) encoding is a classic monotonic-sequence encoding: each key's sorted row IDs are split into low and high bit halves, with the high half stored as a unary-coded bit array and the low half as a fixed-width packed array. This typically produces denser output for keys with few values per stride and avoids the block-header overhead.

The adaptive (default) encoding trial-encodes each key with both delta + Frame-of-Reference and Elias-Fano per stride and picks whichever produces the smaller output. This is the right choice for almost all workloads — the explicit DELTA / EF variants exist mainly for benchmarking.

-- Default adaptive encoding (recommended for most workloads)
CREATE TABLE t1 (ts TIMESTAMP, s SYMBOL INDEX TYPE POSTING)
TIMESTAMP(ts) PARTITION BY DAY WAL;

-- Force Elias-Fano only (benchmarking)
CREATE TABLE t2 (ts TIMESTAMP, s SYMBOL INDEX TYPE POSTING EF)
TIMESTAMP(ts) PARTITION BY DAY WAL;

-- Force delta + Frame-of-Reference only (benchmarking)
CREATE TABLE t3 (ts TIMESTAMP, s SYMBOL INDEX TYPE POSTING DELTA)
TIMESTAMP(ts) PARTITION BY DAY WAL;

Covering index

The covering index is the most powerful feature of the posting index. When all columns in a query's SELECT list are either:

  • The indexed symbol column itself (from the WHERE clause)
  • Listed in the INCLUDE clause

...the query engine reads data directly from the index sidecar files, bypassing the main column files entirely. This is significantly faster for selective queries on wide tables.

caution

The async group-by and filter code paths through the covering index are currently slower than the regular plan in some workloads. A follow-up release will close this gap, and the optimizer will continue to improve as more feedback comes in.

If you notice a query slowdown after EXPLAIN shows it has started picking the covering path, opt that query out with /*+ no_covering */ or /*+ no_index */ while the optimizations land.

Supported column types in INCLUDE

All column types except the indexed symbol column itself can be included:

TypeCompressionNotes
BOOLEAN, BYTE, GEOBYTE, DECIMAL8Frame-of-Reference bitpacking≤1 byte per value (worst case)
SHORT, CHAR, GEOSHORT, DECIMAL16Frame-of-Reference bitpacking≤2 bytes per value
INT, IPv4, GEOINT, DECIMAL32Frame-of-Reference bitpacking≤4 bytes per value
FLOATALP (Adaptive Lossless floating-Point)Lossless float compression
LONG, DATE, GEOLONG, DECIMAL64Frame-of-Reference bitpacking≤8 bytes per value
TIMESTAMPLinear-prediction + Frame-of-ReferenceDesigned for monotonic timestamps
DOUBLEALP (Adaptive Lossless floating-Point)Lossless float compression
SYMBOLFrame-of-Reference bitpackingStored as integer key, resolved at query time
UUID, DECIMAL128Raw copy16 bytes per value
LONG256, DECIMAL256Raw copy32 bytes per value
VARCHAR, STRINGFSST compressed (≥4 KB strides)Typically 2–5× compression on repetitive text
BINARYLength-prefixed raw bytesVariable-width, no compression
Arrays (DOUBLE[], INT[], etc.)Length-prefixed raw bytesVariable-width, no compression

How to choose INCLUDE columns

Include columns that you frequently select together with the indexed symbol:

-- If your typical queries look like this:
SELECT timestamp, price, quantity FROM trades WHERE symbol = 'AAPL';

-- Then include those columns (timestamp is auto-included as designated timestamp):
CREATE TABLE trades (
timestamp TIMESTAMP,
symbol SYMBOL INDEX TYPE POSTING INCLUDE (price, quantity),
exchange SYMBOL,
price DOUBLE,
quantity DOUBLE,
-- other columns not needed in hot queries
raw_data VARCHAR,
metadata VARCHAR
) TIMESTAMP(timestamp) PARTITION BY DAY WAL;
tip

Only include columns that appear in your most frequent queries. Each included column adds storage overhead and slows down writes slightly. Columns not in the INCLUDE list can still be queried — they just won't benefit from the covering optimization and will be read from column files.

Inspecting indexes with SHOW COLUMNS

SHOW COLUMNS reports the index type and covered column list per column:

  • indexType is POSTING, POSTING DELTA, POSTING EF, BITMAP, or empty for non-indexed columns.
  • indexInclude lists covered column names — including the auto-included designated timestamp.

The table_columns() function exposes the same fields programmatically.

Verifying covering index usage

Use EXPLAIN to verify that a query uses the covering index:

EXPLAIN SELECT timestamp, price FROM trades WHERE symbol = 'AAPL';

If the covering index is used, the plan shows CoveringIndex:

SelectedRecord
CoveringIndex on: symbol with: timestamp, price
filter: symbol='AAPL'

IN-list filters render as filter: symbol IN ['AAPL','GOOGL','MSFT']. LATEST ON queries that hit the covering path show an op: latest annotation and have no SelectedRecord wrapper:

CoveringIndex op: latest on: symbol with: timestamp, price
filter: symbol='AAPL'

SELECT DISTINCT does not need to read covered values, so it shows up as PostingIndex op: distinct rather than CoveringIndex:

PostingIndex op: distinct on: symbol
Frame forward scan on: trades

When you add a filter on a covered column, an Async Filter is layered above the covering index — the predicate values are read from the sidecar, not the column file:

SelectedRecord
Async Filter workers: N
filter: 100<price
CoveringIndex on: symbol with: price
filter: symbol='AAPL'

If you see DeferredSingleSymbolFilterPageFrame or PageFrame instead, the query is reading from column files. This happens when the SELECT list includes columns not in the INCLUDE list, or when the WHERE clause doesn't filter on the indexed symbol.

Comparison with bitmap index

For a side-by-side feature comparison and guidance on choosing between the two index types, see Choosing an index type on the indexes overview page.

In end-to-end benchmarks (geomean across five workloads, sealed indexes), the posting index is roughly 13× smaller than the bitmap index and 1.3–1.5× faster on point, range, and full-scan reads. Writes are ~9% slower than the bitmap index for the index path itself; sidecar writes add overhead proportional to the number and type of INCLUDE columns.

Query patterns accelerated

Point queries (WHERE symbol = 'X')

-- Reads from sidecar if price is in INCLUDE
SELECT price FROM trades WHERE symbol = 'AAPL';

Point queries with additional filters

If the additional filter columns are also in INCLUDE, the covering index streams matching rows and an Async Filter applies the extra predicate on top — the predicate values are read from the sidecar, not the column file:

-- Covering index + filter on covered column
SELECT price FROM trades WHERE symbol = 'AAPL' AND price > 100;

IN-list queries

-- Multiple keys, still uses covering index
SELECT price FROM trades WHERE symbol IN ('AAPL', 'GOOGL', 'MSFT');

LATEST ON queries

-- Latest row per symbol, reads from sidecar
SELECT timestamp, symbol, price
FROM trades
WHERE symbol = 'AAPL'
LATEST ON timestamp PARTITION BY symbol;

DISTINCT queries

-- Enumerates keys from index metadata, O(keys x partitions) instead of full scan
SELECT DISTINCT symbol FROM trades;

-- Also works with timestamp filters
SELECT DISTINCT symbol FROM trades WHERE timestamp > '2024-01-01';

COUNT queries

-- Plan: Count over CoveringIndex, no column data read
SELECT COUNT(*) FROM trades WHERE symbol = 'AAPL';

Aggregate queries on covered columns

-- Aggregates over a covered column read from the sidecar instead of
-- the column file
SELECT count(*), min(price), max(price)
FROM trades
WHERE symbol = 'AAPL';

SQL optimizer hints

Two hints opt a query out of the covering and/or index paths for benchmarking or troubleshooting:

  • no_covering — read from column files instead of the covering sidecar
  • no_index — disable index usage entirely (implies no_covering)

See Index hints for syntax, semantics, and examples.

Trade-offs

Storage

The posting index itself is very compact (~1 byte per indexed value, vs. ~15 bytes per value for the bitmap index). The covering sidecar adds storage proportional to the included columns:

  • DOUBLE, FLOAT: ALP (Adaptive Lossless floating-Point), backed by Frame-of-Reference bitpacking with an exception list for outliers.
  • TIMESTAMP: linear-prediction header with Frame-of-Reference residual bitpacking — designed for monotonic timestamp data.
  • Other fixed-width integer types (BOOLEAN, BYTE, SHORT, CHAR, INT, LONG, DATE, IPv4, GEO*, DECIMAL8–DECIMAL64, SYMBOL keys): Frame-of-Reference bitpacking sized to the column's natural width, so the worst case is the column-file byte size and typical case is much smaller.
  • UUID, LONG256, DECIMAL128, DECIMAL256: stored raw at full width with a small count header.
  • VARCHAR, STRING: FSST-compressed once a stride exceeds 4 KB of raw data; typically 2–5× smaller than the column file.
  • BINARY and arrays: length-prefixed raw bytes (no compression).

Write performance

Write overhead depends on the number and type of INCLUDE columns:

  • Posting index without INCLUDE: ~9% slower than the bitmap index for the index path itself (delta + Frame-of-Reference encoding vs. simple append).
  • Posting index with fixed-width INCLUDE: additional sidecar write cost proportional to the number of columns; values are batched and compressed at seal time.
  • Posting index with VARCHAR / STRING / BINARY / ARRAY INCLUDE: pays the full variable-width copy cost per row plus an FSST symbol-table rebuild per seal for VARCHAR / STRING.

Query performance improvements typically far outweigh the write cost for read-heavy workloads.

Memory

The posting index uses native memory for encoding/decoding buffers. Each FSST-compressed VARCHAR or STRING column carries a ~2.3 KB symbol table that is loaded alongside the sidecar at read time and easily fits in L1 cache; per-reader decompression buffers are also small.

Architecture

The posting index stores data in three file types per partition:

  • .pk — Key file: double-buffered metadata pages with the per-key generation directory; readers see consistent snapshots via a seqlock protocol.
  • .pv — Value file: row IDs encoded as either delta + Frame-of-Reference bitpacking or Elias-Fano (depending on the index's encoding variant), organised into stride-indexed generations.
  • .pci + .pc0, .pc1, … — Sidecar files: covered column values stored alongside the posting list. The single .pci header lists the covered columns by writer index (PCI1 magic, plus the coverCount used by readers to size their sidecar mappings). Each .pcN (with txn-segment suffix on disk, e.g. s.pc0.0.0) holds the encoded data for one INCLUDE column. The auto-included designated timestamp counts as one of the covered columns and gets its own .pcN file.

Generations and sealing

Data is written incrementally as generations (one per commit). Each generation contains a sparse block of key→rowID mappings. Periodically, generations are sealed into a single dense generation with stride-indexed layout for optimal read performance.

Sealing happens automatically in two cases:

  • When a partition is closed — retired by the next partition becoming active — it is compacted if it carries more than cairo.posting.seal.gen.threshold (default 16) unsealed generations.
  • While a partition is active, WAL fast-lag commits append a new sparse generation in place rather than re-sealing. The active partition can therefore carry many more generations than the threshold; an inline seal is forced only when an internal generation cap is reached.

In practice the 16-generation threshold mostly governs partition retirement.

Sealed data is written stride-by-stride (256 keys per stride). Within the delta + Frame-of-Reference family, the writer trial-encodes each stride in two sub-layouts and keeps whichever produces fewer bytes:

  • Delta sub-layout — per-key delta encoding, then per-block Frame-of-Reference bitpacking. Wins when there are roughly ten or more values per key, where the delta distribution lets each block use a small bitwidth.
  • Flat sub-layout — stride-wide Frame-of-Reference with a single base and bitwidth, plus a prefix-count array for per-key slicing. Wins when keys are sparse (roughly eight or fewer values per key) by eliminating per-key metadata.

These are internal to delta + Frame-of-Reference and are independent of the SQL DELTA / EF encoding variants described above. When the resulting bitwidth is 8, 16, or 32, decoding uses a native AVX2 fast path; other bit widths fall back to a Java decoder.

FSST compression for strings

VARCHAR and STRING columns in the INCLUDE list are compressed using FSST (Fast Static Symbol Table) compression during sealing once a stride exceeds 4 KB of raw data. FSST replaces frequently occurring 1–8 byte patterns with single-byte codes, typically achieving 2–5× compression on string data with repetitive patterns. The 2.3 KB symbol table fits in L1 cache and gives stateless O(1) per-value decode.

The FSST symbol table is trained per seal and stored inline in the sidecar file. Decompression is transparent to the query engine.

Limitations

warning
  • INCLUDE cannot list the indexed symbol column itself.
  • INCLUDE is not supported with out-of-line INDEX(col ...) syntax — use inline column syntax or ALTER TABLE instead.
  • CAPACITY is not supported for posting indexes (bitmap only).
  • The covering path engages only when the query filters on the indexed symbol (single key, IN-list, or bind variable). Queries without such a filter — including unfiltered LATEST ON … PARTITION BY sym, unfiltered SAMPLE BY, and unfiltered GROUP BY — fall back to a regular page-frame scan.
  • REINDEX on WAL tables requires dropping and re-adding the index (this applies to all index types, not just posting).
  • ALTER TABLE … ADD INDEX TYPE POSTING INCLUDE (col, …) can fail on very large partitions when at least one INCLUDE column is variable-width (STRING, VARCHAR, BINARY) and the seal phase would exceed RSS_MEM_LIMIT. Fixed-width INCLUDE columns stream through transparently. Remedies: drop the variable-width column from the INCLUDE list, reduce partition size, or raise RSS_MEM_LIMIT. See cairo.posting.index.indexer.spill.bytes.max.