QuestDB Enterprise 3.3.2 is a feature release on the QuestDB 9.4.x engine. The headline addition is support for Parquet as a full table format, letting an entire table — not just individual squashed partitions — live in Parquet. The release also ships native ARM64 (aarch64) Linux binaries, adopts the simplified single-version QWP wire protocol, hardens replication and backup against transient object-store failures and in-place Parquet regeneration, and pulls in a large batch of Parquet, posting-index, SQL-correctness and performance fixes from the OSS engine.
New Features
This feature replaces the previous fixed 8-slot entry-count cache for decoded Parquet row groups with a byte-budgeted LRU cache, controlled by the new cairo.sql.parquet.cache.memory.size configuration property (default 256 MB). Each factory now declares whether it walks the base cursor monotonically or in scattered order via a ParquetDecodeHint. Scattered-access paths (sorts, hash joins, cached windows) receive the full configured budget, while monotonic paths (ASOF/LT/SPLICE joins, LATEST BY, scalar subqueries) cap at a quarter of the budget and at most 4 decoded row groups. The previous cairo.sql.parquet.frame.cache.capacity property is deprecated and no longer affects behavior. The cache uses an intrusive doubly-linked list with an IntObjHashMap index for O(1) hit lookup and LRU promotion, and performs in-place victim reuse on eviction to avoid free/munmap syscalls on the hot path. Worst-case RSS equals effectiveBudget × concurrent pools, and parallel joins open one decode pool per worker, each hinted monotonic. On a 502M-row trades table converted to Parquet, ORDER BY queries with LIMIT showed 12–31% wall-time improvements on scattered-access paths.

This feature logs an INFO line at a configurable interval with raw accounted memory, physical RSS, JVM heap usage, allocator counters, and the top 10 non-zero memory tags by absolute value. It reads Unsafe and Os memory accounting directly, so it works independently of metrics.enabled. It is controlled by two hot-reloadable properties: memory.usage.log.enabled (default true) and memory.usage.log.interval (default 60s, max 24h). Example log line:
memory usage [mem.accounted=821301126, mem.rss.accounted=610057798, mem.non.rss.accounted=211243328, mem.rss.limit=86442891210, rss.physical=1100009472, jvm.heap.used=164455264, jvm.heap.committed=301989888, jvm.heap.max=24024973312, malloc.count=7945, realloc.count=8, free.count=199, tags=[NATIVE_ILP_RSS=520159232, MMAP_TX_LOG=170385408, ...]]

This feature adds a SQL helper function, is_end_of_month(timestamp), which returns true when the provided timestamp falls on the final calendar day of its month. This is useful for monthly reports, billing-period logic, finance workflows, and calendar-based time-series analysis. It handles leap-year February, non-leap-year February, 30-day months, 31-day months, and nanosecond timestamp inputs.
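As a rough illustration of the rule is_end_of_month applies, the Python sketch below checks whether a timestamp's day number equals the length of its month. It is a stand-alone model, not QuestDB's implementation: the function name only mirrors the SQL helper, and Python's datetime carries microsecond rather than nanosecond precision.

```python
import calendar
from datetime import datetime


def is_end_of_month(ts: datetime) -> bool:
    """Return True when ts falls on the last calendar day of its month."""
    # monthrange returns (weekday of first day, number of days in month);
    # it accounts for leap years, so February needs no special case.
    days_in_month = calendar.monthrange(ts.year, ts.month)[1]
    return ts.day == days_in_month


if __name__ == "__main__":
    for sample in (datetime(2024, 2, 29), datetime(2023, 2, 28), datetime(2024, 2, 28)):
        print(sample.date(), is_end_of_month(sample))
```

The time of day is ignored in this sketch, so any instant on the final day of a month qualifies.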
Improvements
This improvement applies two independent optimizations to the Parquet write path shared by ALTER TABLE ... CONVERT PARTITION TO PARQUET, COPY ... TO Parquet export, and the /exp HTTP Parquet endpoint. First, symbol/dictionary statistics are now computed once per distinct dictionary key per page instead of once per row, using a reused per-column-chunk bitmap (one bit per key, up to 65,536 keys) that reduces comparisons from ~100k to ~1k per page for a 1000-symbol Zipf column with 100k-row pages. Second, the designated (monotonic) timestamp now defaults to delta_binary_packed encoding when no explicit encoding is given, significantly shrinking the timestamp column. On a 20M-row HOUR partition, conversion time dropped from 1.96s to 1.05s and Parquet file size decreased from 432 MB to 338 MB (a 47.2% reduction from the 640 MB native size). The delta_binary_packed encoding carries fixed per-block framing overhead that makes very small partitions (a handful of rows) slightly larger, but the net file-size win holds at scale. Explicit PARQUET(...) encodings still take precedence over the new default.

This improvement replaces the LimitedSizeLongTreeChain with an encoded-sort cursor for fixed-width sort keys (up to 32 bytes), eliminating random-access row-group decodes during the top-N build phase. The cursor collects encoded key and rowId pairs in a flat native buffer during the forward scan and sorts them natively. Once the buffer reaches max(2 × limit, 4096) entries (clamped to the sort memory cap), it sorts, truncates to the top limit, and rejects subsequent rows via a single native compare against the boundary entry. During the build phase, the sort cursor declares its sort-key columns to the base cursor, allowing async-filtered scans to decode only the filter and sort-key columns instead of the full projection. During the emit phase, the cursor declares the exact row set to the base cursor, enabling Parquet frames to materialize only those rows via row filtering, with a density gate falling back to full decode when declared rows cover half or more of a row group. On a 50M-row order-book table with wide array columns, the customer query at LIMIT 100000 improved from 236 ms to 185 ms warm and from 646 ms to 330 ms cold. The EXPLAIN output now prints "Encode sort light" instead of "Sort light". The legacy tree-chain cursor remains available via cairo.sql.orderby.sort.enabled=false. Two-bound NULL limit semantics changed: LIMIT null, N now returns the first N rows and LIMIT N, null returns an empty result. Rows sharing the boundary sort key are now ordered by ascending rowId instead of reverse insertion order.

This improvement enables equality predicates on a view or subquery that wraps a join to be propagated not only into the master table scan but also across equi-join keys into the slave table scans, turning full table scans into keyed lookups. Previously, filtering a view by a column of the join's master table pushed the predicate only to the master scan, while the equivalent inline query was already fully optimized. This fix covers INNER and LEFT joins, table and column aliases, and bind parameters as well as literal constants. For example, querying a view with WHERE entity_id = 'abc' now pushes the filter through events.entity_id = e.id into both slave events scans, producing index-forward scans on entity_id instead of full table scans. Constants pinned to the slave side are still not propagated to the master, and a WHERE on a left-joined slave column still runs as a post-join filter.

The parallel top-K factory (ORDER BY ... LIMIT N over page-frame tables) previously collected candidates into a per-worker tree structure driven by a compiled comparator, where every scanned row paid a tree insertion whose comparisons re-positioned the record across frames. This improvement routes encodable sort keys (fixed-width, byte-comparable keys up to 32 bytes) through flat per-worker encoded buffers of key-rowId entries instead. Each worker appends encoded entries and keeps its buffer at O(limit) through sort-and-truncate compaction; once a key threshold is known, non-qualifying rows drop with a native key compare. The owner merges the worker buffers through the same threshold filter, runs one final native sort, and emits rows by rowId. A batch-encoding fast path handles the common single fixed-width-8 column case, hoisting column base address, type dispatch, and direction transform out of the per-row loop. Per-worker encoders now share the owner's symbol rank maps instead of each independently sorting the whole symbol dictionary. Non-encodable keys (STRING/VARCHAR, non-static SYMBOL, keys over 32 bytes) keep the existing tree-chain path unchanged. In benchmarks on a 200M-row table, parquet partition queries improved from 380ms–200s (unstable) to a stable 19.76ms, and native partition queries improved from 29ms to 9.72ms. Peak native memory on the encoded parallel path is up to workerCount times the serial encoded cursor, though existing budget caps still apply.

Previously, SELECT ... FROM t LIMIT 10 on a parquet partition decoded the entire first row group before LIMIT discarded all but 10 rows, and LIMIT m, n decoded the skip-landing row group in full before jumping to the skip position. This improvement pushes a max-rows hint from the limit cursor down to the parquet decoder so the decode window covers only the rows that LIMIT will actually return. For LIMIT m, n, the window starts at the skip landing row, so the skipped prefix is not decoded either. Decoded buffers stay frame-origin-addressable by shifting published column vectors back, so records and row cursors keep using absolute frame-relative row indexes with no read-side changes. A query that random-accesses a clamped frame via recordAt re-decodes that frame to the full window, giving up the clamp for that frame in exchange for unconditional safety. In benchmarks on a table with 17 columns and 200,000 rows across two parquet row groups, SELECT * FROM t LIMIT 99990, 100010 improved from 53.04ms to 8.91ms, and SELECT * FROM t LIMIT -10, -20 improved from 26ms to 6ms. This fix also addresses a stale-pointer bug where partition frame cursors carried over a parquet metadata decoder reference from a prior call, which could crash or return wrong rows after reader eviction, and a clamp-gate bug where sorted symbol index scans could incorrectly read undecoded rows.

This improvement introduces a light cached window factory (CachedWindowLightRecordCursorFactory) that avoids materializing base columns into the record chain when the base cursor supports random access. Instead, it stores only deferred window output columns in a narrow chain plus a parallel DirectLongList of base row IDs, fetching base columns on demand via baseCursor.recordAt. An encoded sort buffer packs sort keys and row IDs into a flat DirectLongList and finalizes with Vect.sortEncodedEntries (parallelized above a configurable threshold), replacing the red-black tree sort for eligible groups. When the cursor knows the base row count, the buffer pre-sizes its backing memory to avoid geometric reallocations. The light path is controlled by the cairo.sql.window.cached.light.enabled configuration property (default true) and activates when the base cursor supports random access and every ordered group fits the encoded sort layout. Window queries with sort keys that do not fit the encoded layout (e.g., VARCHAR ORDER BY) fall back to the original tree-based factory. Additionally, lead(ts, 0) and lead(m, 0) now correctly return TIMESTAMP and DATE types respectively, matching the behavior of their lag counterparts instead of returning LONG.
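The encoded sort layout mentioned in this section pairs each fixed-width sort key with its row ID so that a plain byte-wise sort orders the rows. The Python sketch below models that idea for a single signed 64-bit key. It is an illustration under stated assumptions, not QuestDB's buffer layout: encode_long and sorted_row_ids are hypothetical names, and the big-endian, sign-biased encoding is one common way to make integers byte-comparable.

```python
import struct

SIGN_BIAS = 1 << 63          # shifts signed 64-bit values into unsigned range
MASK_64 = 0xFFFFFFFFFFFFFFFF


def encode_long(value: int, descending: bool = False) -> bytes:
    """Encode a signed 64-bit integer so byte order matches numeric order."""
    biased = (value + SIGN_BIAS) & MASK_64
    if descending:
        # Inverting every bit reverses the byte-wise ordering.
        biased ^= MASK_64
    return struct.pack(">Q", biased)


def sorted_row_ids(keys, descending: bool = False):
    """Sort rows by key using flat key+rowId entries; return row IDs in order."""
    # Each entry is 16 bytes: 8-byte encoded key followed by 8-byte row ID.
    entries = [
        encode_long(key, descending) + struct.pack(">Q", row_id)
        for row_id, key in enumerate(keys)
    ]
    entries.sort()  # byte-wise comparison; equal keys fall back to row ID
    return [struct.unpack(">Q", entry[8:])[0] for entry in entries]


if __name__ == "__main__":
    print(sorted_row_ids([5, -3, 5, 0]))
    print(sorted_row_ids([5, -3, 5, 0], descending=True))
```

Because the row ID trails the key inside each entry, rows with equal keys come out in ascending row ID order in both directions, and the caller can fetch the remaining columns by row ID afterwards.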
Bug Fixes
This fix closes a gap in the object store client's retry coverage. While writes, deletes, copies, and streaming reads already had retry loops, whole-object reads (used by every download_and_unpack call site including restore's per-partition metadata, manifest, checkpoint history, and replication WAL index), stat calls, and lister creation were single-attempt operations. A transient S3 connection closure during a metadata download could abort a restore-on-startup and kill the server. The fix introduces a shared retry_op backing read, stat, and lister creation with jittered backoff. NotFound errors fail fast with their kind preserved so callers can map them to None, TLS certificate-trust failures also fail fast since no retry can succeed, and everything else is retried within the configured budget. Exhaustion wraps the last error with an exceeded retry attempts marker for diagnostics.

This fix addresses two latent bugs on the covering-index incremental seal path, both triggered by a pure-append out-of-order commit on a partition with a sealed posting index and more than 256 symbol keys. The first bug caused a crash after dropping an INCLUDE column: DROP COLUMN on an INCLUDE column tombstones its cover slot, but the incremental seal path did not skip tombstoned slots like the full-seal path did, resulting in an assertion error on unsupported column type -1 that distressed the writer. The second bug was a silent native memory overrun: sealIncremental sized the shared dirty-stride sidecar scratch at 8 bytes per merged value, but 16-byte types (UUID/LONG128/DECIMAL128) and 32-byte types (LONG256/DECIMAL256) wrote past the allocation through unbounded Unsafe.copyMemory. A hot symbol key with 4,001 rows and UUID+LONG256 INCLUDE columns could write 93,984 bytes past the end of a 34,048-byte allocation. The fix makes writeSidecarStrideData skip tombstoned slots and sizes the dirty-stride sidecar scratch by the widest live fixed-size cover column.

This fix addresses multiple defects where TableWriter.rollback() on a table with a covering POSTING index left the index's covering sidecars inconsistent with its row-id data. The next out-of-order commit could fail with a spurious "No space left" error citing a multi-exabyte size and distress the writer, or silently corrupt covered reads. Covered queries between the rollback and the next seal also returned NULL values. The root cause was that rollbackValues() via reencodeMonolithic() rewrote the row-id index into a fresh .pv at a bumped sealTxn but never wrote the .pc covering sidecars at that txn. The fix includes: rollback now returns early when no indexed value lies above the rollback point (the common case), avoiding unnecessary reencode work; reencodeMonolithic() rebuilds covering sidecars at the new sealTxn; seal() validates each cover's sidecar layout before snapshot copy and falls back to full seal on structural mismatch; incremental seal's clean-stride copy now bounds every offset and uses the stored stride-index sentinel as the upper bound instead of snapshot length, fixing a separate defect where post-seal gen flushes caused compounding sidecar growth; and rollback/truncate paths now sync the .pk chain publish before recording seal purge, closing a power-loss window where recovery could find a committed chain head pointing at deleted files.

This fix corrects two defects where order-sensitive aggregates first(), last(), first_not_null(), and last_not_null() could return values from the wrong row in parallel GROUP BY queries. The first defect affected per-row ordering in computeNext over DECIMAL, GEOHASH, and IPv4 columns, where the returned value came from an arbitrary row in the group rather than the row with the smallest or largest position. The second defect affected shard merge for last_not_null over every value type: when a group's value was non-null in one worker's map but null in another, last_not_null could silently drop the real value and return null. Both defects were data- and timing-dependent, surfacing only when workers reduced page frames out of row order, which is more likely under HTTP/PostgreSQL Wire Protocol worker pools or work stealing. The fix makes computeNext decide the per-row winner by comparing row ids for the missed function families (Decimal, GeoHash, IPv4), and adds a stored-null check to every last_not_null merge path across all 14 value types (Char, Date, Decimal, Double, Float, GeoHash, IPv4, Int, Long, Str, Symbol, Timestamp, Uuid, and Varchar), mirroring the guard that first_not_null merges already carried.

Concurrent Influx Line Protocol clients sending to WAL tables over TCP could silently corrupt each other's data, with values from one connection leaking into rows written by another. The corruption affected DECIMAL columns ingested from numeric or string values, LONG256 columns, and binary-format values cast to SYMBOL columns. The server reported no error, so the problem only surfaced in query results. The root cause was that LineTcpMeasurementScheduler shared one LineWalAppender across all network IO workers, so concurrent appendToWal calls raced on its Decimal256, Long256Impl, and DirectUtf8Sink scratch buffers. With this fix, the scheduler keeps one appender per network IO worker, indexed by worker ID, restoring the one-appender-per-thread invariant that the HTTP path and QwpWalAppender already followed. Deployments with a single network IO worker or a single connection were unaffected, as were ingestion over HTTP and QWP over WebSocket.

This fix addresses several classes of issues with Parquet file reading. Malformed or foreign-generated Parquet files could abort the entire database process through unrecoverable panics or infallible allocations sized by attacker-controlled counts in Thrift metadata, dictionary page headers, uncompressed page sizes, DELTA length streams, and array definition-level scratch buffers. All of these paths now use fallible allocation and return recoverable errors, keeping the server running. A separate correctness bug affected well-formed files including those QuestDB writes: partial range reads of multi-block DELTA-length-encoded STRING/VARCHAR/BINARY columns computed the value-bytes offset incorrectly, returning shifted or corrupted data. This occurred when an interval predicate ended inside a row group with more than 128 values. The offset calculation now uses a structural walk over the block/miniblock layout for partial reads. Additionally, IntervalBwdPartitionFrameCursor.calculateSize() had an off-by-one that under-counted rows for unbounded-low interval predicates (ts < X) in ORDER BY ts DESC queries, potentially dropping row 0 and skipping entire partitions. The non-parallel read path also previously discarded specific error messages behind a generic "corrupted" string; it now preserves and surfaces the underlying cause. Allocation failures on these paths are classified as OutOfMemory so that parquet merges under ApplyWal2TableJob back off and retry rather than suspending the table.

The WAL apply job's transaction skip calculation could walk past non-data transactions (structural metadata changes), allowing a skip decision to cross a DDL boundary. This meant two appliers of the same WAL stream (e.g. a replication primary and replica) could skip different transaction sets and reach a DDL point with different intermediate table content. When ALTER COLUMN ... TYPE SYMBOL seeds the new column's symbol map from the rows present at apply time, different intermediate content on two appliers caused different keys to be assigned to the same strings, resulting in rows silently resolving to wrong symbol values. The TRUNCATE fast path had the same exposure. With this fix, the future-transaction scan stops at the first non-data transaction of any kind, so a replace-range can only justify skipping transactions within the same structural-change-free window. This guarantees every applier holds identical table content at each DDL point, restoring the invariant that symbol key assignment is independent of the WAL apply method. The skip optimization becomes more conservative: a data transaction whose replacer sits beyond an interleaved DDL/SQL/TRUNCATE transaction is now applied normally instead of skipped. Materialized view refresh and replace-range ingestion without interleaved DDL keep the previous skip behavior unchanged.

The backward POSTING index reader keeps a per-key cache of generation-position entries that a full index walk resolves, and replays it on subsequent lookups of the same key. The cache-build guard keyed off an Elias-Fano mode flag, but certain early returns in the generation-load methods reset the block count without clearing this flag. When a walk passed through an Elias-Fano-encoded generation and then a lower generation that did not hold the key (reachable via split-block bloom filter false positives), the stale flag caused the guard to cache a spurious entry pointing at the lower generation but carrying the position from the EF generation. On the next lookup of the same key, replaying the spurious entry either returned row IDs belonging to a different key (wrong query results) or read a file offset far past the mapped value file and crashed the process with a SIGSEGV. This fix clears the Elias-Fano mode flag at the start of each generation-load method so a prior generation's mode cannot leak into the cache-add guard. The change does not affect the writer or the on-disk format, and indexes written before this fix are read correctly without requiring a rebuild. The forward reader was unaffected as it already reset the flag correctly.
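The stale-flag failure and its remedy can be modelled in a few lines. The Python sketch below is a toy model, not the reader's code: GenerationLoader, build_cache, the dictionary-based generations, and the reset_on_load switch are all hypothetical and exist only to show why resetting per-generation state at the top of the load method keeps an early return from leaking the previous generation's mode into the cache guard.

```python
class GenerationLoader:
    """Toy loader holding per-generation state that a cache guard inspects."""

    def __init__(self, generations, reset_on_load=True):
        # Each generation is {"ef": bool, "keys": {key: position}} (hypothetical).
        self.generations = generations
        self.reset_on_load = reset_on_load
        self.ef_mode = False
        self.position = -1

    def load(self, gen_index, key):
        if self.reset_on_load:
            # The remedy: clear state before any early return can happen.
            self.ef_mode = False
            self.position = -1
        gen = self.generations[gen_index]
        if key not in gen["keys"]:
            return False  # early return; without the reset, old state survives
        self.ef_mode = gen["ef"]
        self.position = gen["keys"][key]
        return True


def build_cache(loader, key):
    """Walk generations newest-first and cache entries the guard accepts."""
    cache = []
    for gen_index in range(len(loader.generations) - 1, -1, -1):
        loader.load(gen_index, key)
        if loader.ef_mode:  # guard keyed off the mode flag
            cache.append((gen_index, loader.position))
    return cache


if __name__ == "__main__":
    gens = [{"ef": False, "keys": {}}, {"ef": True, "keys": {"k": 7}}]
    print(build_cache(GenerationLoader(gens, reset_on_load=False), "k"))
    print(build_cache(GenerationLoader(gens, reset_on_load=True), "k"))
```

With the reset disabled, the walk records a second entry for generation 0 carrying generation 1's position, which mirrors the spurious cache entry described above; with the reset enabled, only the genuine entry is cached.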
Snapshot/checkpoint restore validates parquet partitions against _txn before generating missing _pm sidecars. The committed-size check previously ran only when _pm was absent: a restored _pm short-circuited the partition before any validation. A snapshot could pair _txn with a stale or truncated data.parquet and a matching old _pm (for example when the parquet file is regenerated in place between the moments the two files were captured), and the restore then completed silently, leaving a partition that reads garbage at query time. With this fix, the restore agent opens data.parquet and checks its length against the committed size from _txn for every parquet partition, regardless of whether _pm exists. A snapshot whose parquet partition is shorter than the committed size now fails the restore with a diagnostic error, and a parquet partition missing data.parquet now fails the restore even when _pm exists. Restores that previously appeared to succeed with such partitions now fail loudly at restore time rather than at query time, and the restore can be retried against an intact snapshot. Each parquet partition restore performs one extra open/length call when _pm already exists.

A query nesting a join inside an IN sub-query failed at compile time with an internal NullPointerException. The root cause was that SqlParser parsed a join's ON clause using a shared ExpressionTreeBuilder without raising an arg-stack floor, allowing the ON-clause drain loop to consume operands from the outer expression. This left the IN node with a null left operand, which the code generator dereferenced and crashed on. This fix introduces pushArgStackBottom()/popArgStackBottom() in ExpressionTreeBuilder, and parseJoin now brackets the ON-clause parse with them, raising the arg-stack floor and blocking sub-queries for the duration. A secondary fix ensures that parse errors unwinding through a raised parse-stack bottom now surface the positioned SqlException instead of an internal IllegalStateException, by clamping each restored bottom to the current stack size. Additionally, declared variables are now expanded up front in both shorthand ON branches: a variable bound to a sub-query is consistently rejected with "query is not allowed here" across all ON-clause positions, and a variable bound to a column correctly expands as a shorthand join column (e.g., ON (@c) with @c := symbol behaves like ON (symbol)).

During session ID rotation, SessionInfo.rotate() updated the coupled sessionId and rotateAt fields in sequence, while the eviction sweep read them under a different lock. An eviction interleaving between the two writes observed the new sessionId together with the stale rotateAt, computed an eviction time in the past, and dropped the old session ID immediately instead of keeping it valid through the grace period. A request still in flight on the old cookie was then rejected with HTTP 401, causing the user's Web Console session to appear to log out. This fix introduces a RotationInfo record so that rotate() performs a single atomic write of both sessionId and rotateAt, and the sweep reads that record as one consistent snapshot.

The optimiser incorrectly pushed WHERE predicates referencing only the master (left) table down into that table's sub-query for SPLICE, FULL OUTER, and RIGHT OUTER joins. These join types NULL-extend the master side for unmatched right rows, but the pushed predicate left those NULL-master rows unfiltered, producing wrong results. This fix introduces masterNullingJoinIndex(), which reports the outermost downstream join that NULL-extends a given table. analyseEquals(), assignFilters(), and moveWhereInsideSubQueries() now consult it and keep such predicates as post-join filters instead of pushing them into the table's sub-query. A master-side equality spanning two master tables (WHERE t0.a = t1.b) is similarly deferred to the post-join filter when a downstream NULL-extending join is present. Three secondary fixes accompany the optimiser change: EmptySymbolMapReader.keyOf(null) now returns VALUE_IS_NULL instead of VALUE_NOT_FOUND, EmptyTableRandomRecordCursor.newSymbolTable() returns the shared immutable EmptySymbolMapReader.INSTANCE, and QueryModel.parseWhereClause() preserves each conjunct's innerPredicate origin. Additionally, a pre-existing crash where a SPLICE join feeding a subsequent join leaked a stale master alias into createJoinMetadata() (triggering an AssertionError with assertions enabled) is now resolved. The tradeoff is that master-side filters on these three join types can no longer use index or interval scans on the master table, running instead as a full master scan plus a post-join filter. That is the cost of correctness, as the previous plan produced wrong results.

This fix addresses a native memory leak tagged as NATIVE_INDEX_READER in the posting covering-index read path. When a row cursor was returned to the pool after its owning reader had already been closed, the cursor's blockBufferAddr (a 512-byte allocation) was re-pooled into a reader whose free-cursor list would never be drained again, causing the buffer to leak. The fix guards the re-pool operation with the reader's isOpen() state in both forward and backward readers' cursor close methods. When the owning reader is already closed, the cursor now releases its native buffers immediately rather than pooling into a dead reader, while preserving the block-buffer reuse optimization for the normal case when the reader is still open.

This fix resolves a ClassCastException that occurred when the SQL parser encountered a dotted name after an operator, such as tables()/.env. The expression parser attempted to cast an operator token to GenericLexer.FloatingSequence, but operator tokens are plain String instances, causing an unchecked cast failure. This surfaced as an internal 500-class error instead of a clean 400 bad-query response. The fix guards the fast-path concatenation in ExpressionParser so it only runs when the qualifier token is actually a FloatingSequence; otherwise it rejects cleanly with a SqlException stating '.' is unexpected here. This handles all malformed <expr> / .name patterns, not just the specific names used by vulnerability scanners.

This fix ensures that catalogue functions (tables(), all_tables(), information_schema.columns(), pg_catalog.pg_attribute) and point-lookup paths (SHOW COLUMNS, SHOW CREATE TABLE, SHOW CREATE MATERIALIZED VIEW, parquet row-group pruning) return complete results even when queried before the background metadata cache hydration finishes. Previously, the metadata cache was populated lazily on a background thread, so queries executed immediately after a restart or backup restore could observe an empty or partial table list. The fix adds hydrateAllTables() for catalogue enumerators and hydrateTableOnDemand() for point-lookup paths, which reconcile the cache against the authoritative table registry on demand. A volatile cacheComplete flag ensures these reconciliation calls short-circuit with no overhead once the cache is fully populated. A give-up budget of 8 consecutive zero-progress reconcile rounds prevents unbounded retries for genuinely unreadable tables. This is particularly important for PostgreSQL Wire Protocol clients, where drivers and BI tools automatically query column enumerators on connect. Without this fix, the standard pg_class to pg_attribute introspection join would render existing tables with zero columns during the hydration window.
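The reconcile-on-demand behaviour described above can be sketched as a small state machine. The Python model below is an assumption-laden illustration, not the engine's code: MetadataCache, hydrate_all_tables, and the load_table callback are hypothetical stand-ins, and treating an exhausted give-up budget as "complete" is a guess at how the short-circuit and the budget interact.

```python
class MetadataCache:
    """Toy cache reconciled against an authoritative registry on demand."""

    GIVE_UP_ROUNDS = 8  # consecutive zero-progress rounds before giving up

    def __init__(self, registry, load_table):
        self.registry = list(registry)  # authoritative table names
        self.load_table = load_table    # returns metadata, or None if unreadable
        self.tables = {}
        self.cache_complete = False
        self.zero_progress_rounds = 0

    def hydrate_all_tables(self):
        if self.cache_complete:
            return  # short-circuit once the cache is known to be settled
        progressed = False
        for name in self.registry:
            if name in self.tables:
                continue
            metadata = self.load_table(name)
            if metadata is not None:
                self.tables[name] = metadata
                progressed = True
        if all(name in self.tables for name in self.registry):
            self.cache_complete = True
        elif progressed:
            self.zero_progress_rounds = 0
        else:
            self.zero_progress_rounds += 1
            if self.zero_progress_rounds >= self.GIVE_UP_ROUNDS:
                # Assumption: stop retrying tables that never become readable.
                self.cache_complete = True


if __name__ == "__main__":
    cache = MetadataCache(["trades"], lambda name: {"columns": 3})
    cache.hydrate_all_tables()
    print(cache.cache_complete, cache.tables)
```

A catalogue query would call hydrate_all_tables before enumerating, so the first query after a restart pays for the reconciliation and later queries return immediately.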