QuestDB Releases

Latest updates and improvements to QuestDB Open Source and Enterprise editions.

July 1, 2026

QuestDB Enterprise 3.3.3 is a stability-focused patch release on the QuestDB 9.4.x engine that concentrates on reliability and correctness: hardening replication and backup around snapshot restore and role changes, and pulling in a large batch of Parquet, posting/covering-index, storage and SQL-correctness fixes from the OSS engine. Alongside the fixes it adds two Enterprise capabilities — a hot in-place primary/replica role switch and finer-grained replication control (replication.disabled.tables plus rebase-aware replication).

New Features

This feature introduces two replication capabilities. First, replication.disabled.tables provides a reloadable, comma-separated list of tables to exclude from replication — both the uploader and downloader skip listed tables. Second, replication-aware handling of ALTER TABLE REBASE WAL ensures a rebased table is re-baselined correctly on replicas instead of being rebuilt from an empty state. The uploader detects a rebased table via a permanent _rebase_new marker, skips its empty seed transaction, and uploads from seqTxn 2, recording first_txn=2 in index.msgpack. A replica refuses to build such a table from the object store when it is missing the baseline below first_txn, prompting an operator to copy the table and resume. The previously separate replication status checks are consolidated into a single getReplicationStatus(dirName) callback returning ACTIVE, SUSPENDED, or DISABLED. An additional reloadable switch, replication.primary.pause.upload.on.suspended.tables (default false), allows pausing uploads of WAL-apply-suspended tables. For permissions, ALTER TABLE … SUSPEND WAL and RESUME WAL are both authorized by the single RESUME WAL permission, while REBASE WAL requires SYSTEM ADMIN privilege due to its destructive nature.
This feature adds controls for managing hard-suspended WAL tables and introduces ALTER TABLE <t> REBASE WAL to rebuild a table under a fresh sequencer. The cairo.wal.apply.suspended.tables configuration provides a reloadable, comma-separated list of table names that ApplyWal2TableJob skips, preventing WAL transaction application. When cairo.wal.apply.suspended.write.denied is set to true, writes to a hard-suspended table are rejected instead of being queued. Non-structural ALTER statements and FORCE DROP PARTITION on suspended tables are routed through a WAL-bypass path and applied directly via TableWriter, while structural changes remain denied. REBASE WAL clones the applied table into a new directory via hard links into a .rebase/ staging directory, resets _txn and _meta with a new tableId, seeds two empty transactions, swaps the name registry, and drops the old directory. Preconditions require the table to be a WAL table, hard-suspended, and cairo.wal.apply.suspended.write.denied=true. Rebasing a base table invalidates dependent materialized views (recoverable with REFRESH MATERIALIZED VIEW <v> FULL), while rebasing a materialized view itself re-registers it for a full refresh. Pending unapplied WAL transactions are discarded by a rebase. For permissions, SUSPEND WAL and RESUME WAL share the same authorization path, while REBASE WAL requires system-admin privilege due to its destructive nature.
This feature introduces opt-in memory limits that cap how much native memory a single bounded workload may allocate, throwing at the offending allocation site when the cap is crossed so a runaway is stopped at its source while unrelated workloads keep running. Three independent, dynamically reloadable configuration properties control the limits: cairo.query.memory.limit.bytes for user SQL queries, cairo.mat.view.refresh.memory.limit.bytes for materialized view refreshes, and cairo.wal.apply.memory.limit.bytes for WAL apply. All default to 0 (unlimited). On breach, the engine throws a CairoException with a distinct message identifying the workload type, query ID, limit, and memory tag. A MemoryTracker wraps a 16-byte native block shared with Rust, acquired per workload via QueryRegistry.register/unregister. Coverage includes map family, sort/tree chains, hash-join chains, FastGroupByAllocator and function state, LATEST BY rowid lists and maps, set-operation maps, encoded sort, secondary joins, window/horizon-join aggregations, SAMPLE BY fill, and Parquet decode buffers. The vectorized (Rosti) keyed GROUP BY hash tables remain on the global RSS counter only. The query_activity view gains memory_used and memory_limit columns. Tracker-aware pooled memory classes release their native backing on cursor close and re-allocate on next use to ensure each malloc and its matching free are charged to the same tracker.
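The per-workload cap can be pictured as a counter that every allocation passes through. The sketch below is a hypothetical Python model and not QuestDB's MemoryTracker: the names `WorkloadMemoryTracker` and `MemoryLimitExceeded` and the message layout are assumptions made for illustration. Only the "0 means unlimited" default and the "throw at the offending allocation" behavior come from the notes above.
```python
class MemoryLimitExceeded(Exception):
    """Raised by the allocation that would cross the workload's cap."""


class WorkloadMemoryTracker:
    """Hypothetical model of a per-workload native memory cap."""

    def __init__(self, workload, limit_bytes=0):
        self.workload = workload
        self.limit_bytes = limit_bytes  # 0 means unlimited
        self.used_bytes = 0

    def malloc(self, size, tag):
        # Check before charging, so a rejected call leaves the counter unchanged.
        if self.limit_bytes > 0 and self.used_bytes + size > self.limit_bytes:
            raise MemoryLimitExceeded(
                f"{self.workload} memory limit exceeded "
                f"[limit={self.limit_bytes}, used={self.used_bytes}, "
                f"requested={size}, tag={tag}]"
            )
        self.used_bytes += size

    def free(self, size):
        # A free must be charged to the same tracker that recorded the malloc.
        self.used_bytes -= size
```
Because each workload owns its own tracker in this model, a breach in one query does not affect the counters of unrelated workloads.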
This feature introduces statistical aggregate functions for computing kurtosis and skewness. The implementation uses Pébay's one-pass online algorithm (an extension of Welford's) for numerical stability, maintaining running mean and central moments (M2, M3, M4) per group. Partial aggregates merge with pairwise-combine formulas, enabling parallel GROUP BY execution. Sample skewness returns null for fewer than 3 values; sample kurtosis for fewer than 4. The sample variants apply Fisher's bias correction. Population variants return null for empty groups. All variants return null when every observation is equal (zero variance). Example usage:
```
SELECT skewness(price), kurtosis(price)
FROM trades
WHERE symbol = 'BTC-USD'
SAMPLE BY 1h;
```
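For readers curious about the one-pass approach, the following Python sketch shows the standard online update and pairwise-combine formulas for the central moment sums M2, M3 and M4. It is an illustration of the published formulas, not QuestDB's implementation: the `Moments` class and helper names are hypothetical, only the population variants are shown, and reporting kurtosis as excess kurtosis (minus 3) is an assumption the notes above do not confirm.
```python
import math
from dataclasses import dataclass


@dataclass
class Moments:
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean
    m3: float = 0.0  # sum of cubed deviations
    m4: float = 0.0  # sum of fourth-power deviations

    def add(self, x):
        n1 = self.n
        self.n += 1
        n = self.n
        delta = x - self.mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        # Update the highest moment first: each line reads the old lower moments.
        self.m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
                    + 6 * delta_n2 * self.m2 - 4 * delta_n * self.m3)
        self.m3 += term1 * delta_n * (n - 2) - 3 * delta_n * self.m2
        self.m2 += term1

    def merge(self, other):
        """Pairwise-combine two partial aggregates into a new one."""
        if self.n == 0:
            return Moments(other.n, other.mean, other.m2, other.m3, other.m4)
        if other.n == 0:
            return Moments(self.n, self.mean, self.m2, self.m3, self.m4)
        na, nb = self.n, other.n
        n = na + nb
        d = other.mean - self.mean
        mean = self.mean + d * nb / n
        m2 = self.m2 + other.m2 + d * d * na * nb / n
        m3 = (self.m3 + other.m3
              + d ** 3 * na * nb * (na - nb) / n ** 2
              + 3 * d * (na * other.m2 - nb * self.m2) / n)
        m4 = (self.m4 + other.m4
              + d ** 4 * na * nb * (na * na - na * nb + nb * nb) / n ** 3
              + 6 * d * d * (na * na * other.m2 + nb * nb * self.m2) / n ** 2
              + 4 * d * (na * other.m3 - nb * self.m3) / n)
        return Moments(n, mean, m2, m3, m4)


def from_values(values):
    m = Moments()
    for v in values:
        m.add(v)
    return m


def population_skewness(m):
    # None stands in for SQL null: empty group or zero variance.
    if m.n == 0 or m.m2 == 0.0:
        return None
    return math.sqrt(m.n) * m.m3 / m.m2 ** 1.5


def population_excess_kurtosis(m):
    if m.n == 0 or m.m2 == 0.0:
        return None
    return m.n * m.m4 / (m.m2 * m.m2) - 3.0
```
Merging the partial aggregates of `[1, 2]` and `[3, 4, 10]` yields the same moment sums as a single pass over all five values, which is the property that lets a parallel GROUP BY combine per-worker results.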

Improvements

This improvement introduces an internal continuation runtime built on jdk.internal.vm.Continuation that allows SQL functions to yield their carrier worker thread while waiting, so the number of concurrent calls is no longer bounded by the worker pool size. The wait_wal_table(table_name [, seq_txn]) function blocks the current query until the WAL writer has applied transactions up to a target sequencer transaction number, returning true on success. It respects the SQL circuit breaker for query timeout, explicit CANCEL QUERY, and broken client connections. The sleep(seconds) function pauses the current query for a given duration (up to 24 hours) and returns a single TIMESTAMP row. Both functions release their worker carrier while parked. JDK ThreadLocal is replaced with a carrier-keyed CarrierLocal along continuation-critical paths to prevent stale thread identity across yield/resume boundaries. Two new configuration properties are available: cairo.timer.shards controls the number of daemon threads driving deadline-based wakeups, and griffin.query.continuation.wake.interval sets the millisecond interval for circuit breaker probing during park.
```
INSERT INTO trades VALUES (now(), 'AAPL', 100, 150.0);
SELECT wait_wal_table('trades');
```
This improvement extends the encoded radix sort path to every column type that ORDER BY accepts and to keys of any width, across both full-sort and serial/parallel top-K paths. Previously, only fixed-width columns fitting within 32 bytes used the fast radix sort; everything else fell back to a red-black tree sort with O(log n) comparator calls per row. The SortKeyEncoder now normalizes each sort column into a byte-comparable segment: VARCHAR uses shifted UTF-8 bytes with a terminator, STRING and non-static SYMBOL use UTF-16BE with escaping, and UUID/LONG256 use unsigned big-endian with an order-preserving null remap. Variable-width keys use a 16-byte inline prefix with overflow into a key heap, and the native kernel finishes tied partitions with pdqsort. Top-K builds entries in an EncodedTopKBuffer that rejects rows up front when their leading word is beyond the kept boundary, skipping full key encoding. Benchmarks on the ClickBench hits table show improvements from 52ms to 37ms at LIMIT 10 and from 100ms to 85ms at LIMIT 100000. The key heap is bounded by cairo.sql.sort.key.max.bytes, and the tree path remains available via cairo.sql.orderby.sort.enabled=false. This change also fixes a pre-existing bug where generateCastFunctions had no LONG128 case, causing UNION queries with both a SYMBOL column requiring casting and a LONG128 column to throw UnsupportedOperationException.
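The idea of a byte-comparable key can be illustrated with signed 64-bit integers. The Python sketch below is a generic example of the technique and does not reproduce SortKeyEncoder's actual layout: `encode_long` and `encode_key` are hypothetical names, and flipping the sign bit before writing big-endian bytes is the textbook normalization rather than something the notes above specify.
```python
import struct

SIGN_BIT = 1 << 63
MASK_64 = 0xFFFFFFFFFFFFFFFF


def encode_long(value):
    """Encode a signed 64-bit integer so its bytes compare like the number."""
    # Flipping the sign bit maps the signed range onto the unsigned range
    # in order; big-endian puts the most significant byte first.
    return struct.pack(">Q", (value & MASK_64) ^ SIGN_BIT)


def encode_key(values):
    """Concatenate fixed-width segments, one per sort column."""
    return b"".join(encode_long(v) for v in values)


rows = [(3, -1), (-5, 7), (3, -9), (0, 0)]
# Sorting by the encoded bytes gives the same order as sorting by the values.
by_bytes = sorted(rows, key=encode_key)
```
Once every column is normalized this way, a sort can compare raw bytes (or radix-partition on them) without knowing the column types.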
This improvement addresses persistent native memory consumption from cached parallel GROUP BY, top-K, and ASOF/window join factories. Previously, AsyncFilterContext.clear() freed page-frame memory pools but left per-worker DirectLongList row-id buffers at their peak size (up to 8 MB each at the default 1M row page frame), only freeing them on query-cache eviction. Now clear() calls resetCapacity() on the owner and per-worker lists, returning them to the 256-entry initial capacity. The tradeoff is one reallocation when a cached factory is reused for another large scan. This improvement also includes a safety fix for a native memory leak: DirectLongList.resetCapacity() could re-malloc a closed list when clear() was called after close(), which occurred in the four horizon-join factories on a failed getCursor(). The fix adds a guard to skip closed lists and reorders cursor/frameSequence cleanup in horizon-join factories to match the GROUP BY/TopK factories.
This improvement eliminates redundant per-row evaluation of subexpressions whose value is invariant across all rows of a cursor but not known at compile time. A typical example is a time-window threshold like dateadd('d', -30, to_timezone(now(), 'Asia/Kolkata')) inside a CASE branch or filter predicate, where now() is already cached but the surrounding function calls were re-evaluated on every row. A new RuntimeConstFunction wrapper evaluates the subtree once during init(), caches the result in a primitive field, and serves it from every getter. FunctionParser wraps only the maximal runtime-constant subtree at function boundaries, avoiding double-wrapping. Trivial runtime-constant leaves such as bind variables and now() that already cache their values are skipped to avoid unnecessary indirection. Only fixed-width scalar types (timestamp, long, int, double, date, boolean, ipv4, geo, uuid, etc.) are folded; variable-length types are left for future work. The wrapper delegates toPlan() to its argument, so EXPLAIN output is unchanged.
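The evaluate-once pattern can be sketched in a few lines of Python. This is a hypothetical model: `CountingFunction` and `RuntimeConstLong` are illustrative stand-ins, not QuestDB classes, and the method names only loosely mirror the init() and getter split described above.
```python
class CountingFunction:
    """Stand-in for an expensive row-invariant subexpression."""

    def __init__(self, compute):
        self._compute = compute
        self.calls = 0

    def get_long(self, record):
        self.calls += 1
        return self._compute()

    def to_plan(self):
        return "dateadd('d', -30, ...)"


class RuntimeConstLong:
    """Evaluates the wrapped subtree once per cursor and caches the result."""

    def __init__(self, arg):
        self._arg = arg
        self._value = None

    def init(self):
        # Runs once when the cursor opens; the value is invariant across rows.
        self._value = self._arg.get_long(None)

    def get_long(self, record):
        # Every row is served from the cached field.
        return self._value

    def to_plan(self):
        # Delegating keeps the plan text identical to the unwrapped function.
        return self._arg.to_plan()
```
With the wrapper in place, scanning a million rows calls the inner function once instead of a million times.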
This improvement reduces redundant I/O during checkpoint and snapshot restore by collapsing per-partition Parquet metadata sidecar (_pm) processing from up to two maps and three CRC verifications down to a single map and single CRC per partition, all executed in parallel. Previously, the serial validation pass dominated restore startup on cold, object, or network storage. The recovery pool now submits one worker per thread, each pulling partitions from a shared cursor with dynamic load balancing and reusing native scratch objects across partitions. The committed-size truncation check on data.parquet runs inside the worker, removing the O(N) serial ff.length() loop. Additionally, the non-Parquet bitmap index rebuild now parallelizes across individual (partition, column) work items rather than whole partitions, so a non-partitioned table with several indexed symbol columns no longer serializes all index rebuilds onto one thread. A behavioral tradeoff is that fail-fast detection of truncated captures now occurs at the parallel drain point rather than strictly before sibling work starts, though the first failing worker trips a shared abort latch that short-circuits all other workers.

Bug Fixes

This fix addresses a bug where a primary snapshot could capture a Parquet partition whose on-disk data.parquet was a later generation than the committed Parquet size in _txn. An in-place O3 rewrite (partitionMutates=false) grows the file before _txn advances. When the replica restored that torn pair and replayed the O3 merge over the partition, the stale _pm caused the merge to decode a column chunk past the committed size, resulting in a "File out of specification: Column chunk range exceeds data length" error that suspended the replica table. This fix regenerates a partition's _pm during snapshot/checkpoint restore when the restored data.parquet is longer than the committed size.
This fix addresses an out-of-memory condition in the posting index seal and rollback paths on memory-constrained instances ingesting skewed symbol columns. The sealIncremental buffers were sized at DENSE_STRIDE (256) * preAllocPerKey, resulting in up to 256x over-allocation on skewed columns. This fix right-sizes the buffers to the actual dirty-stride aggregate, adds RSS pre-flights to both the seal and rollback paths, and streams every rollback cover shape per-key instead of decoding the whole index into one buffer. The fix also closes several latent safety issues including native out-of-bounds writes in unbounded posting decoders on corrupt or torn input, and a seal-purge reuse-race that could delete a live file. The tradeoff is that the incremental-seal pre-flight is conservative and may defer to the slower full seal when the incremental path would have just fit, and the streaming rollback re-derives each surviving key rather than holding the whole index, trading per-key CPU on the rare rollback path for a peak-memory bound.
A snapshot could capture a Parquet partition whose on-disk data.parquet was a later generation than the committed Parquet size recorded in _txn. The restore path trusted the partition's existing _pm sidecar in that state and left it stale, causing the first read or merge after restore to fail with a "Column chunk range exceeds data length" error. On a replica replaying the WAL over the restored partition, this suspended the table and broke replication. The root cause was that the validation only checked whether the sidecar could resolve a footer at the committed size, but a _pm from the later generation still resolved successfully despite being stale. The fix now trusts an existing _pm only when data.parquet is exactly the committed size. When the file is longer (indicating the snapshot captured the partition in the middle of an in-place rewrite), the _pm is regenerated from the committed size, which produces a correct compact sidecar for the committed footer.
This fix addresses a production out-of-memory incident on a memory-constrained instance ingesting a skewed symbol column. The incremental seal path (sealIncremental) sized its per-stride merge and trial buffers as if all 256 keys in a stride held the single hottest key's row count, inflating allocation by up to ~256x beyond actual data. The buffers are now sized to the actual aggregate of dirty strides computed from per-key counts, and a pre-flight defers to the full seal (which streams per key) when the correctly sized buffers would still breach the RSS limit. The rollback path previously decoded the entire partition index into one buffer before filtering, which was the exact memory-heavy case that triggered the incident. Rollback now streams per key for every cutoff, bounded by the largest single key rather than the whole index. All covering index shapes (fixed-size, var-size, and addr-based) now stream their sidecar rebuilds. Error handling was hardened across seal and rollback paths: failures after the value-file switch poison the writer so all mutating entry points reject further operations, while failures before the switch clean up staged files and restore writer state. Several latent safety bugs were also fixed, including unbounded native writes in posting decoders on corrupt input (now rejected with a clean error) and a seal-purge reuse race that could delete a live file after a sealTxn was freed and republished.
This fix addresses a remaining out-of-memory path in the posting index incremental seal that was not covered by the prior streaming-fallback RSS pre-flight. The incremental seal snapshots each covered column's entire sealed sidecar into native memory before the stride-merge pre-flight runs, so a skewed symbol with a covered (INCLUDE) column could still trigger an unguarded multi-gigabyte malloc that exceeded the global RSS limit during a WAL fast-lag commit. The fix adds the same headroom pre-flight before the snapshot copy, re-reading RSS usage per cover column to account for snapshots already taken. When the copy would not fit the live headroom, the incremental path is abandoned and the full seal is used instead, which rebuilds every sidecar by streaming per-key from the column files with peak memory bounded by the largest single key rather than the entire sidecar.
This fix resolves an intermittent NullPointerException in CoveringIndexRecordCursorFactory.getCursor that occurred when multiple queries ran concurrently, such as all panels of a Grafana dashboard firing at once over PostgreSQL Wire Protocol. The factory stored a direct reference to the compiler's pooled keyValueFuncs list from IntrinsicModel. Since SQL compilers are pooled and shared across threads, a second thread borrowing the same compiler would clear that shared list (via IntrinsicModel.clear()), nulling out slots while the first thread's factory was still reading them during getCursor(). The fix creates an owned copy of the list in the constructor, preserving the same Function instances which the factory already owns and frees on close. This restores the contract already honored by FilterOnValuesRecordCursorFactory, whose key-value list parameter is consumed at construction rather than retained.
Equality and IN filters on a renamed column of a Parquet partition could return too few rows (often zero), silently dropping valid data. The defect was in Parquet row-group bloom-filter pushdown, which resolved the filtered column against the Parquet file's column names. Those names are frozen when the partition is converted to Parquet, so a later RENAME COLUMN left them stale. When another column already carried the query's current name (common after a chain of renames), the pushdown landed on the wrong Parquet column, checked that column's bloom filter, got a false negative, and skipped the entire row group. This fix makes the native-table pushdown resolve columns by stable column id, the same way the data-decode path already does. read_parquet() cursors retain name-based resolution, which is the correct semantics for arbitrary external files without QuestDB column ids. Pruning effectiveness is unchanged for the common non-renamed case, and the bloom filter format, statistics, and decode paths are untouched.
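A small model makes the failure mode concrete. In the Python sketch below, the data structures and function names are hypothetical. It assumes a table whose columns `price` (id 0) and `amount` (id 1) were converted to Parquet, after which `price` was renamed to `old_price` and `amount` was renamed to `price`.
```python
# Columns as written into the Parquet file at conversion time:
# (stable column id, frozen name), in file order.
parquet_columns = [(0, "price"), (1, "amount")]

# Current table metadata after the two renames: current name -> column id.
table_columns = {"old_price": 0, "price": 1}


def resolve_by_name(query_column):
    """Old behavior: match the query's name against the frozen file names."""
    for index, (_, frozen_name) in enumerate(parquet_columns):
        if frozen_name == query_column:
            return index
    return None


def resolve_by_id(query_column):
    """Fixed behavior: go through the stable column id."""
    column_id = table_columns[query_column]
    for index, (file_id, _) in enumerate(parquet_columns):
        if file_id == column_id:
            return index
    return None
```
Here a filter on `price` resolves to file column 0 by name (the old `price`) but to file column 1 by id (the column the query actually means), so a bloom filter checked by name belongs to the wrong column.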
Many query factories only consulted the execution context's circuit breaker inside their row-processing loop, which never runs for queries that produce no rows (empty tables, no-match filters, empty joins/aggregates), for instant single-row results, or for parallel paths that dispatch nothing when the input is empty. This fix ensures every affected factory consults the breaker at the point its work starts: at the top of hasNext() before the empty/advance guard, before build/aggregate loops, at cursor open for shared empty-cursor singletons, and before dispatchAndAwait() on parallel paths. A new time-throttled breaker variant always tests cancellation and timeout (both cheap, no syscall) and throttles only the heavy connection probe syscall by elapsed wall-clock time, defaulting to a 100 ms window. The throttle state lives on the breaker shared across the whole query, so a CROSS JOIN that re-scans the slave once per master row issues at most one probe per window for the entire query. Every query now pays for one real breaker check on its first consultation, including a single connection-probe syscall. Queries that previously ran to completion under an aggressive timeout can now abort as expected, including catalogue and admin listings.
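The throttling rule can be sketched as follows. This Python model is an assumption-laden illustration, not the engine's breaker: `ThrottledBreaker`, `QueryAborted`, the injected clock, and the `probe_connection` callback are all hypothetical. It keeps the two properties described above: cancellation and timeout are tested on every call, and only the connection probe is rate-limited, with the first call always probing.
```python
class QueryAborted(Exception):
    pass


class ThrottledBreaker:
    """Hypothetical time-throttled circuit breaker shared by one query."""

    def __init__(self, clock_ms, timeout_ms, probe_connection, throttle_ms=100):
        self._clock_ms = clock_ms
        self._deadline_ms = clock_ms() + timeout_ms
        self._probe = probe_connection
        self._throttle_ms = throttle_ms
        self._last_probe_ms = None  # None forces a probe on the first check
        self.cancelled = False

    def check(self):
        now = self._clock_ms()
        # Cheap tests run on every call.
        if self.cancelled:
            raise QueryAborted("cancelled")
        if now >= self._deadline_ms:
            raise QueryAborted("timeout")
        # The expensive probe runs at most once per throttle window.
        if (self._last_probe_ms is None
                or now - self._last_probe_ms >= self._throttle_ms):
            self._last_probe_ms = now
            if not self._probe():
                raise QueryAborted("connection broken")
```
Because the throttle state lives on the breaker object, any number of nested cursors sharing it still issue at most one probe per window.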
A SAMPLE BY query using first() or last() aggregation with an indexed SYMBOL column filter and ALIGN TO FIRST OBSERVATION could return a bucket timestamp where an aggregated value was expected. The wrong results appeared only when the designated timestamp column was omitted from the SELECT list, and only for buckets in the middle of the result set — the first and last buckets were always correct. The root cause was that SampleByFirstLastRecordCursorFactory emitted middle rows through a data record that special-cased the bucket-timestamp column using the base-table timestamp index, while the column it was handed is a projection index. The two indexes coincide only when the timestamp is projected at the same position it occupies in the base scan. When the timestamp was not projected, the base index pointed at an unrelated projected column and overwrote it with the bucket timestamp. This fix makes the data record use the same projection index as the boundary record, so when the timestamp is not projected, no column matches the special case and every aggregate reports its stored value.
Subtracting a CHAR or STRING literal from a timestamp, such as SELECT * FROM t WHERE timestamp > now() - '1' LIMIT 1, failed with an internal error. The literal was implicitly bound to the LONG operand slot of the timestamp subtraction operator, but the factory read the right operand with getTimestamp() instead of getLong(). Calling getTimestamp() on a CharFunction throws UnsupportedOperationException, which surfaced as an internal error. This fix changes SubTimestampFunctionFactory to read the right operand with getLong(), mirroring the symmetric AddLongToTimestampFunctionFactory which already did this correctly. As a result, - and + now produce identical, well-typed cast errors for non-numeric literals. For example, now() - '3 day' now reports inconvertible value: '3 day' [STRING -> LONG] instead of the previous [STRING -> TIMESTAMP].
Two malformed expressions involving CASE ... END crashed with internal errors instead of returning syntax errors. A binary operator with a missing right operand directly after CASE ... END (e.g., select sum(case when true then 1 else 0 end & )) caused a NullPointerException because the expression parser's END keyword handler left a stale depth counter that defeated the arity guard, allowing a tree node with a null left-hand side to be built. This fix resets the depth counter to 0 after the flush loop so the arity guard fires correctly, producing the error too few arguments for '&' [found=1,expected=2]. A dot directly after CASE ... END (e.g., select case when true then 1 else 0 end.foo) caused either a NullPointerException or ClassCastException because the dot handler assumed a gluable literal token on the operator stack. This fix extends the token only when the stack top is a literal; otherwise it raises the syntax error '.' is unexpected here. Valid CASE expressions and qualified table.column references are unaffected by both fixes.
This fix addresses a storage-engine corruption that occurred when a WAL REPLACE_RANGE commit in O3 mode appended brand-new partitions above the table's previous last partition. The root cause was in TableWriter.processO3Block, where replace mode derived partitionTimestampHi from o3TimestampMax (the replace-range high boundary) rather than the highest timestamp actually written. For an open-ended range, the boundary is Long.MAX_VALUE - 1, causing getCurrentPartitionMaxTimestamp to overflow. As a result, partitionTimestampHi stayed at the stale pre-commit ceiling, finishO3Commit skipped switching the writer's active column files to the new last partition, and the next commit reused the previous partition's column descriptors, producing rows dated below the new partition's floor or suspending the table. This fix derives partitionTimestampHi from the actual highest written timestamp (replaceMaxTimestamp) in replace mode, leaving the non-replace path unchanged.
This fix resolves a data-ordering bug where a non-WAL writer ingesting out-of-order data could produce out-of-order rows when the in-order prefix crossed a partition boundary. The partition switch sealed the earlier partition in memory but did not persist _txn. On the next lag commit, o3MoveUncommitted reclaimed the active partition's uncommitted in-order tail into the O3 buffer and reset maxTimestamp to the durable _txn value, which predated the switch. The high-water mark ended up below committed data, causing a rollback to reload it and the next O3 commit to reorder the partition tail. This fix ensures maxTimestamp resets to the max timestamp of the last sealed partition by introducing a new variable that is updated whenever a partition switch happens.
This fix addresses a data-ordering bug in the non-WAL out-of-order (O3) commit path where a lag commit (TableWriter.ic()) could leave the table's maxTimestamp below the maximum timestamp actually committed to disk. When uncommitted rows of a lag commit spanned more than the active partition, o3MoveUncommitted() pulled only the active partition's rows into the sorted O3 batch and emptied it. The in-order rows left on disk in the previous partition stayed committed but were not part of the sorted O3 batch, so maxTimestamp was recomputed solely from the O3 batch boundary and could end up lower than the true on-disk maximum. A later O3 commit then merged against that stale boundary and left a single timestamp inversion. This fix reads the earlier partition's actual max timestamp via readPartitionMinMaxTimestamps() and folds it into the committed maxTimestamp computation when the active partition is emptied while uncommitted rows remain in an earlier partition.
This fix resolves a non-deterministic metadata reload failure where MemoryCMRImpl.of() read errno after a cleanup close() call when a file's length could not be read. Because errno is a single per-thread value that any subsequent native call can overwrite, the intervening close() could clobber the errno left by the failed length() call, so the exception carried the cleanup call's errno instead of the real one. This mattered because metadata reload classifies failures by errno: TableUtils.handleMetadataLoadException() retries the read only while CairoException.isFileCannotRead() is true. With a clobbered errno, a transient "file does not exist" condition — such as a _meta file briefly unreadable during a concurrent metadata change — was misclassified as fatal, surfacing a "could not get length" error instead of retrying. This fix captures errno immediately after the failed length() call, before close() runs.
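The ordering problem is easy to reproduce in a toy model. The Python sketch below simulates a file facade in which every call overwrites a shared errno field. `FakeFiles`, `open_buggy`, and `open_fixed` are hypothetical names, and the specific error codes are chosen only for illustration.
```python
ENOENT = 2  # "no such file or directory"
EBADF = 9   # "bad file descriptor"


class FakeFiles:
    """Toy file facade: each call overwrites the shared errno value."""

    def __init__(self):
        self.errno = 0

    def length(self, fd):
        self.errno = ENOENT  # simulate a transient "file does not exist"
        return -1

    def close(self, fd):
        self.errno = EBADF   # simulate cleanup leaving a different errno


def open_buggy(ff, fd):
    if ff.length(fd) < 0:
        ff.close(fd)
        # errno is read after cleanup, so it reflects close(), not length().
        raise OSError(ff.errno, "could not get length")


def open_fixed(ff, fd):
    if ff.length(fd) < 0:
        err = ff.errno  # capture immediately after the failed call
        ff.close(fd)
        raise OSError(err, "could not get length")
```
Only the fixed variant reports the "file does not exist" code that a retry policy keyed on errno would recognize as transient.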
This fix encompasses a broad set of query engine hardening changes. Key fixes include: SAMPLE BY FILL(LINEAR) cleanup no longer crashes with NullPointerException on out-of-memory; covering posting index readers now propagate sidecar I/O errors instead of swallowing them; IntervalBwdPartitionFrameCursor.calculateSize() now returns correct counts for backward interval scans with open lower bounds; DATE-argument window value functions (max, min, first_value, last_value, nth_value, lag, lead) no longer throw UnsupportedOperationException; rank() / dense_rank() no longer crash when pass-through columns of unserializable types (UUID, STRING, VARCHAR, LONG256, arrays) are projected; lead() over high-precision DECIMAL no longer crashes; window RANGE-frame functions now return correct values when the designated timestamp is not in the projection; sum() / avg() over Decimal256 now produce correct results over sliding frames by using scale-agnostic subtraction for eviction; JIT filter now correctly widens nested INT products to 64 bits when a FLOAT operand is present in the predicate; LimitedSizeLongTreeChain cleanup no longer crashes on partial construction failure; keyed ASOF JOIN sink-heap no longer leaks on out-of-memory during cursor open; vectorized rosti GROUP BY no longer over-frees NATIVE_ROSTI memory when wrapUp() resizes the map; stale aggregate tasks in the shared vector-aggregate queue no longer crash later queries after a fault aborts a cursor drain; GROUP BY keys referencing aliases of trivial arithmetic expressions now compile correctly; multi-key index scan no longer leaks posting-index block buffers on out-of-memory; and window functions over untyped null literals now raise a clean SqlException suggesting a concrete cast instead of behaving non-deterministically across platforms.
This fix addresses two distinct faults where per-partition code paths iterating posting columns were guarded only against the absent case (columnTop == -1), not the row-less case (columnTop >= partition size). In the first fault, restorePostingIndexersToLastPartition() would crash with "index does not exist" when encountering a row-less column with no .pk key file, causing WAL table suspension or INSERT failures on BYPASS WAL tables. The fix discriminates on the key file's actual presence rather than row-less-ness, correctly re-pointing indexers for row-less columns that do have a .pk (created via ADD COLUMN ... INDEX TYPE POSTING on the active partition) while skipping those without one. In the second fault, linkPartitionIndexFiles() would crash with "index files do not exist" when switching a partition to parquet and encountering a row-less column without a key file. The fix aligns the guard with its sibling copyOrRebuildColumnIndexes() to skip row-less columns on historic partitions where a .pk is never built. The row-less-without-.pk state arises from multi-partition, same-transaction out-of-order writes interleaved with ADD COLUMN. Without the first fix, a blanket row-less skip would silently strand indexers, making rows invisible to indexed predicates with no exception or suspension. The accessor was also changed from getColumnTopQuick to getColumnTop to correctly return -1 for absent columns.
This fix resolves a production bug where a query cancellation could be silently dropped, and the affected query's per-query timeout defeated along with it, when the cancel races query startup. The root cause was that NetworkSqlExecutionCircuitBreaker.cancel() always sets a powerUpTime == Long.MIN_VALUE sentinel but only flips the per-query cancelled flag when that flag is already attached. QueryRegistry.register() publishes the query and fires its listener before binding the per-query flag, so a CancelRequest landing in that window sets only the sentinel. The subsequent register() call then binds a fresh flag with value false, and neither testTimeout() nor testCancelled() consulted the sentinel — testTimeout() would compute an overflowed negative runtime that never trips, and testCancelled() only checked the flag. The query would run to natural completion ignoring both cancel and timeout, pinning workers. The fix makes testCancelled() check the sentinel first so both stateful paths abort a racing cancel, updates getState() to classify the sentinel as STATE_CANCELLED instead of mislabelling it STATE_TIMEOUT, and adds a clearCancelSentinel() call in PGConnectionContext.prepareForNewQuery() to bound the sentinel to a single query and prevent leaking across requests.
This fix addresses several issues where HTTP parser and response objects were not fully resetting their per-request state when connections and contexts were reused across requests. A reused context could hit closed native parsing or compression buffers and crash the server when handling otherwise valid requests. Specifically, the fix reopens the header parser's quoted-value sink when pooled HTTP contexts are reused, clears per-request gzip negotiation state in HttpResponseSink so clients only receive gzip responses when the current request advertises gzip support, clears charset and mapped cookie state in HttpHeaderParser.clear(), fixes multipart false-boundary replay so body bytes are preserved when a boundary-like sequence is followed by non-boundary suffixes, keeps multipart resume pointers anchored to the receive buffer during retry paths, parses Content-Disposition parameters with quote-aware delimiter handling including semicolons and equals signs inside quoted values, and resets Content-Disposition parameter state consistently after known and unknown parameters. This allows uploaded filenames such as a;b.csv and a=b;c.csv to parse correctly.
This fix addresses numerical-stability bugs in the corr() function where the Pearson denominator sqrt(sumXX * sumYY) can overflow or underflow while computing the product of two sums of squared deviations. With large-magnitude inputs (values near +/-1e153), each sum is finite (~1e306) but their product overflows to +Infinity, causing the final division to return 0.0 instead of the true correlation. With small-magnitude inputs (values near +/-1e-150), each sum is finite (~1e-300) but their product underflows to 0.0, causing the division to return NaN. The fix prefers the single-rounding sqrt(a * b) denominator when the product is finite and non-zero, preserving existing bit-exact behavior for normal inputs, and falls back to sqrt(a) * sqrt(b) when the product would overflow or underflow while both factors are non-zero. The final Pearson result is clamped to [-1, 1] to absorb small rounding drift in the fallback path. This applies to CorrGroupByFunctionFactory, AbstractBivariateStatWindowFunctionFactory.computeCorr, and AbstractBivariateStatWindowFunctionFactory.computeCorrWelford. Normal-magnitude inputs are unaffected and keep their prior bit-exact results.
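The denominator selection can be sketched in Python, whose floats are IEEE 754 doubles like Java's. The function name `pearson` and its signature are hypothetical, and returning `None` for a zero sum of squares is an assumption, since the notes above only describe the case where both factors are non-zero.
```python
import math


def pearson(sum_xy, sum_xx, sum_yy):
    """Pearson r from the sum of cross deviations and sums of squared deviations."""
    if sum_xx == 0.0 or sum_yy == 0.0:
        return None  # assumption: zero variance is reported as null
    product = sum_xx * sum_yy
    if product != 0.0 and math.isfinite(product):
        # Normal case: a single rounding inside one sqrt.
        denom = math.sqrt(product)
    else:
        # The product overflowed to inf or underflowed to 0.0:
        # take the square roots first so the intermediate stays in range.
        denom = math.sqrt(sum_xx) * math.sqrt(sum_yy)
    r = sum_xy / denom
    # Clamp to absorb rounding drift from the two-sqrt path.
    return max(-1.0, min(1.0, r))
```
With sums near 1e306 the product is infinite, and with sums near 1e-300 it is zero, yet both cases return a correlation close to the true value through the fallback branch.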
This fix addresses a production issue where a materialized view's incremental refresh could be permanently invalidated by transient errors such as reader pool exhaustion (EntryUnavailableException) or out-of-memory conditions, cascading invalidation up dependent view chains. On transient "table busy" or OOM errors, the refresh now schedules a per-view backoff deadline (cairo.mat.view.refresh.busy.retry.timeout, default 1000ms between retries) and returns without invalidating, with MatViewTimerJob re-enqueuing the refresh once the backoff elapses. A consecutive retry counter caps retries at a configurable limit (cairo.mat.view.refresh.busy.retry.limit, default 10), after which the view is invalidated to bound WAL retention. The materialized_views.view_status column now returns retrying while a view is in a transient-refresh backoff window. The OOM path no longer calls Os.sleep between step-halving attempts, preventing the single refresh worker from being blocked. Genuine errors such as bad SQL, type mismatches, or dropped base tables still invalidate immediately. A new cairo.mat.view.refresh.block.list configuration option accepts a comma-separated list of materialized view names that the refresh job must never refresh, serving as an escape hatch for views whose refresh crashes or destabilizes the database. Listed views are skipped by all refresh paths without being invalidated. A blocked view grows stale and can pin base-table WAL retention until dropped or unblocked.
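As a rough illustration of the retry policy, the following Python sketch models one view's state. The class, function, and outcome labels are hypothetical, the two constants mirror the documented defaults, and the exact point at which the limit trips (here, the failure after the tenth scheduled retry) is an assumption rather than confirmed engine behavior.

```python
from dataclasses import dataclass

# Mirror the documented defaults; the names are illustrative only.
RETRY_TIMEOUT_MS = 1000  # cairo.mat.view.refresh.busy.retry.timeout
RETRY_LIMIT = 10         # cairo.mat.view.refresh.busy.retry.limit


@dataclass
class ViewRefreshState:
    status: str = "valid"
    consecutive_retries: int = 0
    retry_at_ms: int = 0


def on_refresh_result(state: ViewRefreshState, now_ms: int, outcome: str) -> ViewRefreshState:
    """Update the view state; outcome is 'ok', 'transient', or 'fatal' (hypothetical labels)."""
    if outcome == "ok":
        state.status = "valid"
        state.consecutive_retries = 0
    elif outcome == "transient" and state.consecutive_retries < RETRY_LIMIT:
        # Table busy or OOM: schedule a backoff deadline instead of invalidating.
        state.consecutive_retries += 1
        state.retry_at_ms = now_ms + RETRY_TIMEOUT_MS
        state.status = "retrying"
    else:
        # Genuine error, or the retry budget is exhausted: invalidate.
        state.status = "invalid"
    return state
```

A successful refresh resets the counter, so only an unbroken run of transient failures can exhaust the budget.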

June 15, 2026

QuestDB 9.4.3 brings key bug fixes, along with Parquet-native tables and new Parquet query performance enhancements, including order-of-magnitude speedups for ORDER BY ... LIMIT queries.

New Features

This feature introduces a SQL helper function, is_end_of_month(timestamp), which returns true when the provided timestamp falls on the final calendar day of its month. It is useful for monthly reports, billing-period logic, finance workflows, and calendar-based time-series analysis. The function handles leap-year and non-leap-year February, 30-day months, 31-day months, and timestamp_ns inputs.
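The calendar rule behind the function can be shown with a short Python equivalent. This is only a sketch of the semantics described above, not the SQL function itself, and it works on Python `datetime` values rather than QuestDB timestamps.

```python
import calendar
from datetime import datetime


def is_end_of_month(ts: datetime) -> bool:
    """Return True when ts falls on the last calendar day of its month."""
    # monthrange returns (weekday of the first day, number of days in the month).
    last_day = calendar.monthrange(ts.year, ts.month)[1]
    return ts.day == last_day
```

For example, 29 February 2024 and 28 February 2023 both qualify, while 28 February 2024 does not because 2024 is a leap year.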
This feature logs an INFO line on a configurable interval with raw accounted memory, physical RSS, JVM heap usage, allocator counters, and the top 10 non-zero memory tags by absolute value. It reads Unsafe and Os memory accounting directly, so it works independently of metrics.enabled. Controlled by memory.usage.log.enabled (default true) and memory.usage.log.interval (default 60s, max 24h). Both properties support hot-reload. Example log line: memory usage [mem.accounted=821301126, mem.rss.accounted=610057798, mem.non.rss.accounted=211243328, mem.rss.limit=86442891210, rss.physical=1100009472, jvm.heap.used=164455264, ...].
This feature replaces the previous fixed 8-slot entry-count limit for decoded Parquet row groups with a byte-budgeted LRU cache, and lets factories declare their access pattern as MONOTONIC or SCATTERED. The byte budget makes the cache footprint explicit and tunable, while the access-pattern hint allows forward-mostly cursors to run on a smaller share of the budget while out-of-order cursors retain the full budget. Configured via cairo.sql.parquet.cache.memory.size (default 256 MB per cursor). The previous cairo.sql.parquet.frame.cache.capacity property is deprecated and no longer affects behavior. SCATTERED-hinted paths (sorts, hash joins, windows) receive the full budget and a much larger effective cache than the old 8 slots, while MONOTONIC-hinted paths (AsOf/Lt/Splice joins, latest by, scalar subqueries) cap at 4 decoded row groups at a quarter of the configured budget. Internal improvements include in-place victim reuse on cache miss to eliminate close()/reopen() round-trips and associated syscalls, a bounded shell pool to avoid wrapper-object churn, skipping openParquet() on cache hits, true O(1) LRU hit promotion via an intrusive doubly-linked list, and VARCHAR_SLICE decode footprint accounting.
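To make the byte-budget idea concrete, here is a minimal Python sketch of an LRU cache bounded by bytes rather than by entry count. The class name `ByteBudgetLru` is hypothetical, each cached value is represented only by its size, and the way the MONOTONIC hint is applied (a quarter of the budget and at most 4 entries) is a simplified reading of the description above, not the engine's code.

```python
from collections import OrderedDict


class ByteBudgetLru:
    """LRU cache of decoded row groups, bounded by bytes instead of slot count."""

    def __init__(self, budget_bytes: int, pattern: str = "SCATTERED"):
        if pattern == "MONOTONIC":
            # Forward-mostly cursors: a quarter of the budget, at most 4 entries.
            self.budget = budget_bytes // 4
            self.max_entries = 4
        else:
            # Out-of-order cursors keep the full budget with no entry cap.
            self.budget = budget_bytes
            self.max_entries = None
        self.used = 0
        self.entries = OrderedDict()  # row-group key -> decoded size in bytes

    def get(self, key):
        """Return the cached size for key and promote it, or None on a miss."""
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)
        return self.entries[key]

    def put(self, key, size_bytes: int) -> None:
        """Insert or refresh an entry, then evict least-recently-used entries."""
        if key in self.entries:
            self.used -= self.entries.pop(key)
        self.entries[key] = size_bytes
        self.used += size_bytes
        # Always keep the newest entry, even if it alone exceeds the budget.
        while len(self.entries) > 1 and self._over_limit():
            _, evicted = self.entries.popitem(last=False)
            self.used -= evicted

    def _over_limit(self) -> bool:
        too_many = self.max_entries is not None and len(self.entries) > self.max_entries
        return self.used > self.budget or too_many
```

`OrderedDict.move_to_end` gives constant-time hit promotion, which plays the role of the intrusive doubly-linked list mentioned above.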

Improvements

This improvement replaces the tree-chain comparator with a flat native buffer of encoded (key, rowId) pairs for fixed-width sort keys up to 32 bytes. During the build phase, the encoded-sort cursor collects entries and keeps the buffer at O(limit) through sort-and-truncate compaction, eliminating the O(K log K) random-access decodes per query that the tree chain required. The build phase also narrows decoded columns on async-filtered scans to only the filter and sort-key columns, and the emit phase decodes only the rows being returned via a row filter, with a density gate falling back to full decode when declared rows cover half or more of a row group. On a 50M-row Parquet partition with wide array columns, warm-cache performance improved from 236 ms to 185 ms and cold-cache from 646 ms to 330 ms. Build-phase improvements range from 3-6x across various LIMIT values, with the patch eliminating a decode-cache thrash cliff that caused master to degrade from 292 ms to 48.84 s at LIMIT 850,000. The EXPLAIN output now prints "Encode sort light" instead of "Sort light" for the encoded path. Two-bound NULL limit semantics changed: LIMIT null, N now returns the first N rows and LIMIT N, null returns an empty result, aligning with single-bound LIMIT N behavior.
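The sort-and-truncate compaction can be sketched in a few lines of Python. This is an illustration of the general technique only: the class name `TopKBuffer` and the choice to compact at twice the limit are assumptions, and plain tuples stand in for the byte-encoded (key, rowId) pairs.

```python
class TopKBuffer:
    """Flat buffer of (key, row_id) pairs kept at O(limit) by sort-and-truncate."""

    def __init__(self, limit: int):
        if limit < 1:
            raise ValueError("limit must be positive")
        self.limit = limit
        # Assumed compaction threshold: twice the limit.
        self.capacity = 2 * limit
        self.entries = []

    def add(self, key, row_id: int) -> None:
        """Append an entry and compact once the buffer is full."""
        self.entries.append((key, row_id))
        if len(self.entries) >= self.capacity:
            self._compact()

    def _compact(self) -> None:
        # Sort ascending by (key, row_id) and drop everything past the limit.
        self.entries.sort()
        del self.entries[self.limit:]

    def result(self):
        """Return the smallest `limit` entries in sorted order."""
        self._compact()
        return list(self.entries)
```

Because the buffer never grows beyond the compaction threshold, memory stays proportional to the LIMIT value regardless of how many rows are scanned.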
This improvement routes encodable sort keys (fixed-width, byte-comparable keys up to 32 bytes) through flat per-worker EncodedTopKBuffers instead of per-worker LimitedSizeLongTreeChain structures in parallel top-K plans. Each worker appends encoded entries and keeps its buffer at O(limit) through sort-and-truncate compaction, while the owner merges worker buffers through a threshold filter and runs one final native sort. A single-column batch encoder (encodeFixed8Frame) hoists type dispatch and direction transforms out of the per-row loop for the common ORDER BY col LIMIT N shape, and per-worker encoders now share the owner's symbol rank maps instead of each sorting the full dictionary independently. On a 200M-row Parquet partition with wide array columns, the target query improved from 380 ms–200 s (unstable) on master to a stable 19.76 ms, and on native partitions from 29 ms to 9.72 ms. Non-encodable keys (STRING/VARCHAR, non-static SYMBOL, keys over 32 bytes) continue using the existing tree-chain path. Peak native memory on the encoded parallel path scales up to workerCount times the serial buffer size, bounded by cairo.sql.sort.key.max.bytes and cairo.sql.sort.light.value.max.bytes per worker.
This improvement pushes a row-count hint from LimitRecordCursor down to the Parquet decoder so the decode window covers only the rows that LIMIT will actually return. For LIMIT m, n, the window starts at the skip landing row, so the skipped prefix is not decoded either. When the row cursor is a forward 1:1 scan with no filter, the Parquet decode range is clamped via a new PageFrameMemoryPool.navigateTo overload that takes an in-frame row window. Decoded buffers remain frame-origin-addressable through column vector remapping. Queries like SELECT * FROM t LIMIT 99990, 100010 on a 200K-row Parquet partition improved from ~53ms to ~9ms, and negative limits saw similar gains. This also fixes a stale-pointer bug where FullFwdPartitionFrameCursor.next(skipTarget) and FullBwdPartitionFrameCursor.next(skipTarget) carried over a parquetMetaDecoder reference from a prior call, producing a use-after-free that could crash or return wrong rows after reader eviction. A second bug was fixed where SortedSymbolIndexRowCursorFactory could allow scattered symbol-index scans through the clamp gate, reading undecoded rows. The optimization does not apply to filtered queries (WHERE ... LIMIT N), cross joins, or native (non-Parquet) partitions.
This improvement enables equality predicates on a view or subquery that wraps a join to be pushed down not only into the master table scan but also propagated across equi-join keys into the slave scans, turning full table scans into keyed lookups. Previously, filtering a view by a column of the join's master table only reached the master scan, while the same query written inline was already optimized. For example, SELECT * FROM entity_stats WHERE entity_id = 'abc' on a view containing LEFT JOIN events ON events.entity_id = e.id now pushes entity_id = 'abc' into both events scans via index forward scans. This covers INNER and LEFT joins, table/column aliases, and bind parameters as well as literal constants. Constants pinned to the slave side are still not propagated to the master, and a WHERE on a left-joined slave column still runs as a post-join filter.
This improvement includes two changes to the Parquet write path shared by ALTER TABLE ... CONVERT PARTITION TO PARQUET, COPY ... TO Parquet export, and the /exp HTTP Parquet endpoint. First, symbol/dictionary statistics are now computed once per distinct dictionary key per page instead of once per row, using a reused per-column-chunk bitmap (one bit per key) for dictionaries up to 65,536 keys. This reduced the CPU time spent in BinaryMaxMinStats::update from 34.6% to ~0.1% of conversion CPU. Second, the designated (monotonic) timestamp now defaults to delta_binary_packed encoding when no explicit encoding is given, significantly shrinking the timestamp column. On a 20M-row partition with 17 columns, conversion time dropped from 1.96s to 1.05s and Parquet file size decreased from 432MB to 338MB (47.2% reduction vs native). Explicit PARQUET(...) encodings still take precedence over the new default. For very small partitions (a few rows), delta_binary_packed produces slightly larger files due to fixed per-block framing overhead.
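The first change, computing statistics once per distinct dictionary key, can be sketched as follows. The function name `page_min_max` and its inputs (a list of per-row dictionary keys and a list of dictionary values) are hypothetical stand-ins for the writer's internal structures.

```python
def page_min_max(keys, dictionary):
    """Return (min, max) of the dictionary values referenced by a page.

    Each distinct key is compared once, tracked with one bit per key.
    """
    seen = bytearray((len(dictionary) + 7) // 8)  # one bit per dictionary key
    lo = hi = None
    for key in keys:
        byte_index, bit = key >> 3, 1 << (key & 7)
        if seen[byte_index] & bit:
            continue  # key already folded into the stats
        seen[byte_index] |= bit
        value = dictionary[key]
        if lo is None or value < lo:
            lo = value
        if hi is None or value > hi:
            hi = value
    return lo, hi
```

For a low-cardinality symbol column, the value comparisons drop from one per row to one per distinct key, which is the effect the profile numbers above reflect.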

Bug Fixes

This fix resolves a ClassCastException that occurred when the SQL expression parser encountered a dotted name (e.g., .env) after an operator (e.g., /). Queries like tables()/.env or select 1/.env from long_sequence(1) — commonly sent by vulnerability scanners probing endpoints like /exec?query=... — caused an unchecked cast of a plain String operator token to GenericLexer.FloatingSequence, resulting in an internal 500-class error with a stack trace instead of a clean 400 bad-query response. The fix guards the fast-path concat in ExpressionParser so it only runs when the qualifier token is actually a FloatingSequence; otherwise it rejects cleanly with a SqlException: '.' is unexpected here message. This handles every malformed <expr> / .name shape, not just the specific names used by scanners.
This fix resolves a native-memory leak of 512 bytes per occurrence (NATIVE_INDEX_READER tag) in the POSTING covering-index read path. PostingIndexBwdReader and PostingIndexFwdReader pool idle row cursors and deliberately retain each cursor's blockBufferAddr allocation for reuse. If a cursor returned to the pool after its owning reader was already closed — for example, when a reader-thread cursor outlived a concurrent reseal/reload — the cursor re-pooled into a reader whose free-cursor list would never be drained again, leaking the 512-byte block buffer. The fix guards the re-pool with the reader's isOpen() state in both readers' Cursor.close() and NullCursor.close() methods. When the owning reader is already closed, the cursor releases its native buffers immediately rather than pooling into a dead reader. The block-buffer reuse optimization is preserved for the normal case when the reader is still open.
The SQL optimizer incorrectly pushed WHERE predicates referencing only the master (left) table down into that table's sub-query for SPLICE, FULL OUTER, and RIGHT OUTER joins. These join types NULL-extend the master side for unmatched rows, so a pushed-down predicate left those NULL-master rows unfiltered, producing wrong results. The fix introduces masterNullingJoinIndex() to detect downstream NULL-extending joins and keeps such predicates as post-join filters via addPostJoinWhereClause. This applies across assignFilters(), analyseEquals(), and moveWhereInsideSubQueries(), covering single-table, multi-table, and barrier-joined branches. A master-side equality spanning two master tables is similarly deferred when a downstream NULL-extending join is present. Additionally, a pre-existing crash was fixed where a SPLICE join feeding a subsequent join leaked a stale master alias, causing an AssertionError in createJoinMetadata(). Because the predicate now runs after the join, master-side filters on these three join types can no longer use index or interval scans on the master table, which is the cost of correctness — the previous plan produced wrong results.
During session ID rotation, SessionInfo.rotate() updated the coupled sessionId and rotateAt fields in sequence, while the eviction sweep read them under a different lock. An eviction interleaving between the two writes observed the new sessionId together with the stale rotateAt, computed an eviction time in the past, and dropped the old session ID immediately instead of keeping it valid through the grace period. Requests still in flight on the old cookie were then rejected with HTTP 401, causing the user's Web Console session to appear to log out. This fix introduces a RotationInfo record so that rotate() performs a single atomic write of both sessionId and rotateAt, and the eviction sweep reads them as one consistent snapshot.
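The single-write pattern can be illustrated in Python with an immutable record that is replaced by one reference assignment. Apart from the names `RotationInfo`, `SessionInfo`, and `rotate`, which come from the description above, everything here (field names, the `snapshot` and `old_id_still_valid` helpers, integer timestamps) is a hypothetical sketch and not the server's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RotationInfo:
    """Immutable pair so readers never see a new ID with a stale timestamp."""
    session_id: str
    rotate_at: int


class SessionInfo:
    def __init__(self, session_id: str, now: int):
        self._rotation = RotationInfo(session_id, now)

    def rotate(self, new_id: str, now: int) -> None:
        # One reference write publishes both fields together.
        self._rotation = RotationInfo(new_id, now)

    def snapshot(self) -> RotationInfo:
        # Readers take the record once and use only that consistent copy.
        return self._rotation


def old_id_still_valid(session: SessionInfo, now: int, grace: int) -> bool:
    """The previous ID stays valid until the grace period after rotation ends."""
    info = session.snapshot()
    return now < info.rotate_at + grace
```

The eviction check reads `rotate_at` from the same snapshot that carries the new ID, so it cannot compute an eviction time from a pre-rotation timestamp.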
A query nesting a join inside an IN sub-query failed at compile time with an internal NullPointerException because the ON-clause parser never raised an argument-stack floor for the shared ExpressionTreeBuilder. Without the floor, the drain loop consumed operands belonging to the enclosing expression, leaving the IN node with a null left operand. The fix brackets the ON-clause parse with pushArgStackBottom() / popArgStackBottom(), which isolates its operands and blocks sub-queries within the ON clause (these are unsupported and are now consistently rejected with "query is not allowed here"). A secondary fix clamps restored stack bottoms to the current stack size during error unwinding, so a parse error nested two or more lambda levels deep surfaces its positioned SqlException instead of being masked by an internal IllegalStateException. Additionally, parseJoin now expands declared variables up front in shorthand ON branches, so a variable bound to a column (e.g., @c := symbol used as ON (@c)) works as a shorthand join column, and a variable bound to a sub-query is consistently rejected across all ON-clause positions.
Snapshot restore validates Parquet partitions against _txn before generating missing _pm sidecars, but the committed-size check previously ran only when _pm was absent. A restored _pm short-circuited the partition before any validation, so a snapshot pairing _txn with a stale or truncated data.parquet and a matching old _pm completed silently, leaving a partition that reads garbage at query time. The restore agent now opens data.parquet and checks its length against the committed size from _txn for every Parquet partition regardless of whether _pm exists. Restores that previously appeared to succeed with such partitions now fail loudly with the existing "restored parquet file is shorter than committed size" diagnostic, surfacing the problem at restore time rather than at query time. Each Parquet partition restore performs one extra open/length call when _pm already exists.
The backward POSTING index reader maintains a per-key cache of (generation, position) entries resolved during a full index walk. The cache-build guard keyed off isEFMode, but early returns in loadSparseGenByPrefixSum for generations holding no values for the queried key reset isFlatMode and the block count without clearing isEFMode. When a walk passed through an Elias-Fano-encoded generation (setting isEFMode) and then reached a lower generation that did not hold the key but for which the bloom filter returned a false positive, the stale isEFMode caused the guard to cache a spurious entry pointing at the lower generation with a position from the EF generation. Replaying this entry on subsequent lookups either returned row IDs belonging to a different key (wrong query results) or read a file offset past the mapped value file, crashing the process with a SIGSEGV. This fix clears isEFMode at the start of each generation-load method (loadSparseGenByPrefixSum, loadSparseGenDirect, loadDenseGenerationCached) so a prior generation's mode cannot leak into the cache-add guard. The forward reader is unaffected as it already resets the flag. No index rebuild is required.
The WAL transaction skip optimization could cross DDL boundaries when calculating which transactions to skip, causing different appliers (e.g., replication primary and replica) to reach a DDL point with different intermediate table content. When ALTER COLUMN ... TYPE SYMBOL seeds the new column's symbol map from rows present at apply time, different intermediate content caused different key-to-string assignments, resulting in silent symbol value divergence. The future-transaction scan now stops at the first non-data transaction of any kind, ensuring every applier holds identical table content at each DDL point. The TRUNCATE early exit is now only reachable when the entire window before it consists of data transactions. This fix makes the skip optimization more conservative: a data transaction whose replacer sits beyond an interleaved DDL/SQL/TRUNCATE transaction is now applied normally instead of skipped.
This fix addresses multiple issues in the Parquet reader. Several malformed or foreign-generated Parquet inputs could abort the entire database process via unrecoverable panics or infallible allocations sized by attacker-controlled counts. The following cases now use fallible allocation and return recoverable errors instead of aborting: Thrift metadata list reservations, RLE_DICTIONARY bit widths exceeding the target width, dictionary page num_values overrunning the buffer, unbounded decompression buffer allocations, foreign array definition-level scratch buffers, and DELTA length streams sized by attacker-controlled counts. A correctness bug affecting well-formed files was also fixed: partial range reads of multi-block DELTA-length-encoded STRING/VARCHAR/BINARY columns computed the value-bytes offset incorrectly, returning shifted or corrupted values. The offset calculation now uses a structural walk over the block/miniblock layout when the length decoder was not exhausted. Additionally, IntervalBwdPartitionFrameCursor.calculateSize() had an off-by-one that under-counted rows for unbounded-low interval predicates (ts < X) in backward scans, potentially dropping whole partitions from count() and LIMIT results. The non-parallel read path now also preserves the specific guard error message instead of masking it behind a generic "corrupted" string. Allocation failures on these paths are classified as OutOfMemory so that write-path Parquet merges back off and retry rather than suspending the table.
Concurrent Influx Line Protocol clients sending to WAL tables over TCP could silently corrupt each other's data, with values from one connection leaking into rows written by another. The corruption affected DECIMAL columns ingested from numeric or string values, LONG256 columns, and binary-format values cast to SYMBOL columns. The root cause was that LineTcpMeasurementScheduler shared one LineWalAppender across all network IO workers, so concurrent appendToWal calls raced on its Decimal256, Long256Impl, and DirectUtf8Sink scratch buffers. The scheduler now keeps one appender per network IO worker, indexed by worker ID. Deployments with a single network IO worker or a single connection were unaffected, as were ingestion over HTTP and QWP over WebSocket.
In a parallel GROUP BY, the order-sensitive aggregates first(), last(), first_not_null(), and last_not_null() could return a value from the wrong row. Two distinct defects were involved. First, per-row ordering over DECIMAL, GEOHASH, and IPv4 columns returned values from arbitrary rows in the group rather than the row with the smallest or largest position, because a prior change that added row-id comparison guards missed these types. Second, the last_not_null() shard merge for every value type could silently drop a real value and return null when a group's value was non-null in one worker's map but null in another with a higher row ID. Both defects were data- and timing-dependent, surfacing only when workers reduced page frames out of row order. The computeNext ordering now correctly compares row IDs for all affected types, and every last_not_null() merge now accepts the source value when the destination slot holds null, matching the guard that first_not_null() already had.
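The corrected merge rule for last_not_null() can be sketched as a small Python function. Each worker's slot is modelled as a hypothetical `(row_id, value)` tuple, with `None` standing for a null value; this is an illustration of the guard, not the engine's map-merge code.

```python
def merge_last_not_null(dest, src):
    """Merge two per-worker slots, each a (row_id, value) tuple.

    A null source never wins, and a null destination always yields to a
    non-null source, regardless of row IDs.
    """
    dest_row, dest_val = dest
    src_row, src_val = src
    if src_val is None:
        return dest
    if dest_val is None or src_row > dest_row:
        return src
    return dest
```

The case that previously lost data is a destination holding null at a higher row ID: with the guard, the non-null source value at the lower row ID is kept.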
A TableWriter.rollback() on a table with a covering POSTING index left the index's covering sidecars inconsistent with its row-id data. Depending on the data that followed, the next O3 commit either failed with a spurious CairoException: No space left and a DISTRESSED writer, or silently corrupted covered reads. Covered queries issued between the rollback and the next seal also returned null values for the covered columns. The root cause was that reencodeMonolithic() rewrote the row-id index into a fresh .pv at a bumped sealTxn but never wrote the .pc covering sidecars at that transaction. This fix makes rollbackValues() return early when no indexed value lies above the rollback point, and reencodeMonolithic() now rebuilds the covering sidecars at the new sealTxn. The seal() method now validates each cover's sidecar layout before paying the snapshot copy and falls back to a full seal on structural mismatch. The incremental seal's clean-stride copy now bounds every offset it reads and uses the stored stride-index sentinel as the last stride's upper bound, fixing a secondary defect where post-seal gen flushes compounded sidecar growth on every following incremental seal. Additionally, rollback and truncate paths now sync the .pk chain publish before recording the seal purge, closing a window where a power loss could leave a committed chain head pointing at deleted files.
This fix addresses two latent bugs on the covering-index incremental seal path, both triggered by a pure-append out-of-order commit on a partition with a sealed posting index and more than 256 symbol keys, with the batch touching only some strides. The first bug caused a crash after dropping an INCLUDE column that a covering index references — the tombstoned slot routed into the fixed-stride writer, which rejected the sentinel type -1, distressing the writer. The second bug was a silent native-memory overrun when wide INCLUDE columns (UUID, LONG128, LONG256, etc.) were present — the incremental seal path sized its scratch buffer assuming 8 bytes per value, but 16- or 32-byte types could write far past the allocation through unbounded Unsafe.copyMemory. The fix ensures writeSidecarStrideData now skips tombstoned slots (matching the full-seal path behavior), and sealIncremental sizes the dirty-stride sidecar scratch by the widest live fixed-size cover column rather than assuming 8 bytes. Covered reads of a dropped INCLUDE column continue to return NULL as designed. The scratch allocation grows proportionally to the widest cover column but remains transient, freed at the end of the seal.

June 9, 2026

QuestDB Enterprise 3.3.1 is a maintenance release on the same QuestDB 9.4.x engine as 3.3.0. Driven by continued fuzz testing and stricter query-result assertions, it hardens the storage-policy and posting-index features introduced in 3.3.0 and pulls in a batch of SQL correctness and stability fixes from the OSS engine. It also adds a custom root CA option for the replication object store and an EXCLUDE column list for column-level GRANT/REVOKE.

New Features

Column-level GRANT/REVOKE now supports an EXCLUDE clause to target all columns except a specified set without listing every column explicitly. For example: GRANT SELECT ON foo(* EXCLUDE(a, b, c)) TO my_user;. The * means all columns of the table, optionally followed by EXCLUDE(...) to omit some. At statement time, the table's current columns are enumerated, excluded ones are dropped, and the permission is granted or revoked on each remaining column individually. Columns added after the statement runs are not covered automatically. Additionally, REVOKE is now strict: a REVOKE that names a table or column which does not exist will error instead of silently succeeding, catching typos that would otherwise leave the operator believing access was revoked. GRANT remains lenient about missing tables since Influx Line Protocol can auto-create tables on first ingestion. Every column named in EXCLUDE(...) must exist on the table, and duplicates or typos produce errors. Errors are positioned at the offending token for clear diagnostics.
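The statement-time expansion of `* EXCLUDE(...)` can be sketched in Python as below. The function name `expand_columns` and the error texts are hypothetical; the sketch only mirrors the rules stated above (every excluded column must exist, duplicates are errors, and the result is a fixed list of the current columns).

```python
def expand_columns(table_columns, excluded):
    """Return the table's current columns minus the excluded ones, in table order."""
    seen = set()
    for name in excluded:
        if name not in table_columns:
            raise ValueError(f"column does not exist: {name}")
        if name in seen:
            raise ValueError(f"duplicate column in EXCLUDE: {name}")
        seen.add(name)
    # The permission is then applied to each remaining column individually.
    return [col for col in table_columns if col not in seen]
```

Because the list is computed once when the statement runs, a column added to the table later is not part of the result, which matches the note that new columns are not covered automatically.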
This feature adds two optional parameters to the object store configuration string for HTTP-based services (s3, azblob, gcs): ca_cert_file=/path/to/ca.pem trusts the CA root certificate(s) in the specified PEM file in addition to the built-in Mozilla/webpki roots, and ca_builtin_roots=false drops the built-in roots to trust only the certificates from ca_cert_file. For example: s3::region=us-east-1;root=/bucket/path;endpoint=https://minio.internal;ca_cert_file=/etc/ssl/private-ca.pem;. Both keys are rejected for the fs service. The PEM is read and parsed during configuration validation, so a missing or malformed file fails fast. This is needed because the replication object store client uses rustls with the bundled webpki-roots only, ignoring the OS trust store, SSL_CERT_FILE/SSL_CERT_DIR, and AWS_CA_BUNDLE. Private or on-prem S3/Azure/GCS-compatible stores fronted by a private CA, self-signed certificate, or TLS-intercepting proxy previously failed the TLS handshake with no way to add the trusted CA. Parsing of the object store configuration string is now also stricter: a ;-separated parameter with no = is rejected with an error instead of being silently dropped.
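The stricter parsing can be illustrated with a small Python parser for the configuration string. It is a sketch under assumptions: the function name `parse_object_store` and the error messages are hypothetical, and skipping empty segments (such as the one after a trailing `;`) is assumed rather than documented.

```python
def parse_object_store(config: str):
    """Split 'service::key=value;key=value;' into (service, params)."""
    service, sep, rest = config.partition("::")
    if not sep:
        raise ValueError("missing '::' after the service name")
    params = {}
    for part in rest.split(";"):
        if not part:
            continue  # assumed: empty segments, e.g. after a trailing ';', are ignored
        key, eq, value = part.partition("=")
        if not eq:
            # Previously such a parameter would have been dropped silently.
            raise ValueError(f"parameter without '=': {part!r}")
        params[key] = value
    tls_keys = {"ca_cert_file", "ca_builtin_roots"}
    if service == "fs" and tls_keys & params.keys():
        raise ValueError("CA options are not valid for the fs service")
    return service, params
```

Splitting each parameter on its first `=` keeps values such as `https://minio.internal` intact, and the fs check reflects the rule that both CA keys are rejected for that service.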

Bug Fixes

When a partial upload advance encountered the first pending transaction whose WAL segment was still open, it consumed zero sequencer entries and incorrectly reported an error identical to a genuine catastrophic condition (missing part files). This caused the table to drop to the slow retry batch size and sleep the retry interval, even though deferring transactions in open segments is the designed behavior for partial mode. On production systems with many long-lived open WAL segments (e.g., many low-rate Influx Line Protocol writers), this resulted in hundreds of spurious errors per hour, each triggering a slow-mode flip and retry sleep, causing upload throughput to fall behind ingestion rate. This fix ensures the uploader only reports an error when zero entries were read without curtailment (the genuine unreadable-txnlog case). Zero-entry curtailment now takes the existing graceful path with no error, no slow-mode flip, and no retry sleep, and the caller resets to the fast batch size as for any successful advance.
QuestDB Enterprise 3.3.0 beta builds created sys.sp_entries with a drop_native column, but the current schema replaced it with to_remote. These are not equivalent, and since CREATE TABLE IF NOT EXISTS cannot repair an existing beta table, the writer would write into a mislabelled column and the reader would fail to compile its to_remote query. A new system migration (SysMig5) now runs on a primary before the storage-policy reader initializes. When it detects the beta schema, it drops the storage_policies system view so it can be recreated with the current definition, renames sys.sp_entries and sys.sp_links to *_v1_backup to preserve the rows, and fails startup with an actionable message. The next start creates the current schema. The migration is also safe under the replica-first upgrade procedure: StoragePolicyCheckJob catches compilation failures on the replica and retries until the primary is upgraded and the schema changes replicate. StoragePolicyReader now detects when the table ID behind sp_entries/sp_links changes, drops cached trackers, and reloads so the replica self-heals without an extra restart.
This fix addresses a set of correctness, resource, and concurrency bugs in the posting/covering index that surfaced across partition squash, the plain O3 commit path, parquet partition reseal, and the WAL fast-lag apply path. The primary issue was instability or missing rows after partition squash on tables with a POSTING index. When commitDense() rebuilt a large or squashed partition that exceeded the spill budget (cairo.posting.index.indexer.spill.bytes.max), mid-stream compaction persisted sparse generations to the .pv file, but the subsequent dense generation write orphaned them, causing failures for covering indexes or silently dropped rows for non-covering indexes. The fix consolidates through seal() when generations already exist, re-encoding every generation into a single dense gen-0 at offset 0. Additional fixes include: parquet partitions with covering posting indexes now have their .pci/.pc sidecars rebuilt during O3 reseal instead of returning NULL for covered columns; the WAL fast-lag apply path now restores covering configuration before indexing to prevent incomplete sidecars after mid-stream spills; ALTER TABLE ... ALTER COLUMN ... SYMBOL CAPACITY now preserves covering-column indices; a leaked .pv file during parquet reseal spills is now properly cleaned up via deferred seal-purges; and all seal-purge state access is now protected by a single lock for thread safety. A bare INDEX TYPE POSTING with no INCLUDE clause is now non-covering by default — the previous implicit covering behavior was unintended. Existing tables retain their persisted covering flag.
A COUNT() over a keyed GROUP BY subquery filtered on an aggregate alias (a HAVING-style predicate) reported duplicate groups that did not exist. The root cause was in the query optimizer's column propagation: propagateTopDownColumns0() had a guard that re-added grouping keys to top-down columns to prevent column pruning, but it ran before the model's own WHERE/HAVING literals had been emitted. For the COUNT() shape where the outer query contributes no key literals, the keys were left unprotected, pruning collapsed the keyed GROUP BY into a scalar aggregate, and the result was incorrect. The fix ensures retainGroupByKeysAsTopDownColumns() runs twice — once at the original early position and once after WHERE/HAVING and ORDER BY literals have been emitted. The second pass is idempotent via alias deduplication and only adds keys where the early pass missed them.
An aggregate (COUNT(), SUM(), etc.) over a UNION ALL of aliased sub-queries could fail query compilation with an AssertionError, surfacing as a 500 in the Web Console. The root cause was that the optimizer's column-propagation machinery resolved literals across UNION boundaries by name rather than by position. When the outer query selected nothing from the union (an aggregate), this pruned one branch down to a few matching-by-name columns while leaving other branches intact, causing the branches to disagree on column count. Since UNION columns are matched by position, not name, the fix removes the superseded by-name emit in favor of the indexed, by-position propagation. A rollback flag cairo.sql.legacy.union.column.propagation (default false) is available to restore the old behavior if needed.
SHOW CREATE TABLE previously resolved any object kind and rendered a CREATE TABLE statement for it. When run against a view or materialized view, it produced misleading DDL that looked like a table definition and would have created a plain table rather than the view if executed. This fix makes SHOW CREATE TABLE throw an error when the resolved token is a view or materialized view, reporting "table name expected, got view or materialized view name." The dedicated SHOW CREATE VIEW and SHOW CREATE MATERIALIZED VIEW statements should be used instead. Additionally, the sibling guards were tightened for symmetry so that SHOW CREATE VIEW <matview> now reports "got materialized view name" and SHOW CREATE MATERIALIZED VIEW <view> reports "got view name" rather than the generic "got table name."
This fix addresses numerous engine bugs surfaced by expanded query fuzzer coverage. Key fixes include: an off-by-one error when stepping past null array map keys in OrderedMapVarSizeRecord; ASOF join light path failures for STRING/VARCHAR-to-SYMBOL joins; ClassCastException in parallel keyed GROUP BY over covering index factories; memory leaks from row cursor factories not being freed on exceptions or toTop() calls; incorrect scan direction advertisement and duplicate-key handling for multi-key covering index queries; per-worker array key memory leaks in GROUP BY; incorrect group-by alias resolution for outer columns on duplicate references; unsafe LIMIT push through trivial group-by expressions when ORDER BY columns were absent from the GROUP BY; posting-index DISTINCT output named by source token instead of projection alias; JIT narrow-operand widening failures when LONG appeared only as a literal; full-fat ASOF/LT join projection failures with cross-type SYMBOL keys; SymbolConstant.valueOf() returning non-null for VALUE_IS_NULL keys; qualified column resolution failures under DISTINCT; and a missing IN function factory for IPv4 columns. The new InIPv4FunctionFactory accepts NULL, STRING, VARCHAR, SYMBOL, IPv4, and bind variables, using LongHashSet storage to safely cover the full 32-bit IPv4 range.
This fix addresses five production bugs exposed by stronger assertion coverage. Cross join skipRows() was not re-entrant: a second call while the cursor sat partway through a master row's slave scan re-skipped the already-consumed master cursor and dropped remaining slave rows, causing LIMIT and result-size calculation over cross joins to count or skip too few rows. Multi-value indexed latest-by (<indexed symbol> IN (...) LATEST ON ts) emitted result rows in index/partition discovery order without sorting across multiple partitions, yet advertised a forward scan direction, so the optimizer elided ORDER BY ts and returned unsorted rows. Parquet random access could read freed memory under some conditions: PageFrameMemoryPool.navigateTo() early-returned when the record's frame index matched the requested one, but an AsyncFilteredRecordCursor record could be bound to a reduce task's parquet buffers that were freed eagerly on collect, causing subsequent column reads to dereference freed native memory. Dense ASOF join produced stale results when its cursor was re-read because toTop() did not reset backwardScanExhausted, causing matches to be dropped on every pass after the first. Finally, COUNT(*) over a full scan threw on an empty partition missing from disk because calculateSize() opened every partition unconditionally, unlike next() which already skips zero-row partitions.

QuestDB 9.4.2 is a hardening release that builds on 9.4.1, driven by continued fuzz testing and stricter query-result assertions. It includes bugfixes for the new posting index, parquet tables, temporal joins, and a variety of other SQL queries. This release also upgrades the Web Console, which picks up several fixes and a new utility for sharing queries with your colleagues.

New Features

This feature adds the ability to share links to queries directly from the Web Console editor. Option/Alt + L copies a link to the current query at the cursor (or the selected portion when a selection exists), while Option/Alt + Shift + L copies a link to all queries in the tab. These actions are also available from the run button dropdown as "Copy link to the query" and "Copy link to all queries" respectively, using the same mechanism as demo queries on the documentation site.
This feature adds two related capabilities for QuestDB Enterprise. The materialized view generator now correctly handles tables with STORAGE POLICY(...) by projecting each clause into the materialized view's partition unit (DAY → DAYS, MONTH → MONTHS, YEAR → YEARS), preserving the required ordering of clauses, and ladder-bumping the terminal clause so the materialized view retention outlives the source. TTL is floored at the partition unit. The table details drawer in the Web Console now surfaces storage policy information: when clauses are configured, each is rendered as a label/value card; on QuestDB Enterprise with no policy, a "Not configured" placeholder is shown. The TTL card only appears when a TTL value is actually configured.

Bug Fixes

This fix addresses a set of correctness, resource, and concurrency bugs in the posting/covering index that surfaced across partition squash, the plain O3 commit path, parquet partition reseal, and the WAL fast-lag apply path. A table with a posting index could crash the JVM (covering index) or return short/incorrect indexed counts (non-covering index) after partitions were squashed, because commitDense() assumed it wrote the first and only generation at offset 0, but a mid-stream spill flush could persist sparse generations earlier, orphaning them. The fix makes commitDense() consolidate through seal() when generations already exist. Additionally, covered reads returned NULL after an O3/squash reseal of a parquet partition because the O3 worker that rewrites a parquet partition built only the non-covering .pv and sealPostingIndexForPartition previously skipped parquet partitions entirely. On the WAL fast-lag apply path, a mid-stream spill flush during updateIndexesParallel could write an incomplete .pc covered sidecar. ALTER TABLE ... ALTER COLUMN ... SYMBOL CAPACITY did not carry over covering-column indices, so the next metadata rewrite lost the covering flag. A parquet reseal value-file leak was fixed by handing the rebuild's seal-purges to the TableWriter's deferred queue. Concurrency hardening adds a single lock for all seal-purge state accessed by parallel O3 workers.
Reading a Parquet partition could fail when an integer-family column had a DELTA_BINARY_PACKED data page containing only nulls. The integer writer short-circuited and emitted an empty values buffer with no delta header for pages with zero non-null values, causing the reader to decode block_size = 0 and reject it. This failure showed up on eager full-row-group reads such as converting a partition back to native or WAL apply re-running that conversion under replication, suspending the table on both primary and replica. The reader now treats an empty values buffer as a zero-value page and returns an iterator that yields no values, keeping already-written Parquet files readable. The writer now emits a self-describing delta header (value_count = 0) for all-null integer pages so newly written files are valid Parquet. Additionally, several reachable panics in the Parquet read path were closed — these could be triggered from the public read_parquet() SQL function or from partition conversion on malformed or foreign-produced pages. A Rust panic crosses the JNI boundary as a JVM abort with no recovery. The decoders and vendored parquet2 now reject these inputs with clean errors: DELTA_LENGTH_BYTE_ARRAY varchar pages with bit widths above 32, DELTA_BINARY_PACKED headers with invalid block sizes or zero miniblock counts, per-miniblock bit widths above 64, oversized block sizes causing multiply overflows, and ULEB128 varint values wider than 64 bits.
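As a rough illustration of the last guard in that list, the sketch below decodes a ULEB128 varint and rejects any value that would not fit in 64 bits with a clean error rather than overflowing. It is written in Python for readability and is not the vendored parquet2 code; the function name decode_uleb128 and the error messages are hypothetical.

```python
def decode_uleb128(buf, pos=0):
    """Decode one unsigned LEB128 varint from buf; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated ULEB128 varint")
        byte = buf[pos]
        pos += 1
        payload = byte & 0x7F
        # Reject any group that would push the value past an unsigned 64-bit integer.
        if shift >= 64 or (payload << shift) >> 64:
            raise ValueError("ULEB128 varint wider than 64 bits")
        value |= payload << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7
```

A well-formed input such as bytes([0xE5, 0x8E, 0x26]) decodes normally, while a ten-byte encoding whose final group carries bits above bit 63 raises ValueError.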
This fix addresses 18 engine bugs surfaced by expanded query fuzzer coverage. Key fixes include: an off-by-one in ArrayTypeDriver.getPlainValueSize() when stepping past null array map keys that caused assertion failures; ASOF Light path crashes for STRING/VARCHAR-to-SYMBOL joins due to an unimplemented SymbolJoinKeyMapping.of(RecordCursor) default; ClassCastException when parallel keyed GROUP BY ran over a CoveringIndex factory because CoveringPageFrameCursor did not extend TablePageFrameCursor; memory leaks of per-symbol PostingIndexFwdReader.Cursor instances when HeapRowCursor.of() threw mid-iteration; honest scan direction reporting and duplicate-key deduplication for multi-key CoveringIndex queries to prevent incorrect SAMPLE BY and ORDER BY behavior; per-worker array key memory leaks in GROUP BY where DirectArray backing memory was orphaned; incorrect group-by alias resolution causing wtf? errors on SELECT DISTINCT with duplicate column references; a gate on trivial-group-by LIMIT push to prevent ArrayIndexOutOfBoundsException when ORDER BY columns were absent from the GROUP BY; JIT narrow-operand widening when LONG appeared only as a literal, causing int32 wrapping divergence from the Java path; full-fat ASOF/LT join projection crashes with cross-type SYMBOL keys; SymbolConstant.valueOf() ignoring VALUE_IS_NULL keys causing stale values instead of NULL; qualified column reference resolution failures under DISTINCT; and a new InIPv4FunctionFactory for IN predicates on IPv4 columns that previously fell back to incompatible STRING comparison.
This fix addresses five production bugs exposed by strengthened query assertions. Parquet random access could read freed memory and crash the JVM: PageFrameMemoryPool.navigateTo() early-returned when the record's frame index matched the requested one, but an AsyncFilteredRecordCursor record could be bound to a reduce task's parquet buffers that were freed eagerly on collect, causing a use-after-free SIGSEGV on the next column read. The fix always re-navigates parquet records while keeping the early-return optimization for native frames. Multi-value indexed latest-by (LatestByValuesIndexedRecordCursor) emitted result rows in index/partition discovery order without sorting, yet advertised forward scan direction, so the optimizer elided ORDER BY ts and returned unsorted rows across multiple partitions. The fix sorts results into ascending designated-timestamp order. Dense ASOF join produced stale results when its cursor was re-read because toTop() did not reset backwardScanExhausted and of() reset none of the five scan fields, causing matches to be dropped on every pass after the first. Cross join skipRows() was not re-entrant — a second call while partway through a master row's slave scan re-skipped the already-consumed master cursor, causing LIMIT and result-size calculation to count too few rows. Finally, count(*) over a full scan threw on an empty partition missing from disk because calculateSize() opened every partition unconditionally unlike next() which already skips zero-row partitions.
This fix corrects a bug where count() over a keyed GROUP BY subquery filtered on the aggregate alias (a HAVING-style predicate) reported duplicate groups that did not exist. For example, SELECT count() dups FROM (SELECT ts, s, count() c FROM tab) WHERE c > 1 returned 1 instead of the expected 0 when all pairs were unique. The root cause was in propagateTopDownColumns0(), which had a guard to re-add a group-by subquery's grouping keys to its top-down columns so column pruning could not drop them, but this guard ran before the model's own WHERE/HAVING and ORDER BY literals had been emitted into the top-down list. For the count() shape, the outer query contributed no key literals, so the HAVING alias was the sole top-down contributor added only after the guard had already observed an empty list and skipped. The keys were left unprotected, pruning collapsed the keyed group-by into a scalar aggregate, and the total row count falsely passed the filter. The fix runs retainGroupByKeysAsTopDownColumns() twice: once at the original early position and once after WHERE/HAVING and ORDER BY literals have been emitted. The second pass is idempotent via addTopDownColumn() deduplication and only affects query compilation with no measurable performance change.
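To make the expected result concrete, the sketch below reproduces the query shape outside QuestDB, using SQLite through Python's sqlite3 module. SQLite needs an explicit GROUP BY where QuestDB infers the keys, and the helper name and sample rows are made up for illustration.

```python
import sqlite3


def count_duplicate_groups(rows):
    """Count (ts, s) groups that occur more than once in rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE tab (ts INTEGER, s TEXT)")
    con.executemany("INSERT INTO tab VALUES (?, ?)", rows)
    # Keyed GROUP BY subquery, filtered on the aggregate alias, then counted.
    (dups,) = con.execute(
        "SELECT count(*) AS dups FROM ("
        "  SELECT ts, s, count(*) AS c FROM tab GROUP BY ts, s"
        ") WHERE c > 1"
    ).fetchone()
    con.close()
    return dups
```

With all (ts, s) pairs unique the helper returns 0, which is the result the corrected optimizer is expected to produce for the same shape.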
An aggregate function (such as count() or sum()) over a UNION ALL of aliased sub-queries crashed query compilation with an AssertionError, surfacing as a 500 error in the Web Console. The root cause was that the SQL optimizer's column-propagation logic emitted columns across UNION boundaries by name rather than by position. When aliases differed between branches, this pruned some branches incorrectly, causing a column count mismatch. This fix removes the superseded by-name emit in favor of the correct indexed, by-position propagation. A rollback flag cairo.sql.legacy.union.column.propagation (default false) is available to restore the old behavior if needed.
Previously, SHOW CREATE TABLE resolved any object kind and rendered a CREATE TABLE statement for it. When run against a view or materialized view, it produced misleading DDL that, if executed, would have created a plain table rather than the view. This fix makes SHOW CREATE TABLE reject views and materialized views with a clear error message. The sibling guards were also tightened so that SHOW CREATE VIEW run against a materialized view (and vice versa) now reports the actual object kind in the error. Users who previously relied on SHOW CREATE TABLE for views should use SHOW CREATE VIEW or SHOW CREATE MATERIALIZED VIEW instead.
When a user logged in via SSO and then logged out, the Web Console kept the client ID and automatically re-ran the OAuth flow on page refresh, effectively preventing logout. This fix introduces an explicit SSO_SESSION_ACTIVE flag that is set to false on logout to prevent auto-login on page refresh. The page now auto-refreshes on logout to clear stale information from the previous user's session, such as grid results, preventing data leakage into the next login.
This fix updates the validation debounce timeouts in the editor to 1 second while typing and 500 ms when the cursor moves between queries.

June 3, 2026

QuestDB Enterprise 3.3.0 is a feature release built on the QuestDB 9.4.x engine. The headline additions are the new storage policy engine for tiering data between native and Parquet (local and object storage), posting index support, a new ingestion server and wire protocol with durable acknowledgements (QWP), and a move to JDK 25.

Breaking Changes

This fix reworks the storage policy SQL surface and lifecycle in QuestDB Enterprise. Previously, TO PARQUET only generated a data.parquet file alongside native columns without switching reads, and a separate DROP NATIVE step was required to flip the partition to Parquet format. This left an unused Parquet file on disk in the gap between the two operations. Now, TO PARQUET produces data.parquet and immediately removes the native columns at its TTL, eliminating the unused-file gap. DROP NATIVE is removed and replaced by TO REMOTE, which will handle uploading a partition's Parquet data to object storage once that functionality is implemented. The storage policy clauses are now TO PARQUET, TO REMOTE, DROP LOCAL, and DROP REMOTE. Example usage: ALTER TABLE t SET STORAGE POLICY(TO PARQUET 7d, TO REMOTE 30d, DROP LOCAL 90d). The drop_native column in sys.sp_entries and the storage_policies view is renamed to to_remote. TO PARQUET and TO REMOTE are not ordered relative to each other, allowing TO REMOTE < TO PARQUET for keeping both formats on local disk. The parser enforces ordering constraints such as TO PARQUET ≤ DROP LOCAL and TO REMOTE ≤ DROP LOCAL.
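The two ordering constraints named above can be sketched as a small check. This is an illustration only, in Python: it models just TO PARQUET ≤ DROP LOCAL and TO REMOTE ≤ DROP LOCAL, leaves DROP REMOTE out, and assumes TTLs are already normalized to days. The real parser may enforce further rules, and the function name is hypothetical.

```python
def check_policy_order(to_parquet=None, to_remote=None, drop_local=None):
    """Return the violated ordering constraints; TTLs are in days, None means unset."""
    errors = []
    if drop_local is not None:
        if to_parquet is not None and to_parquet > drop_local:
            errors.append("TO PARQUET must be <= DROP LOCAL")
        if to_remote is not None and to_remote > drop_local:
            errors.append("TO REMOTE must be <= DROP LOCAL")
    # TO PARQUET and TO REMOTE are deliberately not compared with each other.
    return errors
```

For example, check_policy_order(to_parquet=30, to_remote=7, drop_local=90) returns an empty list, reflecting that TO REMOTE may precede TO PARQUET.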

New Features

A storage policy defines how a table manages its cold-storage lifecycle, controlling when partitions are converted to Parquet, when native data is dropped, and when local data is removed. A policy consists of three optional settings: TO PARQUET <ttl> (convert partition to Parquet after this time), DROP NATIVE <ttl> (delete native partition, keeping only Parquet), and DROP LOCAL <ttl> (remove all local partitions). All TTL values must be positive, and later clauses must specify a TTL greater than or equal to earlier ones. Storage policies can be defined inline during table or materialized view creation (CREATE TABLE abc (...) STORAGE POLICY(TO PARQUET 3d, DROP NATIVE 10d, DROP LOCAL 1M) WAL), modified (ALTER TABLE abc SET STORAGE POLICY(...)), enabled/disabled without removal (ALTER TABLE abc DISABLE STORAGE POLICY), dropped (ALTER TABLE abc DROP STORAGE POLICY), and queried (SELECT * FROM storage_policies). The implementation includes a two-pipeline processing system: Pipeline 1 handles PARQUET_CONVERSION (reader-only) followed by PARQUET_COMMIT (writer-required), while Pipeline 2 handles direct PARQUET_COMMIT for partitions already converted but needing to switch to Parquet-only format. Staleness detection uses a squash tracker mechanism with an 8-bit counter, overflow fallback via timestamp file, and row count comparison to prevent data loss from in-place squashes. Transaction metadata uses bit 60 (parquetGenerated) and bit 61 (parquetFormat) flags per partition. Only inactive partitions can be converted, and tables must have at least 2 partitions. The feature includes four new permissions (SET_STORAGE_POLICY, REMOVE_STORAGE_POLICY, ENABLE_STORAGE_POLICY, DISABLE_STORAGE_POLICY), configurable check intervals and retry behavior, and SHOW CREATE TABLE/SHOW CREATE MATERIALIZED VIEW output includes the storage policy clause. Legacy TTL settings are deprecated in QuestDB Enterprise — SET TTL <non-zero> is rejected, but SET TTL 0 is allowed as a migration path. 
An ObjectStoreParquetDispatcher interface is scaffolded for future remote object-store integration, with DROP REMOTE syntax reserved but not yet supported.
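The per-partition flags mentioned above (bit 60 for parquetGenerated, bit 61 for parquetFormat) amount to ordinary bit masking on a 64-bit word. The sketch below shows the idea in Python; which field of the transaction metadata actually carries these bits is not stated here, so the word is treated as an opaque integer and the helper names are hypothetical.

```python
PARQUET_GENERATED_BIT = 60  # data.parquet has been produced for the partition
PARQUET_FORMAT_BIT = 61     # the partition is read in Parquet format
MASK_64 = (1 << 64) - 1


def set_flag(word, bit):
    """Return word with the given bit set, kept within 64 bits."""
    return (word | (1 << bit)) & MASK_64


def clear_flag(word, bit):
    """Return word with the given bit cleared."""
    return word & ~(1 << bit) & MASK_64


def has_flag(word, bit):
    """True when the given bit is set in word."""
    return (word >> bit) & 1 == 1
```

Because the flags occupy high bits, any lower-order payload stored in the same word is left untouched by these helpers.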
This feature connects the Rust WAL uploader to the DurableAckRegistry interface so that QuestDB Wire Protocol connections receive STATUS_DURABLE_ACK frames once committed data reaches the object store. A new DurableUploadRegistry implements DurableAckRegistry using a lock-free ConcurrentHashMap<AtomicLong> to track per-table upload watermarks via monotonic CAS. It handles native UTF-8 directory name pointers from Rust without heap-allocating on the hot path, using DirectUtf8String for ASCII fast-path lookup and only materializing a heap String on first sight of a table. When a table is dropped, a MAX_VALUE sentinel is set so pending durable-ack entries unblock immediately, and late uploads after drop are absorbed by the CAS. The Rust uploader calls a new segmentUploaded JNI method after each successful segment and index upload. Rust JNI error handling was hardened: get_current_env() was removed in favor of try_get_current_env() returning Option, so log calls from non-JNI threads fall back to eprintln! instead of panicking, and call_method/call_void_method return anyhow::Error instead of panicking on missing JVM or unexpected return types.
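The watermark behavior described above (monotonic advance, a MAX_VALUE sentinel on drop, late uploads absorbed) can be sketched as follows. This is a simplified Python model, not the DurableUploadRegistry implementation: it uses a lock where the real code is described as lock-free CAS, and the class and method names are hypothetical.

```python
import threading

MAX_VALUE = 2**63 - 1  # stands in for Java's Long.MAX_VALUE


class UploadWatermarks:
    """Per-table upload watermark that only ever moves forward."""

    def __init__(self):
        self._lock = threading.Lock()
        self._watermarks = {}

    def segment_uploaded(self, table_dir, seq_txn):
        # Monotonic update: a stale or late value never lowers the watermark.
        with self._lock:
            if seq_txn > self._watermarks.get(table_dir, -1):
                self._watermarks[table_dir] = seq_txn

    def table_dropped(self, table_dir):
        # Sentinel: every pending durable-ack for the table unblocks at once.
        with self._lock:
            self._watermarks[table_dir] = MAX_VALUE

    def is_durable(self, table_dir, seq_txn):
        with self._lock:
            return self._watermarks.get(table_dir, -1) >= seq_txn
```

An upload reported after table_dropped() compares below the sentinel and is ignored, which mirrors how late uploads are absorbed.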
This feature adds enterprise-side plumbing for the posting index. Backup support recognizes new posting-index file types (.pk, .pv, .pci, .pd, .pc0..pcN), including multi-segment versioned forms. DDL and backup-restore paths now track per-column index type and covering column sets through getIndexType and getCoveringColumnIndices delegates in create table, materialized view, and view operation implementations. The MetadataService.addColumn and changeColumnType methods accept a byte indexType parameter instead of boolean isIndexed. The REINDEX permission check now covers posting-indexed columns alongside legacy bitmap-indexed ones when granted at the table level without an explicit column list. The WAL transfer path removes a per-call String allocation by writing the partId-prefixed upload.pending file via Path.put. Rust uploader error handling was improved by replacing unwrap()/expect()/debug_assert_ne! with Result and ? propagation, and a divide-by-zero guard was added for corrupted V2 headers.

Improvements

This improvement introduces two related changes to QuestDB Enterprise replication. First, upload-amplification observability is added via per-table useful_* upload counters that make write amplification measurable, including replication_up_uploaded_bytes_total, replication_up_uploads_total, replication_up_useful_upload_bytes_total, and replication_up_useful_uploads_total. Amplification ratios can be computed as a single clean division (e.g., rate(uploaded_bytes_total[5m]) / rate(useful_upload_bytes_total[5m])). Second, the primary throttle window is lowered from 10s to 1s on all cloud backends (S3, Azure, GCS, FS), and the index-upload coalescer is enabled per cloud class, reducing open-segment replica lag from ~10s to ~2s. The index-upload coalescer merges all tables' index updates into at most one PUT per interval for the whole primary, decoupling index-PUT rate from both the window and the table count. For non-GCS backends, the index-upload throttle is set to 10ms, capping index PUTs at 100/s per primary regardless of fleet size. For GCS, the 1s throttle respects the per-object write cap but operates at the limit with no margin. The idle CPU usage of the coalescer wake-up was also fixed, reducing idle primary CPU from ~107% to ~7% of one core. A non-GCS deployment that explicitly set replication.primary.throttle.window.duration or replication.primary.requests.retry.interval to a sub-10ms value and left the index-upload throttle at default will now fail to boot; setting replication.primary.index.upload.throttle.interval=0 restores the previous behavior.
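Two pieces of arithmetic in this entry are easy to check by hand. The Python sketch below computes the amplification ratio from the two byte-rate series and the upper bound on index PUTs implied by a throttle interval, assuming at most one coalesced PUT per interval. The function names are hypothetical, and the inputs are whatever your metrics system returns for the rate() expressions.

```python
def amplification(uploaded_bytes_rate, useful_bytes_rate):
    """Upload amplification: total uploaded bytes per useful uploaded byte."""
    if useful_bytes_rate <= 0:
        return None  # no useful uploads in the window, so the ratio is undefined
    return uploaded_bytes_rate / useful_bytes_rate


def max_index_puts_per_second(throttle_interval_ms):
    """Upper bound on index PUTs when at most one PUT is issued per interval."""
    if throttle_interval_ms <= 0:
        raise ValueError("throttle interval must be positive")
    return 1000 / throttle_interval_ms
```

With the 10ms throttle quoted for non-GCS backends, max_index_puts_per_second(10) gives the 100/s cap mentioned above.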

Bug Fixes

The SQL validation endpoint (/api/v1/sql/validate) is intended to check syntax without side effects or authorization enforcement. However, QuestDB Enterprise was executing DDL inline during compilation and enforcing permissions, so validating statements like CREATE USER, GRANT, REVOKE, BACKUP DATABASE, and storage policy changes would actually modify server state or fail on authorization. This fix guards each mutating call in EntSqlCompilerImpl with isValidationOnly(), preserving parsing and query-type reporting so validation still reports syntax errors and the correct statement type. A new EntValidationSecurityContext extends the allow-all context to bypass authorization during validation while delegating identity information (principal, session principal, auth type) to the real context. All QuestDB Enterprise security contexts implement asValidationContext() with lazy caching. OWNED BY parsing during CREATE TABLE, CREATE MATERIALIZED VIEW, and CREATE VIEW also reads identity through this validation view.
This fix enables recovery of sequencer metadata from the committed WAL log, ensuring metadata consistency after failures.

QuestDB 9.4.1 is a hardening release that builds on 9.4.0. The introduction of our new SQL fuzz-testing engine, together with test framework improvements, has flushed out more than 60 latent bugs around query correctness and resource leaks. 9.4.1 also brings additional hardening and bugfixes for the newly released posting/covering indexes, performance enhancements for materialized views, upgrades to parquet querying, and new aggregate and window functions.

New Features

This feature introduces the array_agg() aggregate function that collects per-group double values or arrays into a DOUBLE[] result. Both array_agg(DOUBLE) and array_agg(DOUBLE[]) preserve input order across keyed and non-keyed GROUP BY and SAMPLE BY, including parallel execution. The function supports FILL(NONE), FILL(NULL), and FILL(PREV) in SAMPLE BY, while FILL(LINEAR) and FILL(VALUE) are rejected at compile time. NULL inputs to array_agg(DOUBLE) are preserved as null elements in the output array, while NULL or empty arrays passed to array_agg(DOUBLE[]) are skipped during concatenation. The output column type is DOUBLE[], suitable for downstream array functions such as array_count, array_sum, and indexing. Memory usage scales with total element count at 16 bytes per element in the build buffer, and CairoConfiguration.maxArrayElementCount caps per-group element count. As part of this change, several pre-existing SAMPLE BY ... FILL correctness bugs were fixed: fill values applied after aggregate arithmetic now return the correct fill value instead of the fill value transformed by the arithmetic; duplicate aggregates with different fill values now each receive their own fill; and multi-fill validation now correctly matches each aggregate against its corresponding fill entry using 0-based indexing.
This feature introduces regr_r2(y, x), the standard SQL coefficient-of-determination aggregate that reports how well a linear regression of Y on X fits the data on a 0–1 scale. This is useful for separating real time-series trends from noise — for example, finding sensors whose temperature is trending up with high confidence versus those drifting randomly. The function implements SQL:2003 §10.9 semantics: when Sxx = 0 it returns NULL (covering single-row and constant-X cases), when Syy = 0 with Sxx ≠ 0 it returns 1.0 (a horizontal line is a perfect fit when Y is constant), and otherwise returns Sxy² / (Sxx · Syy). The implementation introduces a shared AbstractRegressionGroupByFunction base class with a 6-slot Welford state (meanY, Syy, meanX, Sxx, Sxy, count) that provides per-row update and Chan parallel merge in one place. The existing regr_slope and regr_intercept functions were migrated onto this shared base, significantly reducing their code size. The semantics diverge from corr() at one edge: when Y is constant and X varies, regr_r2 returns 1.0 per SQL:2003, while corr() returns NULL.
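The six-slot state and its two operations can be sketched in a few lines. The Python below is an illustrative model of a Welford-style per-row update, a Chan-style merge of two partial states, and the final regr_r2 rule with the NULL and 1.0 edge cases. It is not the AbstractRegressionGroupByFunction code, and all names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass
class RegrState:
    mean_y: float = 0.0
    syy: float = 0.0
    mean_x: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0
    count: int = 0


def update(state, y, x):
    """Welford-style single-row update of means and centered sums."""
    state.count += 1
    dx = x - state.mean_x
    dy = y - state.mean_y
    state.mean_x += dx / state.count
    state.mean_y += dy / state.count
    state.sxx += dx * (x - state.mean_x)
    state.syy += dy * (y - state.mean_y)
    state.sxy += dx * (y - state.mean_y)


def merge(a, b):
    """Chan-style merge of two partial states, as used for parallel execution."""
    if a.count == 0:
        return replace(b)
    if b.count == 0:
        return replace(a)
    n = a.count + b.count
    dx = b.mean_x - a.mean_x
    dy = b.mean_y - a.mean_y
    w = a.count * b.count / n
    return RegrState(
        mean_y=a.mean_y + dy * b.count / n,
        syy=a.syy + b.syy + dy * dy * w,
        mean_x=a.mean_x + dx * b.count / n,
        sxx=a.sxx + b.sxx + dx * dx * w,
        sxy=a.sxy + b.sxy + dx * dy * w,
        count=n,
    )


def regr_r2(state):
    """None when Sxx = 0, 1.0 when Syy = 0, otherwise Sxy^2 / (Sxx * Syy)."""
    if state.count == 0 or state.sxx == 0:
        return None
    if state.syy == 0:
        return 1.0
    return state.sxy ** 2 / (state.sxx * state.syy)


def fold(pairs):
    """Build a state from an iterable of (y, x) pairs."""
    state = RegrState()
    for y, x in pairs:
        update(state, y, x)
    return state
```

Folding the data in one pass, or folding two halves separately and merging them, should give the same result up to floating-point rounding, which is what makes the merge suitable for parallel workers.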
This feature adds window function support for all six DECIMAL sub-types (D8, D16, D32, D64, D128, D256). The covered functions include first_value, last_value, nth_value, lag, lead, min, max, count, sum, avg, and avg(x, n). Each factory covers the same frame shapes already supported by existing primitive-typed window functions: whole partition, current row, ROWS BETWEEN, RANGE BETWEEN, partitioned and non-partitioned forms. For output types, first_value, last_value, nth_value, lag, lead, min, and max return the input decimal type; count returns LONG; sum widens to reduce overflow risk (D8/D16 widen to D64, D32/D64 widen to D128, D128/D256 widen to D256); avg(x) matches the input type; and avg(x, n) returns D256 with target scale n. NULL inputs are skipped in aggregations and propagate through value-access functions per existing window semantics. D128 and D256 ring-buffer slots are 16 and 32 bytes respectively, so memory usage for wide windows over these column types is correspondingly larger than for primitive numeric window functions.
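The output-type rules listed above reduce to a small lookup. The Python sketch below encodes them as stated; the helper name and the has_scale_arg flag (used to tell avg(x) from avg(x, n)) are hypothetical and only serve the illustration.

```python
# Result type of sum() per input decimal sub-type, as listed above.
SUM_WIDENING = {
    "D8": "D64", "D16": "D64",
    "D32": "D128", "D64": "D128",
    "D128": "D256", "D256": "D256",
}

SAME_TYPE_FUNCTIONS = {
    "first_value", "last_value", "nth_value", "lag", "lead", "min", "max",
}


def window_output_type(function, input_type, has_scale_arg=False):
    """Return the result type name of a decimal window function."""
    if input_type not in SUM_WIDENING:
        raise ValueError(f"not a decimal sub-type: {input_type}")
    if function in SAME_TYPE_FUNCTIONS:
        return input_type
    if function == "count":
        return "LONG"
    if function == "sum":
        return SUM_WIDENING[input_type]
    if function == "avg":
        # avg(x, n) returns D256 with target scale n; avg(x) keeps the input type.
        return "D256" if has_scale_arg else input_type
    raise ValueError(f"unsupported window function: {function}")
```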

Improvements

This improvement introduces six new configuration properties that provide opt-in byte-denominated caps on native memory growth for window function and ORDER BY operators: cairo.sql.window.cache.max.bytes, cairo.sql.window.rowid.max.bytes, cairo.sql.window.tree.max.bytes, cairo.sql.sort.key.max.bytes, cairo.sql.sort.light.value.max.bytes, and cairo.sql.sort.value.max.bytes. All caps are unset (uncapped) by default; setting any of them bounds the operator and raises a LimitOverflowException with a hint naming the specific configuration key that needs to be raised. The previous page-based configuration keys (cairo.sql.window.tree.max.pages, cairo.sql.window.rowid.max.pages, cairo.sql.sort.key.max.pages, cairo.sql.sort.light.value.max.pages, cairo.sql.sort.value.max.pages) are now deprecated but continue to be parsed — if a user has one set, the derived byte default becomes pageSize * maxPages, and an explicit new *.max.bytes value takes precedence when both are set. When sort-key materialization is engaged, the cairo.sql.sort.key.max.bytes budget is split across materialized column buffers in proportion to each column's fixed-size width, so each buffer's row capacity stays roughly balanced. Page-size configuration keys now clamp to a minimum of 1 byte at read time, preventing misconfigured zero values from propagating into downstream divisions.
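The precedence between the new byte caps and the deprecated page-based keys, and the proportional split of the sort-key budget, can be sketched as below. This is a Python illustration under stated assumptions: integer division is used for the split, and the function names are hypothetical rather than taken from the engine.

```python
def effective_max_bytes(max_bytes=None, page_size=None, max_pages=None):
    """Resolve a byte cap; None means the operator stays uncapped."""
    if max_bytes is not None:
        return max_bytes  # an explicit *.max.bytes value always wins
    if page_size is not None and max_pages is not None:
        # A deprecated *.max.pages key is set: derive pageSize * maxPages,
        # clamping the page size to at least 1 byte.
        return max(page_size, 1) * max_pages
    return None


def split_sort_key_budget(budget_bytes, column_widths):
    """Split a byte budget across columns in proportion to their fixed widths."""
    total = sum(column_widths)
    if total <= 0:
        raise ValueError("column widths must sum to a positive value")
    return [budget_bytes * width // total for width in column_widths]
```

For a 1200-byte budget over a 4-byte and an 8-byte column, the split is 400 and 800 bytes, so both buffers hold 100 rows and their row capacities stay balanced.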
This improvement adds a cost-aware clustering pass to materialized view refresh that skips unchanged buckets when an out-of-order historical write lands far behind the current commit position. Previously, the incremental refresh scanned every non-empty bucket between the O3 timestamp and the current position, including pre-existing ones that no WAL transaction touched. The optimization uses two rolling exponential moving averages on materialized view state (average commit latency and average scan latency per timestamp-unit) to drive a gap-width threshold below which adjacent cached intervals are merged into clusters. Each cluster receives its own iterator step sized to its width, so a step-group never straddles two clusters and the existing gap-skip excises gaps cheaply. Benchmarks show improvements from 160ms down to ~1.7ms (94x faster) for a 512-symbol materialized view with a 720-minute O3 lag. The materialized_views() function gains three columns (refresh_avg_commit_nanos, refresh_avg_scan_nanos_per_ts_unit, refresh_gap_threshold_ts_units) for operator visibility into the cost model, and REFRESH MATERIALIZED VIEW <name> STATS resets the EMAs when workload shape changes. A new configuration key cairo.mat.view.refresh.max.clusters (default 32) caps the number of clusters per refresh to prevent pathological many-disjoint-intervals workloads from emitting hundreds of tiny commits.
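The clustering step can be pictured with a short sketch. The Python below merges sorted intervals whose gap is narrower than a threshold. How the engine derives that threshold from the two EMAs is not spelled out above, so the formula used here (average commit latency divided by average scan latency per timestamp unit, meaning a gap is worth scanning only while that is cheaper than an extra commit) is an assumption, the max-clusters cap is not modeled, and all names are hypothetical.

```python
def gap_threshold(avg_commit_nanos, avg_scan_nanos_per_ts_unit):
    """Assumed cost model: gap width at which scanning costs as much as a commit."""
    if avg_scan_nanos_per_ts_unit <= 0:
        return float("inf")
    return avg_commit_nanos / avg_scan_nanos_per_ts_unit


def cluster_intervals(intervals, threshold):
    """Merge (lo, hi) intervals whose gap to the previous cluster is below threshold."""
    clusters = []
    for lo, hi in sorted(intervals):
        if clusters and lo - clusters[-1][1] < threshold:
            clusters[-1][1] = max(clusters[-1][1], hi)
        else:
            clusters.append([lo, hi])
    return [tuple(cluster) for cluster in clusters]
```

With a threshold of 5, the intervals (0, 10), (12, 20) and (500, 510) collapse into two clusters, so the wide gap between 20 and 500 is never scanned.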
This improvement prevents materialized views from emitting bursts of no-op replace-range WAL transactions when the base table has an apply backlog. Previously, period views bypassed a guard that checked whether the base transaction watermark had actually advanced, causing each refresh iteration to commit a no-rows resetMatViewState WAL transaction even when no new data existed. Under sustained backlog, a single view could emit millions of such transactions in minutes, bloating the view's WAL and stalling WAL apply on replicas. The fix ensures insertAsSelect() only commits a watermark transaction when it actually advances (commitBaseTxn > lastRefreshBaseTxn or commitPeriodHi > lastPeriodHi). Additionally, the refresh acknowledgement logic now tracks the minimum base transaction examined across refreshed views, allowing the clean/dirty transaction handshake to converge sooner and reducing redundant re-enqueues.
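The commit guard reduces to a single predicate over the two watermarks. A minimal Python rendering of the condition quoted above follows; the snake_case parameter names mirror the identifiers in the text and the function name is hypothetical.

```python
def should_commit_watermark(commit_base_txn, last_refresh_base_txn,
                            commit_period_hi, last_period_hi):
    """Commit a watermark transaction only when at least one watermark advances."""
    return (commit_base_txn > last_refresh_base_txn
            or commit_period_hi > last_period_hi)
```

When neither value moves, the refresh iteration emits nothing, which is what stops the burst of no-op transactions under a sustained apply backlog.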
This improvement reworks the parallel and serial execution of the twap() and sparkline() aggregate functions, achieving 2-3x speedups on common parallel and concurrent workloads. The parallel aggregates buffer per-group observations and sort them at merge and read time. Previously, the sort recovered run boundaries by scanning for key decreases, which conflated page frames whose key ranges ascended across a gap and forced element-wise merging. The new approach records exact per-frame batch boundaries in a lazily allocated descriptor buffer, allowing the sort to permute whole batches with single bulk copies. Consecutive frames are coalesced into a single batch, avoiding descriptor allocation entirely for groups whose frames arrive in order. Additional hot-path tuning includes insertion sort for small batch counts (n ≤ 16), pre-grown scratch lists, and consolidated memory allocation calls. The high-cardinality memory-stress scenario (1M groups, ~10 observations per group) regresses by approximately 15% due to per-group state widening from 24 to 56 bytes without enough observations to amortize the new merge step. The twap() function now validates at compile time that its timestamp argument is the table's designated timestamp, since the batch model relies on page frames being pre-sorted by that column.
This improvement enables parallel execution of the low-precision approx_percentile() function when used with LONG column arguments by replacing the heap-based HdrHistogram with an off-heap GroupByHistogram in ApproxPercentileLongGroupByFunction.
This improvement removes the non-keyed vector group-by factory (GroupByNotKeyedVectorRecordCursorFactory) and routes all non-keyed aggregation queries through AsyncGroupByNotKeyedRecordCursorFactory exclusively. To maintain parity, new batch-aware group-by functions were added: avg(int), avg(long), min(short), and max(short). Existing functions gained batch implementations on the async path: ksum(double), nsum(double), and sum(long256). Additionally, min(timestamp) and max(timestamp) now short-circuit the column scan when the argument is the designated timestamp by reading the first or last row of the page frame directly, reflected in query plans as min_designated / max_designated. A bug in SumShortGroupByFunction.getComputeBatchArgType that returned INT instead of SHORT was also corrected. Query plan output changes for non-keyed aggregations that previously surfaced as GroupByNotKeyed Vectorized and for designated timestamp min/max aggregations.
Previously, _pm sidecar files produced by Mig940, TableSnapshotRestore, and the attach-existing-parquet code path in TableWriter did not inline Parquet bloom-filter bitsets. As a result, queries with equality predicates on bloom-indexed columns had to read the bitset from data.parquet at plan time on partitions that arrived through one of these paths. This improvement closes that gap by having convert_from_parquet read each chunk's bitset via parquet2::bloom_filter::read_from_slice_at_offset and inline it through RowGroupBlockBuilder::add_bloom_filter. The ParquetMetadataWriter.generate JNI bridge now always mmaps the parquet file and passes that slice, so all three Java call sites produce _pm files identical to the write path's output. MIGRATION_VERSION bumps from 428 to 429, so installs that already ran the 428 migration regenerate and overwrite their existing _pm files on the next startup, a one-off cost comparable to the original 428 migration run. The _pm file size grows by the size of any inlined bitsets for tables with bloom configuration, while tables without bloom configuration see no size change.

Bug Fixes

This fix addresses 55 distinct bugs across SQL query execution, JIT compilation, parallel aggregation, and resource management. Key corrections include: parallel reduce/filter now preserves original exception types instead of wrapping them as CairoException; SymbolFunction.getStrLen() no longer throws UnsupportedOperationException; SAMPLE BY ... FILL(value) properly rejects incompatible fill/aggregate type combinations; JIT no longer silently truncates out-of-range BYTE/SHORT literals or computes narrow arithmetic at incorrect widths; cast-to-symbol functions are now marked as thread-unsafe to prevent shared mutable state across parallel workers; parquet pushdown no longer truncates overflow-folded long constants on narrow-int columns; WhereClauseParser no longer clobbers FALSE intrinsic values set by earlier conjuncts; MinCharGroupByFunction.merge() no longer overruns the 2-byte CHAR slot; string/varchar comparison short-circuits on NULL constants now correctly handle negation; LIMIT push-down no longer propagates past GROUP BY, SAMPLE BY, WINDOW, or HORIZON JOIN; nested INT arithmetic in LONG-context predicates now correctly widens through subtrees via getLong(); and multiple resource leaks in factory construction error paths are resolved. Three behavior changes are included: INT - FLOAT now returns FLOAT instead of DOUBLE (matching the other arithmetic operators); column <= null and column >= null for STRING/VARCHAR now return matching NULL rows under QuestDB's NULL = NULL -> true convention; and nested INT arithmetic getLong() overrides now recurse through subtrees at long width.
This fix addresses nine latent production bugs across multiple components. Array function factories (DoubleArrayAddFunctionFactory, DoubleArrayDivFunctionFactory, DoubleArrayMultiplyFunctionFactory, DoubleArraySubtractFunctionFactory) now restore the declared array type before reusing the output buffer, so a null array row preceding a non-null one no longer causes failures. GenerateSeriesTimestampRecordCursorFactory and GenerateSeriesTimestampStringRecordCursorFactory now report scan direction only for constant steps, returning SCAN_DIRECTION_OTHER for bind-variable steps instead of reading from an unbound function. AsyncWindowJoinRecordCursor.calculateSize() now correctly advances the cursor past the last frame, preventing a subsequent hasNext() from wrongly returning true. AsyncFilteredRecordCursorFactory.recordCursorSupportsRandomAccess() no longer incorrectly delegates to the base factory, since the async-filtered cursor always supports random access through its own page-frame memory pool. PageFrameRecordCursorImpl.toTop() now properly releases the row cursor through Misc.free instead of nulling it without freeing, preventing a resource leak.
This fix resolves a SIGSEGV crash that occurred when an out-of-order or WAL commit deduplicated against a STRING dedup key whose per-partition data file had grown beyond 2 GiB. The native dedup merge comparer read the variable-length data offset into a 32-bit integer, but offsets are stored as 64-bit values. Once a partition's STRING .d file exceeded 2 GiB, offsets greater than or equal to 2^31 were truncated to negative values, causing a wild pointer dereference. Because this occurred during WAL apply or O3 commit, the offending transaction was replayed on every restart, causing an affected table to enter a boot-time crash loop. The fix reads the offset as int64_t. BINARY columns already used int64_t and VARCHAR uses a separate comparer, so neither is affected.
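The truncation behind this crash can be reproduced outside the engine. The sketch below is illustrative only: the helper names are hypothetical, little-endian storage is assumed, and Python's struct module stands in for the native comparer.

```python
import struct


def read_offset_as_int32(raw: bytes) -> int:
    """Mimic the bug: read only the low 4 bytes of a 64-bit offset as a signed int."""
    return struct.unpack_from("<i", raw, 0)[0]


def read_offset_as_int64(raw: bytes) -> int:
    """Mimic the fix: read the full 8-byte signed offset."""
    return struct.unpack_from("<q", raw, 0)[0]


# An offset just past the 2 GiB mark, stored as a 64-bit value (assumed little-endian).
raw = struct.pack("<q", 2**31 + 16)

print(read_offset_as_int32(raw))  # -2147483632: a negative "offset", i.e. a wild pointer
print(read_offset_as_int64(raw))  # 2147483664: the real offset
```

Offsets below 2^31 read the same either way, which is why the defect stayed latent until a partition's STRING data file crossed 2 GiB.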
This fix resolves a crash (SIGSEGV) or stale data issue when querying only columns that were added via ALTER TABLE ADD COLUMN after a partition was converted to Parquet. When the projected column set contained no column present in the Parquet file, PageFrameMemoryPool.ParquetBuffers.decode() skipped sizing and zeroing the per-column page-address lists, leaving them with stale or uninitialized native memory. PageFrameMemoryRecord then dereferenced invalid page addresses, causing either a crash on freshly allocated buffers or wrong non-NULL values on reused buffers. The fix hoists remapColumns() out of the parquetColumns.size() > 0 guard so address lists are always sized and zeroed, and adds an early return for zero-column reads such as COUNT(*). Absent Parquet columns now correctly read as NULL, matching native-partition behavior. Queries that read at least one column present in the Parquet file are unaffected.
This fix corrects TTL validation for materialized views that use months-based TTL values. QuestDB encodes TTL as a single integer where positive values represent hours and negative values represent months. When a materialized view omitted an explicit PARTITION BY, the derived partition validation guard only checked for positive TTL values (ttlHoursOrMonths > 0), allowing months-based TTL values to bypass granularity validation entirely. This could result in invalid configurations, such as TTL 7 months being accepted on a view with a derived YEAR partition, even though 7 months is not a whole number of years. The fix widens the guard to ttlHoursOrMonths != 0, ensuring months-based TTL values go through the same PartitionBy.validateTtlGranularity() check as hours-based ones. The error message for unrecognized tokens after ALTER MATERIALIZED VIEW <name> now also lists set as a valid continuation.
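The effect of widening the guard can be sketched as follows. This is a simplified model, not the engine's code: validate_ttl_granularity is a hypothetical stand-in for PartitionBy.validateTtlGranularity() that only models the months-on-YEAR case from the example above.

```python
MONTHS_PER_YEAR = 12


def validate_ttl_granularity(ttl_hours_or_months: int, partition_by: str) -> None:
    """Hypothetical stand-in: reject a months TTL that is not a whole number of years."""
    if partition_by == "YEAR" and ttl_hours_or_months < 0:
        months = -ttl_hours_or_months  # negative values encode months
        if months % MONTHS_PER_YEAR != 0:
            raise ValueError(f"TTL {months} months is not a whole number of years")


def check_old(ttl_hours_or_months: int, partition_by: str) -> None:
    # Old guard: only positive (hours) values reach validation.
    if ttl_hours_or_months > 0:
        validate_ttl_granularity(ttl_hours_or_months, partition_by)


def check_new(ttl_hours_or_months: int, partition_by: str) -> None:
    # New guard: any non-zero TTL, hours or months, is validated.
    if ttl_hours_or_months != 0:
        validate_ttl_granularity(ttl_hours_or_months, partition_by)


check_old(-7, "YEAR")  # silently accepted: the months value bypasses validation
try:
    check_new(-7, "YEAR")
except ValueError as err:
    print(err)  # TTL 7 months is not a whole number of years
```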
This fix resolves an issue where tables with a posting-indexed symbol column (TYPE POSTING, TYPE POSTING DELTA, TYPE POSTING EF) could not be converted to Parquet via storage policy. The TableWriter.linkPartitionIndexFiles method only hard-linked the legacy bitmap layout (.k / .v files), but posting-indexed columns have no .v file — their value data lives in .pv.<colTxn>.<sealTxn> plus .pci sidecar and per-column .pc<N>.{colTxn}.{sealTxn} data files. The link against the non-existent .v returned ENOENT and the commit aborted, leaving the partition stuck partway through the switch on every retry. A new helper linkColumnIndexFiles now owns the per-column hard-link step, routing key/value file paths through IndexFactory which resolves to the correct files for both bitmap and posting index types. For posting columns, it also hard-links the .pci sidecar and per-column data files for the live generation.
A bound check was missing in the asmjit BitVectorRangeIterator::next_range against the iterator's end position. Without it, the iterator could surface a free bit past the search-region end whenever the end landed mid-BitWord. The caller then computed a range size via unsigned subtraction that underflowed, causing JitAllocator::alloc to accept an area index outside the block. The returned Span reported its requested size while the underlying memory ran past the block boundary into unmapped pages. This manifested in production as a SIGSEGV inside JitRuntime::_add's memcpy on a network worker JIT-compiling a SQL filter. This fix updates the asmjit dependency to include the corrected bound check.
The corr, stddev_pop, stddev_samp, var_pop, var_samp, covar_pop, and covar_samp functions returned NULL on sparse-NULL data when using parallel GROUP BY. The merge step that combines per-worker partial results did not account for empty partials: merging an empty partial propagated a NaN value, which surfaced as a SQL NULL. This fix adds a merge guard that skips empty partials in all affected aggregates.
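A merge guard of this kind can be sketched with a running-moments accumulator. The release notes do not say which algorithm the engine uses; the sketch assumes the common Welford update with a pairwise (Chan et al.) merge, and the Partial class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def add(self, x: float) -> None:
        """Welford's single-pass update."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)


def merge(dest: Partial, src: Partial) -> None:
    """Fold src into dest, skipping empty partials on either side."""
    if src.count == 0:
        return  # guard: an empty partial contributes nothing
    if dest.count == 0:
        dest.count, dest.mean, dest.m2 = src.count, src.mean, src.m2
        return
    total = dest.count + src.count
    delta = src.mean - dest.mean
    dest.m2 += src.m2 + delta * delta * dest.count * src.count / total
    dest.mean += delta * src.count / total
    dest.count = total


def var_pop(p: Partial):
    return p.m2 / p.count if p.count else None


# Three workers; the middle one saw only NULLs and therefore holds an empty partial.
workers = [Partial(), Partial(), Partial()]
for x in (1.0, 2.0, 3.0):
    workers[0].add(x)
for x in (4.0, 5.0, 6.0):
    workers[2].add(x)

result = Partial()
for w in workers:
    merge(result, w)
print(var_pop(result))
```

Without the two early returns, the division by total would be a division by zero whenever both sides are empty; in floating-point native code that yields NaN rather than an error, which matches the symptom described above.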
This fix addresses several gaps in the ALTER ADD POSTING INDEX ... INCLUDE (covering index) Parquet path. Seal never read Parquet column data, was not exercised for a WAL table's last partition, and could leak file descriptors or temp files on error paths. Large sealed blocks and high-cardinality multi-key covering scans could also exceed RSS_MEM_LIMIT, and one multi-key resume branch could loop forever returning the same frame. Key changes include: Parquet-aware seal that routes Parquet partitions through a dedicated indexParquetPartition path with batched materialisation across row groups; streaming FSST compression that splits the one-shot compress into a four-call lifecycle to keep anonymous-heap scratch in the low MiBs regardless of stride size; chunked FSST decompression that imports the symbol table once per block and decompresses 256 values per access instead of the entire block; immediate buffer freeing during cursor grow operations instead of accumulating all prior generations; a hard cap of 1,000,000 rows per PageFrame with resumable fill via parked RowCursor; and a fix to advance currentKeyIdx in the multi-key resume branch when a parked RowCursor drains, preventing infinite loops. Robustness improvements include pre-deleting leftover temp files before WAL-apply retries, tracking both temp file paths before mmapping, and using quiet removal to avoid masking real failures.
Indexed WHERE col = null could return zero rows from a Parquet partition whose SYMBOL column was entirely NULL in a row group. The Parquet decoder has a documented optimisation where it skips materialising the buffer and returns size = 0 when a column chunk's stats report null_count == num_values. Three call sites that rebuild a covering index from a decoded Parquet chunk shared the same defect: O3PartitionJob.updateParquetIndexes (after O3 insert rewrites), TableWriter.indexParquetPartition (during ALTER TABLE ALTER COLUMN ... ADD INDEX), and TableSnapshotRestore.rebuildTableFiles (during checkpoint restore). In all three sites the walk iterated zero bytes and wrote no index entries, resulting in an index that recognised no NULL rows. This fix detects the size == 0 convention and emits explicit null index entries for every row the row group covers, extracted into BitmapIndexUtils.addNullEntries so the convention is handled in one place.
The SQL validation endpoint (/api/v1/sql/validate) was not fully side-effect free: it still enforced permissions, ran some statements inline during compile, and emitted query-progress log lines plus error metrics. This fix makes validation-only compilation side-effect free. Inline execution is now skipped via fine-grained isValidationOnly() guards around individual side-effecting calls for REINDEX, TRUNCATE, VACUUM, RESUME/SUSPEND WAL, ALTER ... SET TYPE, ALTER VIEW/CREATE OR REPLACE VIEW, and COMPILE VIEW, while still running the full compile path to resolve target objects and check semantics. Authorization is bypassed through a new SecurityContext.asValidationContext() that returns a no-op allow-all view during validation. Logging and metrics are suppressed by having shouldLogSql() return false and QueryProgress.logError() return early in validation-only mode. Validating a statement against a non-existent table now correctly reports "table does not exist" instead of passing as syntactically valid. The compiled record cursor factory is freed for every non-SELECT statement and validationOnly is reset in a finally block to ensure the no-op security context view cannot outlive the request.
Adding a covering index via ALTER TABLE ... ALTER COLUMN <sym> ADD INDEX TYPE POSTING INCLUDE (...) threw a NullPointerException and suspended the table when the table's most recent partition was stored as Parquet. The indexLastPartition() method had no branch for Parquet partitions, so it attempted to index a Parquet last partition as if it had native column files, causing the covering seal() to dereference a FilesFacade that was never set for the Parquet path. This fix detects a Parquet last partition and delegates to indexParquetPartition(), the same routine already used by indexHistoricPartitions(). Non-WAL tables were unaffected since CONVERT PARTITION TO PARQUET skips the active partition, so a non-WAL table's last partition is always native.
Queries using SELECT ... LIMIT -N over a covering (posting) index with a residual filter silently returned the first N rows instead of the last N. The covering factory ignored the requested scan order when serving a page-frame cursor, always opening ascending partition frames. The parallel negative-limit machinery then collected the lowest-timestamp rows believing they were the highest. For single-key covering queries, getPageFrameCursor now honors descending order by iterating partitions latest-to-earliest and splitting row ranges into sub-frames emitted highest-first, preserving parallel filtering and enabling early termination for small N. For multi-key covering queries, which lack global timestamp ordering, the fix routes negative-limit queries to the serial FilteredRecordCursorFactory where LimitRecordCursorFactory computes last-N correctly via size plus skip. Bind-variable limits with unknown sign at compile time also take the serial path. Positive-limit and no-limit multi-key queries remain unaffected and still run in parallel.
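The "size plus skip" approach used on the serial path can be illustrated with a small sketch. The function name is hypothetical and the cursor is modelled as a plain Python iterator whose size is known up front.

```python
from itertools import islice


def last_n(rows, size: int, n: int) -> list:
    """Return the last n rows of an ascending cursor of known size."""
    skip = max(size - n, 0)  # rows to discard before the tail starts
    return list(islice(rows, skip, None))


timestamps = range(10)  # ascending scan order
print(last_n(iter(timestamps), 10, 3))  # [7, 8, 9]
```

Taking the first N rows of an ascending scan instead, as the covering factory effectively did, would return [0, 1, 2] for the same input.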
A query filtered on a POSTING-indexed SYMBOL column could return fewer rows than a full scan of the same data, with no error reported. This occurred after an out-of-order insert that failed before its commit became durable. The write was correctly rolled back, but a background cleanup step may have already deleted index files the recovered table still needed, causing the index to fall back to an older generation whose files were gone. This fix holds back index-file cleanup until the transaction that supersedes those files has durably committed. If the transaction fails or rolls back, the cleanup is discarded with it, so the previous index generation stays on disk and recovery falls back to it correctly. Operations that copy index files to a new location, like RENAME COLUMN or Parquet conversion, also receive a matching fix: they drop an uncommitted future index generation before linking, so the column points at the generation readers actually see rather than one that may be rolled back.
This fix enables rebuilding bricked or version-mismatched WAL sequencer _meta from _meta.0 and committed metadata sidecars. The transaction log's max metadata version is treated as the committed recovery boundary, so uncommitted sidecar tails are ignored and ahead-of-log _meta state is rolled back. During rebuild, the registry table token remains authoritative for table renames. RENAME_TABLE sidecars now only advance the recovered structure version, preventing abandoned or chained rename sidecars from changing the recovered table name.
Under concurrent query load with cross-query work-stealing, twap() and sparkline() could silently return wrong results during parallel GROUP BY. Both functions maintain per-slot native buffers of (key, value) entries and run a two-pointer merge at slot-combine time. The buffers are bound to slots rather than workers, and the cursor thread work-steals tasks across queries, allowing a single slot to accumulate frames in non-monotonic order. The two-pointer merge then ran on unsorted input and silently produced wrong values. This fix introduces SortedRunsMerge, a stable bottom-up pairwise mergesort over the natural sorted runs detected in the buffer via a single scan. TwapGroupByFunction.merge() and SparklineGroupByFunction.merge() now delegate to SortedRunsMerge.compactInto, while the read paths call SortedRunsMerge.compactInPlace to handle the case where no merge phase ran but work-stealing still left the buffer in multi-run state. The merge pass is allocation-free and adds only one extra linear scan for queries that never encounter the multi-run condition.
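The natural-runs idea can be sketched as below. This is a simplified model under stated assumptions: entries are (key, value) tuples in a Python list, the function names are hypothetical, and the sketch allocates intermediate lists, unlike the allocation-free merge pass described above.

```python
def find_runs(entries):
    """Single scan: return [start, end) bounds of maximal non-decreasing key runs."""
    runs = []
    start = 0
    for i in range(1, len(entries)):
        if entries[i][0] < entries[i - 1][0]:  # key decreased: a new run starts here
            runs.append((start, i))
            start = i
    if entries:
        runs.append((start, len(entries)))
    return runs


def merge_two(left, right):
    """Stable two-pointer merge: on equal keys the left (earlier) run wins."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if right[j][0] < left[i][0]:
            out.append(right[j])
            j += 1
        else:
            out.append(left[i])
            i += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def sorted_runs_merge(entries):
    """Bottom-up pairwise merge of the natural runs found in entries."""
    runs = [entries[s:e] for s, e in find_runs(entries)]
    while len(runs) > 1:
        merged = [merge_two(runs[k], runs[k + 1]) for k in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            merged.append(runs[-1])  # odd run out is carried to the next round
        runs = merged
    return runs[0] if runs else []


# A slot that received frames out of order: three natural runs.
buffer = [(1, "a"), (3, "b"), (5, "c"), (2, "d"), (3, "e"), (0, "f")]
print(sorted_runs_merge(buffer))
```

When the buffer is already sorted, find_runs returns a single run and no merging happens, which corresponds to the "one extra linear scan" cost mentioned above.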
Queries joining on SYMBOL plus UUID or DECIMAL keys could fail with an UnsupportedOperationException instead of returning results. This fix delegates the missing record accessors so these joins work correctly.

May 18, 2026

QuestDB 9.4.0 introduces a compact, high-performance posting and covering index for SYMBOL columns, a local parquet metadata sidecar that unlocks row-group pruning, parallelised SAMPLE BY FILL with new cross-column FILL(PREV) syntax, three new window functions, and sparkline() / bar() text visualisations. It also delivers meaningful GROUP BY / hash-join speed-ups and fixes a number of correctness issues across the SQL planner, the WAL apply path, and the PGWire protocol.

New Features

This feature introduces a new posting index format for symbol columns with optional covering index support, storing selected column values in sidecar files alongside the posting list. The index can be created inline, out-of-line, or added via ALTER TABLE, for example: CREATE TABLE t (ts TIMESTAMP, sym SYMBOL INDEX TYPE POSTING INCLUDE (price, qty), price DOUBLE, qty INT) TIMESTAMP(ts) PARTITION BY DAY;. Row IDs are encoded per-key using delta + Frame-of-Reference (FoR64) bitpacking with adaptive selection between delta and flat modes, with native AVX2 decode for common bitwidths and a Java scalar fallback. When a query selects only the indexed symbol and INCLUDE columns, data is read directly from sidecar files without touching column data files. All column types are supported in INCLUDE, with ALP compression for DOUBLE/FLOAT, FoR bitpacking for integers, and FSST compression for VARCHAR/STRING. Supported query shapes include WHERE sym = 'X', WHERE sym IN (...), bind variables, LATEST ON ts PARTITION BY sym, and SELECT DISTINCT sym. Hints /*+ no_covering */ and /*+ no_index */ allow disabling the covering and index paths. Configuration properties include cairo.posting.index.row.id.encoding (adaptive/ef/delta) and cairo.posting.index.auto.include.timestamp. Benchmarks show the posting index is approximately 13.6x smaller and 1.3-1.5x faster for reads than BITMAP, at a ~9% write regression. The CAPACITY clause remains valid only for BITMAP indexes.
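To give an intuition for why delta plus Frame-of-Reference encoding shrinks row-ID lists, here is a minimal sketch. It is not the on-disk format: the function names are hypothetical, only the delta mode is modelled (not the adaptive choice between delta and flat), and the residuals are kept as a Python list rather than actually bit-packed.

```python
def encode_delta_for(row_ids):
    """Encode a non-empty ascending row-ID list as (first, base, bit_width, residuals)."""
    first = row_ids[0]
    deltas = [b - a for a, b in zip(row_ids, row_ids[1:])]
    if not deltas:
        return first, 0, 0, []
    base = min(deltas)  # the frame of reference
    residuals = [d - base for d in deltas]
    bit_width = max(residuals).bit_length()  # bits needed per packed residual
    return first, base, bit_width, residuals


def decode_delta_for(first, base, residuals):
    """Rebuild the row IDs by re-adding the base and accumulating the deltas."""
    out = [first]
    for r in residuals:
        out.append(out[-1] + base + r)
    return out


row_ids = [100, 103, 106, 110, 113]
first, base, bit_width, residuals = encode_delta_for(row_ids)
print(base, bit_width, residuals)  # 3 1 [0, 0, 1, 0]
print(decode_delta_for(first, base, residuals) == row_ids)  # True
```

In this example each residual fits in 1 bit, whereas the raw row IDs would each need a full 64-bit slot.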
This feature introduces a compact binary _pm sidecar file that accompanies each data.parquet partition file, storing all metadata the query engine needs: column descriptors, QuestDB column types, per-row-group column chunk byte ranges, compression codecs, encodings, and min/max statistics. It replaces the JSON metadata blob previously embedded in the parquet footer's key-value section with a purpose-built binary format and enables row group pruning via locally-stored min/max statistics and bloom filter offsets without reading the parquet file itself. The format supports lock-free concurrent read/write via MVCC footer chaining, with the parquet file size in _txn field 3 serving as the version token. This lays the groundwork for cold storage, where the _pm file stays local and provides everything the query planner needs to decide which column chunks to fetch by byte range, eliminating metadata round-trips. Migration Mig940 generates _pm files for all existing parquet partitions on engine upgrade in a non-destructive manner. If a user rolls back to an older version, modifies parquet data, and re-upgrades, the migration can be re-run by setting cairo.repeat.migration.from.version in server.conf. SHOW PARTITIONS and ParquetMetaPartitionDecoder use the _pm file to extract metadata and decode row groups without parsing the parquet footer.
This feature moves SAMPLE BY FILL queries from the sequential cursor path onto QuestDB's parallel GROUP BY fast path where the fill mode is supported. The optimizer rewrites SAMPLE BY to GROUP BY timestamp_floor_utc(...), and a unified streaming fill cursor inserts gap-filled rows above the sorted GROUP BY output. New FILL(PREV(col_ref)) syntax enables cross-column PREV, where any slot in the per-column fill list may reference the previous value of another output column instead of its own. For example, a candlestick query can carry the prior bucket's close into the next bucket's open: SELECT ts, first(price) AS open, last(price) AS close FROM trades SAMPLE BY 1h FILL(PREV(close), PREV);. The source column must exist in the SELECT list with compatible types (full-type equality for DECIMAL, GEOHASH, ARRAY, TIMESTAMP, and INTERVAL), and cannot be the designated timestamp, another PREV reference, or a SYMBOL column. This improvement also fixes several issues: infinite fill loop with ALIGN TO CALENDAR WITH OFFSET + FILL without TO, sub-day SAMPLE BY + TIME ZONE + FROM/TO grid misalignment, SAMPLE BY FILL rejection on pre-1970 timestamps, and several memory leaks in shared sort infrastructure. FILL(LINEAR) and ALIGN TO FIRST OBSERVATION remain on the cursor path. Behavior change: FILL queries with FROM now apply effectiveOffset = FROM + OFFSET, unifying with non-FILL SAMPLE BY semantics.
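The cross-column PREV semantics in the candlestick example can be modelled in a few lines. This is a sketch of the behavior as described, not the fill cursor itself: the function name is hypothetical, and it assumes a gap before any data yields NULL (None) for both columns.

```python
def fill_prev_close(buckets, grid):
    """Model FILL(PREV(close), PREV) for the columns (open, close).

    buckets maps bucket timestamp -> (open, close) for buckets that have data;
    grid is the full, sorted list of bucket timestamps.
    """
    out = []
    prev_close = None  # assumption: leading gaps fill with NULL
    for ts in grid:
        if ts in buckets:
            open_, close = buckets[ts]
        else:
            # open takes the previous close (cross-column PREV); close takes its own previous value
            open_, close = prev_close, prev_close
        out.append((ts, open_, close))
        prev_close = close
    return out


data = {0: (10.0, 12.0), 2: (13.0, 15.0)}
for row in fill_prev_close(data, [0, 1, 2, 3]):
    print(row)
# (0, 10.0, 12.0)
# (1, 12.0, 12.0)
# (2, 13.0, 15.0)
# (3, 15.0, 15.0)
```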
This feature adds three new window functions. NTILE(n) distributes rows of an ordered partition into n approximately equal buckets and returns the 1-based bucket number. CUME_DIST() returns the cumulative distribution: rows at or before the current row (including peers) divided by the total number of rows in the partition. NTH_VALUE(expr, n) returns the n-th value (1-based) within the current window frame, or NULL when n exceeds the frame size. All three support PARTITION BY and ORDER BY, and NTH_VALUE supports ROWS and RANGE frames (bounded and unbounded). In this initial release, NTH_VALUE accepts only a DOUBLE first argument; LONG and TIMESTAMP overloads follow in a separate change. NTH_VALUE requires n to be a compile-time constant and rejects IGNORE NULLS / RESPECT NULLS, FROM FIRST, and FROM LAST. NTILE and CUME_DIST run in two passes and reject explicit ROWS / RANGE / GROUPS frame clauses. NTH_VALUE RANGE ... CURRENT ROW follows the QuestDB convention of not looking ahead to peer rows, which diverges from the SQL standard / PostgreSQL on tied ORDER BY values.
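The three definitions can be made concrete with a small reference model over one already-ordered partition. The function names are hypothetical, and the NTILE sketch assumes the usual SQL convention that the first (rows mod n) buckets receive one extra row; the release notes only say "approximately equal".

```python
def ntile(num_rows: int, n: int) -> list:
    """1-based bucket number for each row of an ordered partition."""
    base, extra = divmod(num_rows, n)
    out = []
    for bucket in range(1, n + 1):
        size = base + (1 if bucket <= extra else 0)  # assumption: larger buckets first
        out.extend([bucket] * size)
    return out


def cume_dist(sorted_keys: list) -> list:
    """Rows at or before the current row (peers included) over total rows."""
    total = len(sorted_keys)
    return [sum(1 for other in sorted_keys if other <= k) / total for k in sorted_keys]


def nth_value(frame: list, n: int):
    """n-th value (1-based) of the frame, or None (NULL) when n exceeds the frame size."""
    return frame[n - 1] if 1 <= n <= len(frame) else None


print(ntile(7, 3))                     # [1, 1, 1, 2, 2, 3, 3]
print(cume_dist([10, 20, 20, 30]))     # [0.25, 0.75, 0.75, 1.0]
print(nth_value([1.5, 2.5], 2))        # 2.5
print(nth_value([1.5], 2))             # None
```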
This feature extends the NTH_VALUE window function with LONG-returning and TIMESTAMP-returning factories, mirroring the routing of the DOUBLE variant: per-partition / whole-partition / ROWS / RANGE bounded and unbounded frames, plus the lock-in fast path for UNBOUNDED PRECEDING ... K PRECEDING and the current-row 1-row frame. The TIMESTAMP variant overrides getType() to return the argument's type so TIMESTAMP_MICROS and TIMESTAMP_NANOS subtypes propagate intact. Both factories use Numbers.LONG_NULL as the sentinel for cases where n exceeds the current frame size and for genuine NULL inputs. As a side effect of adding multiple overloads, arity validation moves from the per-factory body into the overload resolver, so a call with the wrong number of arguments now produces a "no matching function" error instead of a "wrong number of arguments" error.
This feature introduces two new functions for inline text visualization. The sparkline() aggregate function collects numeric values within a GROUP BY and renders them as a Unicode trend line using block characters (▁▂▃▄▅▆▇█), pairing naturally with SAMPLE BY to show intra-bucket trends. It supports auto-scaling, explicit min/max bounds, clamping, NULL handling, and width-based sub-sampling. The bar() scalar function renders a single numeric value as a horizontal bar (▏▎▍▌▋▊▉█) proportional to a given range, working with aggregates like sum() and window functions like min() OVER () for auto-scaling, with fractional block precision of 8 levels per character. Both functions return VARCHAR and work across all clients including psql, Web Console, JDBC, and CSV. Output size is bounded by cairo.sql.string.function.buffer.max.size.
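The rendering idea behind both functions can be approximated as follows. This is a sketch under assumptions, not QuestDB's implementation: the rounding rule, the rendering of NULL as a space, and the omission of width-based sub-sampling are all choices made here for illustration.

```python
SPARK = "▁▂▃▄▅▆▇█"
BAR = "▏▎▍▌▋▊▉█"


def sparkline(values, lo=None, hi=None) -> str:
    """Map each value onto one of 8 block heights, auto-scaling unless bounds are given."""
    present = [v for v in values if v is not None]
    if not present:
        return ""
    lo = min(present) if lo is None else lo
    hi = max(present) if hi is None else hi
    span = hi - lo
    out = []
    for v in values:
        if v is None:
            out.append(" ")  # assumption: NULL renders as a blank
            continue
        clamped = min(max(v, lo), hi)
        idx = 0 if span == 0 else int((clamped - lo) / span * 7 + 0.5)
        out.append(SPARK[idx])
    return "".join(out)


def bar(value, lo, hi, width: int) -> str:
    """Render value as a horizontal bar with 8 fractional levels per character."""
    if hi <= lo:
        return ""
    frac = min(max((value - lo) / (hi - lo), 0.0), 1.0)
    eighths = int(frac * width * 8 + 0.5)
    full, rem = divmod(eighths, 8)
    return BAR[7] * full + (BAR[rem - 1] if rem else "")


print(sparkline([0, 1, 2, 3, 4, 5, 6, 7]))
print(bar(50, 0, 100, 4))
```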
This feature propagates PARQUET_ENCODING(...) for pass-through projected columns through the streaming Parquet export path, so COPY (SELECT ...) TO parquet and /exp preserve the configured encoding instead of silently falling back to defaults. Previously, projected query metadata dropped parquetEncodingConfig for pass-through columns, causing the Parquet writer to only see the default encoding when running through the streaming/page-frame path. The fix is intentionally scoped to pass-through columns; computed columns still derive their Parquet behavior from the projected type rather than the source column's encoding config. The writer-side encoder layout in parquet_write has also been cleaned up to mirror parquet_read more closely, with one top-level dispatch entrypoint and smaller encoding-family modules under parquet_write/encoders/.
This feature generates a starter CREATE MATERIALIZED VIEW statement from a table or an existing materialized view and inserts it into the editor for the user to tweak. From a table, the menu item is disabled for non-WAL tables and tables without a designated timestamp. The view name follows <table>_<sample>, replacing any existing period suffix (e.g. my_table_5m → my_table_1h). REFRESH IMMEDIATE, SAMPLE BY, PARTITION BY, and TTL are inferred per QuestDB's default inference. Column aggregates are chosen as sum or last by name pattern, and columns whose types lack a LAST() overload (BINARY, LONG128, INTERVAL, arrays) are dropped. From a materialized view (downsample), SAMPLE BY is stepped one ladder rung up; WITH BASE is re-rooted at the source mat view; aggregate args are rewritten to layer-1 aliases; COUNT() becomes SUM; COUNT(DISTINCT …) is dropped; non-decomposable aggregates fall back to LAST(); WHERE, GROUP BY, and LATEST ON are stripped; REFRESH, PERIOD, and OWNED BY are preserved; and TTL is stepped one rung up. Generation errors surface through the toast instead of being swallowed.

Improvements

This improvement introduces batched aggregate dispatch for the parallel keyed GROUP BY path. Instead of per-row virtual dispatch through GroupByFunctionsUpdater, the reducer splits each page frame into sub-batches (2048 rows by default, configurable via cairo.sql.parallel.groupby.batch.size) and processes them in two phases: a probe phase that finds or creates map entries and packs row index plus entry offset into a scratch buffer, and an update phase that calls computeKeyedBatch() once per function per sub-batch. Hot functions including count, sum, min, max, avg, bit_and, bit_or, and bit_xor override with tight Unsafe loops that skip the per-row MapValue dispatch. Benchmarks show speedups ranging from 1.16x to 4.60x on single-column count() queries, with per-function isolation tests showing 2.71x-3.03x improvements across aggregates. This also fixes a pre-existing data-correctness bug in count(uuid) where CountUuidGroupByFunction read the high 64 bits of the UUID argument twice, causing any UUID whose high half equals Long.MIN_VALUE to be silently excluded from the count.
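The two-phase structure can be sketched in miniature. This is a conceptual model only: the function name is hypothetical, a Python dict stands in for the hash map, and count and sum stand in for the per-function computeKeyedBatch() calls.

```python
def batched_count_sum(keys, values, batch_size=2048):
    """Keyed count() and sum() computed in probe/update phases per sub-batch."""
    slots = {}  # key -> index into the aggregate arrays
    counts, sums = [], []
    for start in range(0, len(keys), batch_size):
        end = min(start + batch_size, len(keys))

        # Phase 1 (probe): find or create the entry for each row, remember (row, slot).
        scratch = []
        for row in range(start, end):
            slot = slots.get(keys[row])
            if slot is None:
                slot = len(counts)
                slots[keys[row]] = slot
                counts.append(0)
                sums.append(0)
            scratch.append((row, slot))

        # Phase 2 (update): one tight loop per aggregate over the whole sub-batch.
        for _, slot in scratch:
            counts[slot] += 1
        for row, slot in scratch:
            sums[slot] += values[row]

    return {key: (counts[slot], sums[slot]) for key, slot in slots.items()}


print(batched_count_sum(["a", "b", "a", "c", "b", "a"], [1, 2, 3, 4, 5, 6], batch_size=2))
# {'a': (3, 10), 'b': (2, 7), 'c': (1, 4)}
```

The point of the split is that each aggregate runs over a contiguous scratch buffer instead of being dispatched once per row, which is what allows the tight loops described above.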
This improvement consolidates QuestDB's non-Murmur 64-bit hash paths onto a single finalizer derived from xxHash3's 64-bit avalanche, and raises the load factor of the affected hash tables from 0.5 to 0.7. The new xxh3Avalanche64 mixer is roughly 1.7x faster than fmix64 in latency mode while matching its quality tier, allowing the removal of the weaker fastHashInt64 and fastHashLong64 FxHasher-based mixers. The six former fastHash* callers had 0.5 load factors chosen to absorb FxHasher's weak avalanche; with the stronger mixer they run denser at the project's standard 0.7. The cairo.sql.count.distinct.load.factor default is also raised. An incidental fix corrects DirectLongHashSet.rehash() arithmetic that was leaving the set operating past its configured load factor. ClickBench shows a 1.21% total time improvement across 43 queries, with single-column count() queries on Unordered4Map/Unordered8Map keys improving up to 25.7%.
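For reference, the two finalizers being compared look like this in their upstream forms. The constants below are taken from the public xxHash (XXH3 avalanche) and MurmurHash3 (fmix64) reference implementations; QuestDB's xxh3Avalanche64 is described only as "derived from" the former, so it may differ in detail, and the Python function names are illustrative.

```python
MASK64 = (1 << 64) - 1


def xxh3_avalanche64(h: int) -> int:
    """XXH3-style 64-bit avalanche: xorshift, one multiply, xorshift."""
    h &= MASK64
    h ^= h >> 37
    h = (h * 0x165667919E3779F9) & MASK64
    h ^= h >> 32
    return h


def fmix64(k: int) -> int:
    """MurmurHash3 64-bit finalizer: three xorshifts around two multiplies."""
    k &= MASK64
    k ^= k >> 33
    k = (k * 0xFF51AFD7ED558CCD) & MASK64
    k ^= k >> 33
    k = (k * 0xC4CEB9FE1A85EC53) & MASK64
    k ^= k >> 33
    return k


for key in (1, 2, 3):
    print(hex(xxh3_avalanche64(key)), hex(fmix64(key)))
```

The XXH3 form needs one multiplication where fmix64 needs two, which is consistent with the latency advantage reported above. Both are bijections on 64-bit values, so distinct keys never collide before the table index is taken.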
This improvement extends the parallel top-K gate in SqlCodeGenerator#generateOrderBy so it also fires when a column-projection wrapper (SelectedRecordCursorFactory or VirtualRecordCursorFactory) sits between the ORDER BY ... LIMIT N and the filtered page-frame scan. Previously, any non-literal SELECT list combined with a WHERE clause forced the query onto the generic Sort light path, materializing and sorting every matching row instead of keeping a bounded heap of N. The implementation adds two default methods on RecordCursorFactory — translateOrderByColumnToBase and rewrapOverTopK — which projection wrappers override to peel themselves, allow top-K to apply to the inner factory, and re-wrap the output. An incidental fix adds explicit getColumnCrossIndex() != null guards in generateJoinAsof's slave-projection peels to prevent silently dropping the VirtualRecord layer's column translation.
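The difference between the generic sort path and a bounded top-K heap can be illustrated as below. This is a generic sketch of the technique, not the engine's code; the function name is hypothetical and it models ORDER BY key ASC LIMIT k on a single thread.

```python
import heapq


def top_k_bounded(rows, k: int, key):
    """Return the k smallest rows by key while holding at most k rows in memory."""
    if k <= 0:
        return []
    heap = []  # min-heap of (-key, -index, row): its root is the current worst candidate
    for index, row in enumerate(rows):
        heapq.heappush(heap, (-key(row), -index, row))
        if len(heap) > k:
            heapq.heappop(heap)  # evict the largest key (latest row on ties)
    # Largest tuple first means smallest key first, earliest row first on ties.
    return [row for _, _, row in sorted(heap, reverse=True)]


rows = [5, 1, 4, 1, 3]
print(top_k_bounded(rows, 3, key=lambda r: r))  # [1, 1, 3]
print(sorted(rows)[:3])                         # [1, 1, 3], but materializes every row
```

The heap never grows beyond k + 1 entries, whereas the sort-then-limit path holds and orders every matching row before discarding all but N of them.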
Comparing a TIMESTAMP column to a runtime-constant value (for example a string bind variable sent over PostgreSQL Wire Protocol) used to re-parse that value on every row, which could dominate CPU time on large scans via repeated NumericException throws inside the implicit string-to-timestamp cast. This improvement evaluates the constant or bind-variable side once at query init, regardless of precision or which side of = it appears on. There is a small behavior change: some EXPLAIN plans will show timestamp equality arguments swapped, for example instead of filter: (ts=123) there will be filter: (123=ts). This is acceptable given EXPLAIN plans are not APIs and have no stability guarantee.
Lateral join decorrelation previously duplicated the outer query model for each correlated reference. This improvement replaces deep-cloning with lightweight QueryModelWrapper references that share the same underlying model, and introduces SharedRecordCursorFactory so the shared model executes only once at runtime. The wrapper/shared-cursor mechanism is designed as a general framework for materializing common table expressions, allowing future CTE and view support to reuse it to eliminate duplicate execution across multiple references. Shared cursor support covers all GROUP BY factory variants, including vectorized and async variants, as well as pass-through via SelectedRecordCursorFactory.
This improvement replaces JSON structured output with plain text responses plus a suggest_query tool for SQL suggestions, and adds thinking/reasoning content display for reasoning models. All JSON response format schemas, custom provider JSON parsing/repair logic, and extractPartialExplanation have been removed, along with the responseStart, explanation, and contentFragments fields from conversation messages. Full tool call and result history is now included in conversations, giving the model context of prior tool interactions across turns, and assistant responses are grouped with their tool calls in the UI on a turn basis. The max response token limit has been raised to 64,000 for Anthropic models, and summary generation for context compaction now uses streaming. Issues around running queries from the chat window when the editor is unmounted and query key mismatches between chat and editor runs have been resolved, unnecessary rerenders and layout issues have been prevented, and cancellation/abort handling has been added for queries run from the AI Chat Window.

Bug Fixes

WAL apply could store the bitmap index max row as the exclusive end when appending to an indexed SYMBOL column in a non-last partition. The next apply then treated the index as one row ahead and rolled it back before indexing new rows. For users with high-cardinality indexed symbols, this added unnecessary index scan work and could increase WAL apply latency. This fix stores the bitmap index max row as an inclusive row id and clarifies the row-id adjustment in the O3 index update path.
Malformed SQL with inner queries containing an empty LIMIT clause, such as DECLARE @pair := (SELECT symbol FROM fx_trades LIMIT ) SELECT ..., could trigger an internal error. The parser was allowing a nested expression parse to consume operands from the outer declaration expression, corrupting the assignment AST before parseDeclare() inspected it. This fix isolates ExpressionTreeBuilder operand stack frames across reentrant parseExpr() calls, so nested parses cannot consume outer operands, and rejects an empty LIMIT clause with a clear parser error: 'limit expression expected'.
This fix resolves an internal error in ASOF/LT joins when a subquery shifts the designated timestamp with dateadd() and that timestamp is later pruned from the visible projection. For example, an ASOF JOIN with a subquery selecting dateadd('s', -30, timestamp) AS timestamp could have the optimizer remove the shifted timestamp from the output while join planning still required it, causing timestamp metadata to be lost and query compilation to fail with an AssertionError. This fix preserves and restores the hidden derived timestamp needed by the join, resolves timestamp aliases case-insensitively after pruning, and validates restored timestamp expressions.
This fix tightens the entity check in generateSelectChoose so that a select-choose model with an explicit timestamp(...) clause is no longer incorrectly elided when projection columns are renamed. Previously, queries that renamed a designated timestamp column (e.g. timestamp AS ts) inside a CTE combined with SAMPLE BY could fail with SqlException: Invalid column: ts or trigger an internal assertion error. The fix now requires the child column count to match the projection count and verifies each projected column's token and alias align with the child's column name before treating the model as an entity.
This fix corrects the bucket timestamp emitted by SAMPLE BY 1d ALIGN TO CALENDAR TIME ZONE on the DST-start day when the preceding day had no data. Previously, in AbstractNoRecordSampleByCursor.nextSamplePeriod, after adjustDst promoted tzOffset to the post-transition value, the existing compensation subtracted the whole current offset rather than the delta between the current and pre-transition offsets, causing the bucket to be back-converted with the wrong side of the transition. The fix uses the offset valid at the bucket boundary and shifts localEpoch by the delta between the cursor's current tzOffset and that boundary offset, mirroring timestamp_floor_utc. Aggregation results were already correct; only the emitted bucket-start timestamp was wrong.
This fix hardens several code paths in the PostgreSQL Wire Protocol implementation that previously trusted length and dimension fields read from the wire without validation. In the pre-authentication handshake, PGCleartextPasswordAuthenticator accepted any signed int as msgLen, allowing negative values to drive pointer arithmetic outside the receive buffer; lengths are now required to sit between the protocol minimum and the receive buffer capacity. DefaultPGCircuitBreakerRegistry.cancel had an off-by-one comparison and read registry state outside the spin lock; the cancel path now runs under the lock and rejects the -1 sentinel. In the post-authentication extended query protocol, array dimension multiplication in PGNonNullBinaryArrayView and PGNonNullVarcharArrayView could wrap flatViewLength via signed integer overflow; these now use Math.multiplyExact, and PGPipelineEntry validates valueSize, dimension counts, and dimension sizes before each field read.
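The overflow guard on array dimensions can be illustrated with a short Python sketch. Python integers never wrap, so the signed 32-bit limit is checked explicitly, and a second helper emulates what an unchecked 32-bit multiply would produce. The function names and the limit constant are illustrative assumptions, not QuestDB code.

```python
INT32_MAX = 2**31 - 1


def checked_flat_length(dim_sizes):
    """Multiply dimension sizes, failing instead of wrapping past a signed 32-bit int."""
    total = 1
    for size in dim_sizes:
        if size < 0:
            raise ValueError(f"negative dimension size: {size}")
        total *= size
        if total > INT32_MAX:
            raise OverflowError("flat array length exceeds signed 32-bit range")
    return total


def wrapped_flat_length(dim_sizes):
    """Emulate an unchecked 32-bit multiply (two's-complement wrap-around)."""
    total = 1
    for size in dim_sizes:
        total = (total * size) & 0xFFFFFFFF
    return total - 2**32 if total >= 2**31 else total
```

With dimensions 65536 x 65536 the unchecked product wraps to 0, which looks like a harmless empty array; the checked version rejects the input instead.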
This fix addresses window functions nested inside arithmetic expressions (e.g. avg(x) - avg(x) OVER ()) which silently produced wrong results when combined with GROUP BY. The existing check in SqlOptimiser.rewriteSelect0 only inspected status flags on top-level window columns and missed window functions buried inside operators, function calls, or CASE branches. A new findWindowFunctionOutsideAggregatePos walker traverses each SELECT column's AST and returns the position of the first window function not nested inside an aggregate, skipping aggregate subtrees to keep shapes like max(avg(x) OVER (...)) legal while rejecting unrepresentable windows mixed with GROUP BY. An additional findInvalidAggregateOverWindowPos guard rejects wrapping an aggregate-over-window in operators or other terms with a clear error pointing users to use a sub-query. The walker also handles pure window function names used without OVER (e.g. row_number()) that previously fell through to a cryptic runtime error.
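The walker described above can be sketched as a recursive traversal that stops descending at aggregate calls. This is a simplified model under stated assumptions: the dictionary-based AST, the AGGREGATES set, and the function name are hypothetical and only mirror the behaviour the entry describes.

```python
# Hypothetical set of aggregate names, for illustration only.
AGGREGATES = {"avg", "sum", "min", "max", "count"}


def find_window_outside_aggregate(node):
    """Return the position of the first window function not nested in an aggregate, or -1."""
    kind = node["kind"]
    if kind == "window":
        return node["pos"]
    if kind == "call" and node["name"] in AGGREGATES:
        # Skip aggregate subtrees so max(avg(x) OVER (...)) stays legal.
        return -1
    for child in node.get("args", []):
        pos = find_window_outside_aggregate(child)
        if pos != -1:
            return pos
    return -1


# avg(x) - avg(x) OVER (): the window function starts at character 9.
mixed = {"kind": "op", "name": "-", "pos": 7, "args": [
    {"kind": "call", "name": "avg", "pos": 0, "args": [{"kind": "literal", "name": "x", "pos": 4}]},
    {"kind": "window", "name": "avg", "pos": 9, "args": [{"kind": "literal", "name": "x", "pos": 13}]},
]}

# max(avg(x) OVER ()): the window function is an aggregate argument.
nested = {"kind": "call", "name": "max", "pos": 0, "args": [
    {"kind": "window", "name": "avg", "pos": 4, "args": [{"kind": "literal", "name": "x", "pos": 8}]},
]}
```

Here `find_window_outside_aggregate(mixed)` reports position 9, while `find_window_outside_aggregate(nested)` returns -1 because the window function sits inside an aggregate.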
This fix resolves queries like max(avg(x) OVER (...)) GROUP BY category — a window function used as the argument of an aggregate combined with explicit GROUP BY — which previously failed with a confusing Invalid column error. The rewrite pipeline inserts an inner window model between groupByModel and translatingModel whenever an aggregate argument contains a nested window function, but the existing pass-through logic only propagated columns from the top-level window model, leaving the inner window model without the GROUP BY keys or base columns referenced from aggregate arguments. The fix adds a dedicated pass after inner window models are populated, walking groupByModel's columns and later inner window models' OVER clauses, and propagates referenced literals through the chain with an owner-index filter. Coverage includes simple literal keys, positional references, expression keys like GROUP BY upper(cat), multiple keys, multiple aggregates with distinct nested windows, and PARTITION BY on columns outside the GROUP BY set.
The WalPurgeJob could process a table token snapshot after the table had been renamed and the original name reused, resulting in a stale-token exception that crashed server-main. This fix treats the resulting exception as a retryable purge race and leaves the current registry state for the next purge pass.
The parser previously misinterpreted an AND inside a sub-expression as BETWEEN's AND operator, causing queries such as SELECT 1 BETWEEN (1 and 2) and 3 or SELECT 1 BETWEEN ARRAY[0 AND 1] AND 2 to fail. This fix tracks the scope depth when BETWEEN is encountered and uses the current scope depth to determine whether a subsequent AND belongs to BETWEEN or to the sub-expression. The same approach generalises the previous special-case handling for BETWEEN-CASE.
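The scope-depth idea can be sketched over a pre-tokenized expression. This is a minimal illustration, assuming parentheses and square brackets are the only scope delimiters; the function name and the token list format are hypothetical and do not reflect the actual parser.

```python
def classify_ands(tokens):
    """Label each AND as the BETWEEN separator or as a boolean operator."""
    depth = 0
    open_betweens = []  # scope depth at which each unresolved BETWEEN was seen
    kinds = []
    for tok in tokens:
        word = tok.upper()
        if tok in ("(", "["):
            depth += 1
        elif tok in (")", "]"):
            depth -= 1
            # Drop any BETWEEN that was opened inside the scope just closed.
            while open_betweens and open_betweens[-1] > depth:
                open_betweens.pop()
        elif word == "BETWEEN":
            open_betweens.append(depth)
        elif word == "AND":
            if open_betweens and open_betweens[-1] == depth:
                open_betweens.pop()
                kinds.append("between")
            else:
                kinds.append("boolean")
    return kinds
```

For `1 BETWEEN (1 and 2) and 3`, the first AND is seen at depth 1 while BETWEEN was recorded at depth 0, so it is treated as a boolean operator; the second AND is back at depth 0 and closes the BETWEEN.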
A JIT-compiled filter using a UUID bind variable alongside any other bind variable produced wrong results — usually zero rows, sometimes spurious matches — because the JIT addressed bind-variable slots at an 8-byte stride while the Java side wrote 16 bytes for a UUID. With a UUID slot followed by another bind, the second 8 bytes of the UUID shadowed the next bind, which then decoded garbage. The bind-variable area now uses a fixed 16-byte stride end-to-end: UUID slots are unchanged, and non-UUID slots get an 8-byte zero pad. Filters with a single UUID bind variable were already correctly addressed and continue to work.
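The stride mismatch can be reproduced with a small byte-buffer sketch. It assumes a writer that places every bind in a 16-byte slot (a UUID fills the slot, other values occupy the first 8 bytes followed by zero padding) and a reader that addresses slots at a configurable stride. The helper names and the little-endian layout are assumptions made for illustration.

```python
import struct
import uuid


def write_binds(binds, stride=16):
    """Lay out bind variables in fixed-size slots; non-UUID slots are zero padded."""
    buf = bytearray(stride * len(binds))
    for i, value in enumerate(binds):
        if isinstance(value, uuid.UUID):
            raw = value.bytes  # 16 bytes
        else:
            raw = struct.pack("<q", value)  # 8 bytes
        buf[i * stride:i * stride + len(raw)] = raw
    return bytes(buf)


def read_long(buf, slot, stride):
    """Read a 64-bit integer from the given slot using the reader's stride."""
    return struct.unpack_from("<q", buf, slot * stride)[0]


key = uuid.UUID(int=0x0123456789ABCDEF_FEDCBA9876543210)
area = write_binds([key, 42])
```

Reading slot 1 with `read_long(area, 1, 16)` yields 42, while `read_long(area, 1, 8)` lands on the second half of the UUID and decodes an unrelated number.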
This fix addresses several independent SQL planner issues. An ON predicate where both sides reference the slave table of a LEFT JOIN or ASOF JOIN (for example y.a = y.b) was silently discarded; the predicate is now routed to the join's outer-join expression clause. The boolean NOT optimiser skipped the UNION branch when a model contained both a union branch and a nested model, so NOT (a > b) was only rewritten on one side of a UNION ALL; it now descends into both. The SqlParser tableNamePositions and tableNames fields were promoted from static to instance fields so concurrent parser instances do not share mutable state. A potential double-free in SqlCodeGenerator.compileFilter was also removed.
Valid window queries using EMA, VWEMA, or KSUM could fail when the planner chose cached window execution. These functions already worked in streaming execution, but some query shapes require cached execution, for example when an incremental window function is used together with another window expression. In those cases, users could hit an execution failure even though the SQL was valid.
This fix corrects tables().table_write_amp_p50/p90/p99/max and tables().wal_dedup_row_count_since_start which could report values much larger than the per-commit log line for the same workload, sometimes by several orders of magnitude, when a WAL apply job interleaved data writes with non-data transactions. Two TableWriter counters (physicallyWrittenRowsSinceLastCommit and dedupRowsRemovedSinceLastCommit) were only reset on certain branches, so iterations taking non-resetting branches re-read the previous iteration's value when accumulating physical row counts and dedup counts. The fix resets both counters at the start of every processWalCommit call so each iteration's reads only see that iteration's writes, regardless of branch. As a side effect, the wal_apply_physically_written_rows Prometheus counter no longer over-counts on skip/no-op iterations, and sys.telemetry_wal.physicalRowCount is no longer attributed to skipped transactions.
After an Execute suspends a portal, its cursor is intentionally retained so the next Execute on the same portal can resume from the same row. However, when the client sent a Sync between that suspended Execute and a follow-up Bind for the same statement, the existing pre-lookup guard in msgBind no longer fired, and the subsequent lookup re-introduced the suspended entry with its cursor still alive. As a result, msgExecuteSelect skipped acquiring a fresh cursor and returned a partial result set. This fix adds a post-lookup guard in msgBind that always closes any suspended cursor on the entry being rebound before starting a fresh execution, matching PostgreSQL protocol semantics. Close, Describe, and Execute on unrelated entities continue to preserve suspended cursors.
When ARG_MAX() or ARG_MIN() was applied to a TIMESTAMP_NS column, the returned value was rendered with microsecond precision, producing dates thousands of years in the future (e.g. 54977-04-25T06:29:47.654321Z instead of 2023-01-03T12:34:56.987654321Z). The underlying long was a nanosecond value but the function reported its result type as TIMESTAMP (microseconds), causing the formatter to scale it incorrectly. This fix updates the six affected ARG_MAX/ARG_MIN variants to derive the timestamp type from the input column via ColumnType.getTimestampType(), mirroring the existing pattern used by MAX() and FIRST(). The internal storage layout is unchanged; only the reported column type and resulting display formatting differ. There is no behavioural change for TIMESTAMP (micros) inputs.
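The size of the error follows directly from the unit mix-up, as the arithmetic sketch below shows. It uses the nanosecond value of the example timestamp from this entry; the helper name and the use of a mean Gregorian year for the year estimate are illustrative choices.

```python
from datetime import datetime, timezone

# 2023-01-03T12:34:56.987654321Z expressed as nanoseconds since the epoch.
NANOS = 1_672_749_296_987_654_321

SECONDS_PER_MEAN_YEAR = 31_556_952  # 365.2425 days


def approx_year(epoch_value, units_per_second):
    """Estimate the calendar year for an epoch value given its assumed unit."""
    seconds = epoch_value // units_per_second
    return 1970 + seconds // SECONDS_PER_MEAN_YEAR


as_nanos = approx_year(NANOS, 1_000_000_000)   # correct unit
as_micros = approx_year(NANOS, 1_000_000)      # value misread as microseconds
correct = datetime.fromtimestamp(NANOS // 1_000_000_000, tz=timezone.utc)
```

Read as nanoseconds the value falls in 2023; read as microseconds the same long lands in the year 54977, matching the symptom described above.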
This fix resolves a rare live lock where the WAL apply thread could spin indefinitely in WalTxnDetails.readObservableTxnMeta when reading transaction metadata. The issue occurred in an edge-case commit pattern where transactions were added very quickly, causing the metadata read loop to always find new entries and never complete.
When TableSequencerImpl opens _wal_index.d and encounters ENOENT, it tries to translate the error to a dropped-table exception so callers can handle drop races gracefully. The previous translation used engine.isTableDropped(tableToken), which only returned true during the drop-to-purge window. Once WalPurgeJob swept the reverse-map entry, the original ENOENT propagated as a CRITICAL-level exception, causing concurrent inserts racing a DROP TABLE to fail unexpectedly. This fix switches the translation predicate to check whether the table token lookup by directory name returns null, covering both stages of drop in a single condition. The semantics for other callers remain unchanged.
The fast symbol-keyed path in WindowJoinFastRecordCursorFactory blindly cast the master cursor's getSymbolTable() result to StaticSymbolTable. When the master factory wrapped the symbol through a SymbolColumn (e.g. a CTE or sub-select that projected the symbol through a VirtualRecord), the cursor returned the SymbolColumn function itself, which is not a StaticSymbolTable, causing the join to fail with a ClassCastException. This fix unwraps the projected SymbolFunction down to its underlying StaticSymbolTable and tightens the plan-time gate in SqlCodeGenerator so the symbol-keyed fast path is selected only when both columns expose a static symbol table.
The JSON query endpoints emitted CHAR values as a raw char between two ASCII quotes with no escaping. When a CHAR row contained " (0x22), the response carried three literal quotes; for \ (0x5C) the closing quote was escaped away, leaving the JSON object unterminated; for any C0 control byte the raw byte was inlined verbatim. In every case strict JSON parsers failed mid-response, so any client running a SELECT over a table with such values would see a parser error instead of data. This fix ensures CHAR values are now escaped using the same rules as string values: backslash-prefix for " and \, the short \b/\f/\n/\r/\t form for the usual control chars, \uXXXX for the rest, and pass-through UTF-8 for everything from U+0020 upward.
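The escaping rules listed above can be sketched in a few lines of Python and checked against a strict JSON parser. The function names are hypothetical; the sketch only mirrors the rules stated in this entry and is not the server's serializer.

```python
import json

# Short escape forms for the quote, the backslash and the usual control characters.
SHORT_ESCAPES = {
    '"': '\\"', '\\': '\\\\',
    '\b': '\\b', '\f': '\\f', '\n': '\\n', '\r': '\\r', '\t': '\\t',
}


def escape_json_char(ch):
    """Escape a single character the same way a JSON string value would be."""
    if ch in SHORT_ESCAPES:
        return SHORT_ESCAPES[ch]
    if ord(ch) < 0x20:
        return f"\\u{ord(ch):04x}"  # remaining C0 control characters
    return ch  # U+0020 and above pass through unchanged


def render_char(ch):
    """Render a CHAR value as a JSON string literal."""
    return '"' + escape_json_char(ch) + '"'


def naive_render(ch):
    """The pre-fix behaviour: raw character between two quotes."""
    return '"' + ch + '"'
```

`json.loads(render_char('"'))` round-trips to a single quote character, whereas `json.loads(naive_render('"'))` raises a decode error because the payload contains three bare quotes.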
This fix bumps @questdb/sql-parser so that autocomplete picks function categories from the cursor's grammar context instead of dumping the full function list at every identifier position, and so that implicit SELECT statement suggestions are offered at the start of a statement. It also fixes an editor hang triggered by a trailing comma in the select list. For example, typing t| now suggests trades, tables, wal_tables, etc.; SELECT * FROM trades ASOF JOIN m| now returns 3 relevant entries instead of 332 mixed results; SELECT * FROM trades WHERE price = c| now returns scalars only instead of mixing scalars and aggregates; and INSERT INTO trades VALUES (n| now returns relevant suggestions like now, now_ns, nullif instead of nothing. Manual resolutions for lodash and lodash-es have been added to resolve vulnerability scan issues.
This fix corrects a user-visible typo in the Web Console error message, changing 'An error occured, please try again' to use the correct spelling 'occurred'.

April 13, 2026

QuestDB 9.3.5 introduces lateral joins, SQL-standard UNNEST, statistical window functions, and corrects SAMPLE BY timezone handling during DST transitions. It also delivers multiple join performance improvements and important Parquet export fixes.

Breaking Changes

This fix introduces a new timestamp_floor_utc function that floors timestamps in local time and converts back to UTC internally, replacing the previous approach of wrapping the query in an extra to_utc() conversion model. For sub-day strides, it uses the standard (non-DST) timezone offset to keep bucket widths uniform in UTC space, avoiding ambiguity during fall-back transitions. DST fall-back (clocks go back) now produces two rows instead of one for the repeated local hour — previously both passes through the repeated hour were merged into a single bucket. Output timestamps can be non-monotonic in local time during fall-back, though the underlying UTC timestamps remain monotonic. Sub-day bucket boundaries are now uniform in UTC rather than in local time. These changes do not affect queries without TIME ZONE, queries with fixed-offset timezones, or super-day strides (day, week, month, year). The fix also corrects an offset-in-floor-anchor bug where DST-aware and fixed-offset paths produced different bucket assignments, fixes the materialized view refresh iterator for sub-day DST timezones with offset, and resolves a native memory leak in FillRangeRecordCursorFactory.

New Features

Users can now declare bloom filters as part of column metadata via the existing PARQUET() clause in CREATE TABLE and ALTER TABLE statements. Previously, bloom filter columns had to be specified each time a partition was converted to parquet. The PARQUET() clause accepts an optional trailing BLOOM_FILTER keyword, which can be used as the sole argument, combined with encoding, or with both encoding and compression (e.g., PARQUET(DELTA_BINARY_PACKED, ZSTD(3), BLOOM_FILTER)). The bloom filter flag is stored in bit 25 of the existing 32-bit parquetEncodingConfig field, requiring no on-disk format change. When convertPartitionNativeToParquet runs without an explicit bloom_filter_columns override, the TableWriter scans per-column metadata for the bloom filter flag automatically. An explicit bloom_filter_columns in CONVERT PARTITION WITH(...) still overrides metadata flags. SHOW CREATE TABLE renders the flag in lowercase (bloom_filter). SET PARQUET(...) replaces the entire parquet config for the column, so users must re-specify BLOOM_FILTER when changing only the encoding. FPP (false positive probability) remains a global setting (partition.encoder.parquet.bloom.filter.fpp) and is not configurable per-column. The vendored parquet-format-safe crate is patched to include bloom_filter_length on ColumnMetaData, improving compatibility with readers that rely on this field to locate bloom filter boundaries.
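Storing the flag in a spare bit of an existing 32-bit field is a standard bit-mask technique, sketched below. The sketch assumes bit 25 is counted from the least significant bit starting at zero, and the sample configuration value is a placeholder; the real layout of the other bits in parquetEncodingConfig is not described here.

```python
BLOOM_FILTER_FLAG = 1 << 25  # assumption: zero-based bit index from the LSB
UINT32_MASK = 0xFFFFFFFF


def with_bloom_filter(config, enabled):
    """Return the 32-bit config with the bloom filter flag set or cleared."""
    if enabled:
        return (config | BLOOM_FILTER_FLAG) & UINT32_MASK
    return config & ~BLOOM_FILTER_FLAG & UINT32_MASK


def has_bloom_filter(config):
    """Check whether the bloom filter flag is present."""
    return (config & BLOOM_FILTER_FLAG) != 0


sample_config = 0x00001203  # placeholder encoding/compression bits
flagged = with_bloom_filter(sample_config, True)
```

Setting and clearing the flag leaves the remaining bits untouched, which is why no on-disk format change is needed for values written before the flag existed.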
This feature introduces a full suite of statistical aggregate window functions. STDDEV_POP(), STDDEV_SAMP(), and STDDEV() compute standard deviation, while VAR_POP(), VAR_SAMP(), and VARIANCE() compute variance, reusing the standard deviation base via an isSqrt flag. COVAR_POP(), COVAR_SAMP(), and CORR() provide bivariate statistical analysis through a new bivariate abstract base class. All functions support all frame modes including ROWS, RANGE, PARTITION BY, and unbounded/bounded frames, as well as two-pass whole-partition variants. Non-removable frames use Welford's online algorithm for numerical stability. STDDEV is an alias for STDDEV_SAMP(), and VARIANCE is an alias for VAR_SAMP(). A fix to FunctionParser was also included to correctly resolve group-by vs window factory when an argument requires implicit cast.
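Welford's online algorithm, mentioned above for non-removable frames, updates the mean and the sum of squared deviations one value at a time. The sketch below is a generic textbook version in Python; the class and method names are illustrative and do not correspond to the QuestDB classes.

```python
import math


class RunningStats:
    """Welford's online algorithm for mean, variance and standard deviation."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def var_pop(self):
        return self.m2 / self.n if self.n > 0 else None

    def var_samp(self):
        return self.m2 / (self.n - 1) if self.n > 1 else None

    def stddev_samp(self):
        variance = self.var_samp()
        return None if variance is None else math.sqrt(variance)


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(value)
```

For this input the mean is 5, the population variance is 4, and the sample variance is 32/7. Sample variants return no value for a single row because the divisor n - 1 would be zero.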
This feature introduces a new server property cairo.metadata.cache.snapshot.ordered (default: false) that, when enabled, causes all table-listing functions to return rows sorted alphabetically by table name. The sort is maintained incrementally inside a new CharSequenceObjSortedHashMap data structure backed by a CharSequenceSortedList, which inserts at the correct sorted position using binary search. This means lookup remains O(1) via hash and ordered iteration is O(n), with no post-query sort step required. A new CharSequenceObjMap interface allows callers to hold either the sorted or unsorted implementation transparently. The feature applies to SHOW TABLES, all_tables(), and related table-listing functions, and correctly handles drop-and-recreate as well as rename scenarios.
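The combination of a hash map for lookup and a sorted key list maintained by binary-search insertion can be sketched with Python's bisect module. The class below is a simplified analogue, not the CharSequenceObjSortedHashMap implementation, and its name and methods are hypothetical.

```python
import bisect


class SortedNameMap:
    """Hash lookup plus a key list kept in sorted order on every insert."""

    def __init__(self):
        self._values = {}
        self._names = []

    def put(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)  # binary search for the insert position
        self._values[name] = value

    def remove(self, name):
        if name in self._values:
            del self._values[name]
            del self._names[bisect.bisect_left(self._names, name)]

    def get(self, name):
        return self._values.get(name)

    def items(self):
        return [(name, self._values[name]) for name in self._names]


tables = SortedNameMap()
for table_name in ["trades", "asks", "quotes"]:
    tables.put(table_name, {"wal": True})
tables.remove("quotes")       # rename quotes -> bids
tables.put("bids", {"wal": True})
```

Iterating `tables.items()` returns asks, bids, trades in that order without a sort step at query time, and a rename is handled as a removal followed by an insertion at the new sorted position.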
Tick expressions like [2024-01, 2024-02]T09:30@America/New_York#workday;6h29m now work without requiring explicit day ranges ([2024-01-[01..31]]). The compiler detects month-level and year-level date elements with time override suffixes and expands them to individual days before applying the time. This also works with bracket ranges (2024-[01..02]T09:30), time list brackets (T[09:00,14:00]), bare expressions without brackets (2024-01T09:30), year-level dates ([2024]T09:30), and duration plus day filter combinations ([2024-01]#workday;6h29m). Month validation was added before getDaysPerMonth to prevent ArrayIndexOutOfBoundsException on invalid month values. A parseMonthLevelDate helper was extracted to deduplicate YYYY-MM parsing and validation. Heap allocation of long[] dayStarts was eliminated by compacting day starts to the front of the existing LongList and reusing positions after them for per-day parsing.
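The month-to-day expansion, including the month validation step, can be sketched as follows. The sketch assumes that the workday filter means Monday through Friday; the function name and parameters are hypothetical and do not reflect the compiler's internals.

```python
import calendar
from datetime import date


def expand_month(year, month, workdays_only=False):
    """Expand a YYYY-MM element into individual days, optionally keeping Mon-Fri only."""
    if not 1 <= month <= 12:
        # Validate before looking up the month length, as the entry describes.
        raise ValueError(f"invalid month: {month}")
    days_in_month = calendar.monthrange(year, month)[1]
    days = [date(year, month, day) for day in range(1, days_in_month + 1)]
    if workdays_only:
        days = [d for d in days if d.weekday() < 5]
    return days


january = expand_month(2024, 1)
january_workdays = expand_month(2024, 1, workdays_only=True)
```

January 2024 expands to 31 days, 23 of which are weekdays; each resulting day would then receive the time override such as T09:30.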
This feature introduces UNNEST as a FROM-clause operator supporting all typed array column types, multiple arrays with NULL padding, and WITH ORDINALITY. For JSON arrays, a COLUMNS(name TYPE, ...) syntax declares output columns from VARCHAR-stored JSON arrays. Supported JSON column types include DOUBLE, LONG, INT, SHORT, BOOLEAN, VARCHAR, and TIMESTAMP. UnnestRecordCursorFactory wraps the base factory and emits one output row per array or JSON element. The UnnestSource interface abstracts the difference between typed arrays (ArrayUnnestSource) and JSON arrays (JsonUnnestSource), keeping cursor and record logic shared. JSON field extraction uses simdjson via JSON Pointer queries, and a native truncated flag detects values exceeding the 4KB extraction limit, throwing an error instead of silently truncating. Example usage: SELECT u.price, u.name FROM events e, UNNEST(e.payload COLUMNS(price DOUBLE, name VARCHAR)) u. Typed array and JSON sources can be mixed in the same UNNEST call.
This feature adds lateral join support, allowing subqueries in the FROM clause to reference columns from preceding tables. Correlated lateral subqueries are decorrelated at the optimizer into standard joins, enabling set-based execution instead of per-row nested-loop evaluation. The decorrelation technique is based on the Neumann and Kemper "Unnesting Arbitrary Queries" approach. The rewriter (LateralJoinRewriter) runs three passes during SQL optimization: correlation analysis that tags literals referencing outer tables, decorrelation that builds deduplicated outer-reference subqueries and rewrites correlated references, and an elimination pass that attempts to remove outer-reference join models when all correlations resolve to equalities. Operator-specific compensation preserves semantics for GROUP BY, SAMPLE BY, window functions, DISTINCT, LATEST BY, LIMIT, and set operations. Supported syntax includes JOIN LATERAL for inner lateral joins, LEFT JOIN LATERAL for left lateral joins with NULL fill, and standalone LATERAL for implicit cross joins. Example: SELECT o.id, t.total FROM orders o JOIN LATERAL (SELECT sum(qty) AS total FROM trades WHERE order_id = o.id) t.
ALTER TABLE ADD COLUMN and ALTER TABLE DROP COLUMN now work correctly on tables whose partitions have been converted to Parquet format. Previously, schema changes on Parquet-stored partitions could produce corrupted files or query errors because the Parquet column layout was assumed to match the current table schema. Added columns read as all-NULL at read time, while dropped columns' data remains in the file until an O3 merge rewrites the affected row groups. During O3 merge, rewritten row groups reflect the current schema: added columns get all-NULL chunks, and dropped columns are omitted. Bitmap indexes are rebuilt when converting between native and Parquet formats if the index files are missing or stale. The WAL writer correctly marks new symbol columns as nullable when uncommitted rows exist at the time of ADD COLUMN.
This feature extends HORIZON JOIN to accept multiple right-hand-side (slave) tables in a single query, enabling users to aggregate columns from several time-series sources against a common master table and offset grid in one statement. Both single-threaded and parallel (page-frame-based) execution paths are supported for multi-slave HORIZON JOIN, including keyed (ON symbol) and non-keyed (timestamp-only ASOF) variants, as well as mixed keyed/non-keyed slaves within the same query. The last HORIZON JOIN in the chain carries the RANGE/LIST and AS clauses; preceding HORIZON JOIN clauses omit them. For example: SELECT avg(b.bid) AS avg_bid, avg(a.ask) AS avg_ask FROM trades AS t HORIZON JOIN bids AS b ON (t.sym = b.sym) HORIZON JOIN asks AS a ON (t.sym = a.sym) LIST (-2s, 0, 2s) AS h GROUP BY h.offset. Internal refactoring replaces LongList offsets with long[] to avoid on-heap allocations on the hot path, and replaces the always-forward scan strategy with an adaptive backward/forward approach controlled by configurable thresholds (cairo.sql.horizon.join.bwd.scan.* properties).
This feature adds RLE dictionary decoding for STRING columns in the Parquet reader by wiring the existing BaseVarDictDecoder, RleDictionarySlicer, and StringColumnSink together in decode_byte_array_dispatch. The default writer encoding for VARCHAR columns has been changed from RleDictionary to DeltaLengthByteArray, aligning it with STRING and Binary defaults.
This feature introduces the arg_max(varchar, key) aggregate function for the key types timestamp, double, long, and int. The function returns the varchar value corresponding to the row where the key column reaches its maximum. Each variant supports parallel group-by execution (Async Group By) with per-worker function cloning and pointer-based merge, using StableAwareUtf8StringHolder for efficient off-heap varchar storage that avoids copying on every new max. NULL key rows are skipped and do not affect the result, while a NULL varchar value at the max key is correctly returned as NULL.

Improvements

This improvement extends the convertSymbolJoinKeysToInt optimization, previously used only by AsOf/LT joins, to hash joins. Static SYMBOL-to-SYMBOL key pairs are now compared as integers rather than strings, with a SymbolTranslatingRecord wrapping the build-side record to translate build-side symbol IDs into the probe-side encoding. Translation runs only on the build side, so the probe side stays on a straight getInt() path with zero added overhead. All six hash join variants (inner/left/right/full outer × light/full-fat) are covered.
Queries on wide tables that reference only a subset of columns now open partitions faster. TableReader tracks which columns the current query needs via a BitSet and skips memory-mapping inactive columns when opening partitions, reducing mmap/munmap syscall overhead in proportion to the number of unreferenced columns. For a query touching 1 column out of 100, roughly 99% of mmap calls are avoided per partition open. The active column set is deduplicated and falls back to mapping all columns when the list is null, empty, or covers every column. When the active set broadens, already-open partitions map any newly needed columns. On pool return, goPassive() clears the active column state so subsequent opens map all columns. When no active columns are set (Parquet export, direct callers), the reader maps all columns as before.
This improvement eliminates eager opening of all partitions at query start for ASOF JOIN, HORIZON JOIN, and WINDOW JOIN queries on large slave tables. Time frame cursors now pre-compute exact page frame boundaries from table metadata (column tops, row counts, partition formats) without opening partitions. Actual partition opening happens lazily on first access via ensurePartitionOpened(), so queries targeting a narrow time range skip I/O for untouched partitions. Both single-threaded (TimeFrameCursorImpl) and concurrent (ConcurrentTimeFrameState + ConcurrentTimeFrameCursorImpl) cursors follow a two-phase approach: an upfront phase that pre-computes page frame boundaries from partition metadata stored as UninitializedPageFrame entries, and a lazy phase that opens partitions and patches zero-address entries with real mmap addresses. Fallback paths that bypass lazy opening include partitions already open in the table reader, cursors with interval filters, and Parquet partitions. The concurrent path uses double-checked locking with AtomicIntegerArray for safe partition opening across worker cursors. Queries accessing a subset of partitions avoid opening irrelevant partitions entirely, with zero-GC on the hot path.
Previously, calculateInsertTransactionBlock() excluded the last loaded INSERT from the transaction block, forcing it to be processed alone. This improvement includes it in the block, allowing consecutive INSERT operations to be committed together via processWalCommitBlock() instead of one-by-one via processWalCommit(). The previous behavior caused unnecessary LAG usage on empty tables (e.g., post-TRUNCATE with few rows), creating artificial 0-row partitions that could race with backup compression. The root cause was a backward loop that assigned FORCE_FULL_COMMIT to the last loaded INSERT, and the block calculation loop broke before incrementing blockSize, excluding the transaction. The fix increments blockSize before the break, which is safe because the backward loop propagates FORCE_FULL_COMMIT from structural changes (ALTER, TRUNCATE) to the INSERT before them, so the block loop never reaches a non-data transaction.
ASOF JOIN and LT JOIN factories with multi-key symbol columns now compare join keys as integers instead of converting symbol IDs to strings. SymbolTranslatingRecord translates master symbol IDs to slave symbol IDs via a cached IntIntHashMap, enabling integer-based map lookups and memeq() comparisons instead of variable-length string hashing and comparison. This optimization applies to AsOfJoinFastRecordCursorFactory, AsOfJoinDenseRecordCursorFactory, AsOfJoinLightRecordCursorFactory, FilteredAsOfJoinFastRecordCursorFactory, and LtJoinLightRecordCursorFactory. Single-symbol joins retain their existing specialized paths. SymbolTranslatingRecord gains a hadNonExistentKey() flag set during getInt() calls, eliminating a separate hasNonExistentKey() pre-scan. Benchmarks on a 100M master / 50M slave row dataset show multi-key ASOF JOIN improving from 8.11s to 5.2s.
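The translation step can be sketched as a cached mapping from one table's symbol IDs to the other's. The class below is a simplified model: symbol tables are plain Python lists, -1 stands in for a missing symbol, and all names are illustrative rather than taken from SymbolTranslatingRecord.

```python
class SymbolTranslator:
    """Translate master-side symbol IDs into slave-side IDs through a cache."""

    def __init__(self, master_symbols, slave_symbols):
        self._master = master_symbols  # index is the master symbol ID
        self._slave_ids = {name: i for i, name in enumerate(slave_symbols)}
        self._cache = {}
        self.had_non_existent_key = False

    def translate(self, master_id):
        if master_id not in self._cache:
            name = self._master[master_id]
            self._cache[master_id] = self._slave_ids.get(name, -1)
        slave_id = self._cache[master_id]
        if slave_id == -1:
            # Recorded during the read itself, so no separate pre-scan is needed.
            self.had_non_existent_key = True
        return slave_id


translator = SymbolTranslator(["EURUSD", "GBPUSD", "USDJPY"], ["GBPUSD", "EURUSD"])
```

After translation, join keys on both sides are integers in the same encoding, so they can be hashed and compared without materializing the symbol strings. In this example `translator.translate(0)` returns 1, and `translator.translate(2)` returns -1 and sets the flag because USDJPY is absent from the slave table.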
This improvement speeds up constant-index element access on 1D and 2D double array columns (e.g. arr[1], arr[3], arr[2,1]) by bypassing full ArrayView construction. The hot path now reads directly from AUX/data pages via Unsafe, skipping IntList operations, stride computation, and BorrowedFlatArrayView setup that BorrowedArray.of() performs per row. A new Record.getArrayDouble1d2d() method with an optimized override in PageFrameMemoryRecord takes two zero-based indices and dispatches internally between the 1D and 2D paths with a single Unsafe.getDouble return. DoubleArrayAccessFunctionFactory detects constant positive indices on 1D/2D column functions at compile time and routes through the fast path. Negative indices and 3D+ arrays fall through to the existing getArray path. Benchmarks on a 10M-row table show approximately 2.4x speedup for queries like SELECT sum(arr[1]) FROM t.
This improvement adds a table selector to the table details drawer header when there is no table details history, reusing the same selector as the metrics table selector. It also includes styling updates for the metrics header (table name, title, actions) and layout updates to the metrics dashboard.

Bug Fixes

After DROP COLUMN, a column's reader index (dense position in the live column list) diverges from its writer index (permanent ID stored as field_id in parquet files). Several code paths confused the two, producing type mismatch errors, corrupt bitmap indexes, or suspended WAL tables. This fix addresses five distinct issues: bitmap index rebuild using the wrong parquet column during checkpoint/backup recovery; a doubled parquet file path that silently skipped bitmap index rebuilds; a race condition where parallel parquet bitmap rebuild tasks read stale metadata when recovering multiple tables; COPY TO parquet export flagging the wrong column as designated timestamp after DROP COLUMN; and ALTER TABLE ALTER COLUMN TYPE crashing on tables with parquet partitions, which is now avoided by converting parquet partitions back to native before the column type conversion starts.
This fix allows signed duration segments when parsing and compiling tick expressions (e.g., timestamp in '2026-01;-3d'). Intervals are now normalized when durations move backwards so that static and compiled paths produce matching results. The compiled path in emitSingleVar previously produced inverted intervals for non-day-level variables like $now;-1h. Additionally, addYears day-of-month clamping was corrected in Micros, Nanos, and Dates so that Feb 29 + 1y now correctly produces Feb 28 instead of Mar 1. Calendar drift in repeating month/year intervals is fixed by computing each interval from the base timestamp instead of iterating. Non-positive counts in repeating interval syntax are now rejected with a clear error message.
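Both the day-of-month clamping and the calendar drift can be demonstrated with a small date helper. This is a minimal sketch using Python's standard library; the function names are hypothetical and the logic only mirrors the behaviour described in this entry.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Add calendar months, clamping the day to the last day of the target month."""
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def add_years(d, years):
    return add_months(d, 12 * years)


def repeat_from_base(base, step_months, count):
    """Compute every occurrence from the base date, so clamping never accumulates."""
    return [add_months(base, step_months * i) for i in range(count)]


def repeat_iteratively(base, step_months, count):
    """Chain each occurrence from the previous one, which lets clamping drift."""
    out, current = [], base
    for _ in range(count):
        out.append(current)
        current = add_months(current, step_months)
    return out
```

`add_years(date(2024, 2, 29), 1)` gives 2025-02-28. Starting from 2024-01-31 with a one-month step, the base-relative series is Jan 31, Feb 29, Mar 31, while the iterative series drifts to Mar 29 after passing through February.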
The projection self-reference check in doReplaceLiteral0() only inspected the first join model, missing columns from other join models. This caused "Invalid column" errors when a function wraps a column from a joined table (e.g., coalesce(c, 0) where c comes from a secondary join model) and table-prefixed columns make the translating model non-redundant. This fix replaces the single-model check with columnNotExistsInJoinModels() that scans all join models.
This fix addresses three related issues. First, CopyExportRequestTask incorrectly treated var-size columns as fully NULL when their data was entirely inlined into the aux vector, because the colTop heuristic checked the data page address (which is legitimately 0 for inlined varchars). It now checks the aux page address for var-size columns. Second, the streaming Parquet reader now supports multiple dictionary pages per column chunk by allocating a fresh buffer per varchar-slice dict page, preventing aux entries decoded against earlier dict pages from being corrupted when later ones are decompressed. Third, the ASCII flag is now normalized when serializing VARCHAR keys in SingleRecordSink and OrderedMap, so two equal varchars whose ASCII-flag provenance differs hash and compare as the same key. Without this normalization, GROUP BY, DISTINCT, and hash-join operations could treat visually identical values as distinct groups.
This fix corrects a bug in convertSymbolJoinKeysToInt() where writeSymbolAsString bits were unset while iterating over join key columns. When master and slave symbol columns had crossing indices (e.g., master.symbol1=col2, slave.symbol2=col2), the first pair's unset operation cleared the bits needed by the second pair, leaving keyTypes as STRING while the RecordSink generated getInt() code. This type mismatch produced garbled map keys and missed matches.
When DirectLongLongHashMap.restoreInitialCapacity() or rehash() encountered an allocation failure (OOM or RSS memory limit exceeded), the map was left in an inconsistent state where capacity > 0 but ptr = 0. A subsequent FastGroupByAllocator._close() call would iterate over the capacity and dereference the null pointer, causing a SIGSEGV. This fix moves capacity, mask, and ptr updates to after the allocation succeeds, so a failed malloc leaves the map in its prior consistent state.
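The "allocate first, publish state afterwards" ordering can be sketched as follows. `DirectMap` and `limited_alloc` are hypothetical stand-ins, and a `bytearray` takes the place of native memory; the point is only that a failed allocation leaves every field untouched.

```python
class DirectMap:
    """Toy stand-in for an off-heap map with capacity, mask and ptr fields."""

    def __init__(self, allocate, capacity):
        self._allocate = allocate
        self.ptr = allocate(capacity)
        self.capacity = capacity
        self.mask = capacity - 1

    def rehash(self, new_capacity):
        # Allocate first: if this raises, no field has been modified yet.
        new_ptr = self._allocate(new_capacity)
        # Publish the new state only after the allocation succeeded.
        self.ptr = new_ptr
        self.capacity = new_capacity
        self.mask = new_capacity - 1


def limited_alloc(size, limit=64):
    """Hypothetical allocator that simulates hitting a memory limit."""
    if size > limit:
        raise MemoryError("simulated memory limit exceeded")
    return bytearray(size)
```

If the fields were updated before the allocation call, a failure would leave a non-zero capacity paired with a stale or null pointer, which is the inconsistent state described above.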
The ALTER TABLE ALTER COLUMN TYPE statement now supports converting between DECIMAL and VARCHAR/STRING types. The type conversion validation matrix in SqlCompilerImpl was updated to include these entries. Four new conversion methods handle the data transformation: DECIMAL to VARCHAR/STRING reads each value through a loader, sets the scale on a thread-local Decimal256, and writes the string representation; VARCHAR/STRING to DECIMAL parses each string through Decimal256.ofString() and stores the result. NULL values round-trip correctly in both directions. All six DECIMAL storage sizes (DECIMAL8 through DECIMAL256) are supported for both conversion directions. This change also corrects VARCHAR aux-vector sizing in the shared fixed-to-VARCHAR conversion path.
This fix resolves an issue where clients using the PostgreSQL extended query protocol (such as postgres.js) to iterate cursors in batches via unnamed portals would receive a "spurious execute message" error on the second Execute. The root cause was that after a Flush, the suspended cursor was freed for unnamed portals, the factory was moved into the cache, and the pipeline entry was released to the object pool, leaving pipelineCurrentEntry null when the next Execute arrived. The fix retains suspended unnamed portal pipeline entries across Flush/Sync so the next Execute can resume iteration, adds a stateSuspended flag to track portal suspension state accurately, and frees suspended cursors on abandonment (new Parse/Bind) or explicit Close. Named portals (used by JDBC's setFetchSize) are unaffected as they already survived via the namedPortals map.
Go's JSON marshaller omits the fractional part entirely when a time value has zero sub-second precision, producing values like 2026-03-31T09:02:28Z instead of 2026-03-31T09:02:28.000000000Z. This broke QuestDB timestamp patterns such as .U+ and .N+, which previously required a dot and at least one digit. This fix normalizes the fractional component into optional fraction opcodes in the micros and nanos timestamp compilers so .U+ and .N+ accept either fractional digits or no fractional part at all, while still rejecting a bare dot. Both the generic and ASM parser paths share the relaxed behavior.
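The relaxed behavior can be pictured with a regular expression analogy. This is not how the QuestDB timestamp compilers work internally (they emit opcodes, per the entry above); the pattern below is an assumption used only to show the accept/reject rule.

```python
import re

# Seconds, then an optional group of "." followed by one or more digits, then "Z".
# The fraction group is all-or-nothing, so a bare "." is still rejected.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z")

def accepts(text: str) -> bool:
    """Return True when the whole string matches the relaxed timestamp shape."""
    return TIMESTAMP.fullmatch(text) is not None
```

Both forms produced by Go's marshaller pass, while a trailing dot with no digits does not.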
This fix resolves an issue where copyOrRebuildColumnIndexes() skipped columns only when colTop == -1, but rebuildPartitionIndexFiles() also skips when colTop >= partitionRowCount (column has no data). When a symbol column was added after a Parquet partition existed, O3 merge set colTop == partitionRowCount. The Parquet-to-native conversion correctly skipped building index files, but the subsequent native-to-Parquet conversion tried to hard-link the non-existent .k/.v files, suspending the table. The skip condition in copyOrRebuildColumnIndexes() was aligned with rebuildPartitionIndexFiles() by also skipping when colTop >= partitionRowCount.
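The aligned skip condition can be written as a single predicate. `should_skip_index` is a hypothetical name; the two comparisons are taken directly from the description above.

```python
def should_skip_index(col_top: int, partition_row_count: int) -> bool:
    """True when the column has no index files to copy or rebuild in this partition."""
    # col_top == -1 was the only case handled before the fix; the second
    # comparison covers a column whose data starts at or past the partition end.
    return col_top == -1 or col_top >= partition_row_count
```

A symbol column added after the partition was written has col_top equal to the partition row count, so both conversion directions now agree that there is nothing to link.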
When a backup checkpoint was in progress, the backup read partition data and column files from the live database directory. If partition directories or column version files were deleted while the checkpoint was snapshotting metadata, the backup failed because it tried to read files that no longer existed. This fix adds checkpoint guards to all synchronous and asynchronous file removal paths to prevent deletion while a checkpoint is active. TableWriter.processPartitionRemoveCandidates0 now checks isInProgress() and defers old partition directory removal to async purge when a checkpoint is active, while rollback orphans bypass the guard since they are not in any committed snapshot. TableWriter.finishColumnPurge forces async-only column version purge during checkpoint to prevent synchronous deletion of .d/.i files. The same checkpoint guard is applied for column purge triggered by ALTER COLUMN CONVERT and UPDATE operations. The async purge job (O3PartitionPurgeJob and ColumnPurgeOperator) also skips file deletion during checkpoint, closing the race where the purge job runs before the checkpoint pins a table's scoreboard.
This fix resolves an issue where CopyExportFactory.getCursor() enqueues export work onto a ring queue that a worker thread picks up, but BaseParquetExporter.of() constructed the worker's SqlExecutionContext with a null BindVariableService. This caused a "bind variable service is not provided" error when bind variables were used in the COPY subquery (e.g. COPY (SELECT * FROM t WHERE ts <= $1) TO ...). The fix threads the caller's BindVariableService through CopyExportRequestTask so the worker's execution context receives it. Bind variable values are deep-copied via BindVariableServiceImpl.snapshot() before enqueueing the task, since the PostgreSQL Wire Protocol thread may clear or repopulate the BindVariableService for the next query before the worker starts. The HTTP export path keeps the direct reference since it runs synchronously on the same thread.
This fix resolves a NullPointerException that occurred when WalPurgeJob attempted to fetch metadata for a table that was concurrently dropped. A concurrent drop could close the metadata pool tenant's txFile during refresh, causing a NullPointerException rather than a CairoException. The fix updates WalPurgeJob.fetchSequencerPairs() to catch Throwable (not just CairoException) when getTableMetadata() fails, and similarly updates AbstractMultiTenantPool.get0() in the newTenant path so the pool slot is always released on failure.
This fix corrects an off-by-one error in O3PartitionJob.mergeRowGroup() that could produce row groups exceeding the 1.5x maxRowGroupSize limit by 1 row. The numChunks formula used 3 * maxRowGroupSize as the divisor, representing 2 * (1.5 * maxRowGroupSize) in real arithmetic. When maxRowGroupSize is odd, integer truncation causes a mismatch: 3 * 551 = 1653 vs 2 * (551 + 551/2) = 2 * 826 = 1652. This difference meant that for certain merge totals, the formula computed fewer chunks than needed, and the even-distribution logic gave the first chunk more rows than the 1.5x limit allowed. The fix computes numChunks from the integer-exact maxChunkTarget (maxRowGroupSize + maxRowGroupSize / 2), ensuring no chunk exceeds the 1.5x bound after remainder distribution.
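The arithmetic above can be reproduced in a short sketch. The exact shape of the formula in mergeRowGroup() is not given in this note, so the ceiling division and the "first chunks take the remainder" distribution below are assumptions reconstructed from the description; the function names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive operands."""
    return -(-a // b)

def distribute(total: int, num_chunks: int) -> list:
    """Split total rows evenly; the first chunks absorb the remainder."""
    base, rem = divmod(total, num_chunks)
    return [base + 1] * rem + [base] * (num_chunks - rem)

def chunks_old(total: int, max_rg: int) -> list:
    # Divisor 3 * max_rg stands for 2 * (1.5 * max_rg) in real arithmetic.
    return distribute(total, ceil_div(2 * total, 3 * max_rg))

def chunks_new(total: int, max_rg: int) -> list:
    # Integer-exact bound: max_rg + max_rg // 2 (826 for max_rg = 551).
    max_chunk_target = max_rg + max_rg // 2
    return distribute(total, ceil_div(total, max_chunk_target))
```

For maxRowGroupSize = 551 and a merge total of 1653 rows, the old form yields two chunks of 827 and 826 rows, one row over the 826-row bound, while the new form yields three chunks of 551 rows.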
The validateWindowJoins function only recursed into nestedModel and unionModel, missing WINDOW JOIN definitions inside join model subqueries. This left WindowJoinContext uninitialized (lo/hi not computed), causing aggregates to return erroneous results when a WINDOW JOIN appeared under another join, such as inside a subquery used as a JOIN source.
The propagateTopDownColumns0() method emits join-condition columns to each join model, but the expression nodes carried the outer query's table aliases. When the join model was a view, those aliases could not be resolved within the view's own scope, so addTopDownColumn silently dropped them. The optimizer then pruned the column from the view's output, causing an InvalidColumnException at code-generation time when the join tried to look up the missing column. This fix switches to using plain column names without table-alias prefixes, which can be resolved in the join model's alias map.
This fix replaces header click-to-insert with per-column copy buttons and tracks grid selection state to disable "move column to front" when nothing is selected. It stops the Monaco context menu from being clipped by overflow containers, consolidates DDL queries to strip redundant blank lines, and adds truncation for long DDL in the table details drawer. Format shortcuts are rebound to Alt+Shift+F / Alt+F, toolbar and result bar spacing is tightened, long materialized view DDLs are truncated, and the LiteEditor icon now switches based on diff vs regular mode.

March 26, 2026

This release introduces new SQL capabilities including the COPY PERMISSIONS statement for cloning access profiles between principals, per-column Parquet encoding and compression controls, and dynamic window ranges in WINDOW JOIN queries. Performance improvements span faster ORDER BY on TCA queries, expanded vectorized GROUP BY coverage, and improved Parquet I/O throughput. The release also addresses several critical bug fixes including data corruption in DECIMAL columns, JVM crashes during backup coordination, and WAL replication race conditions.

New Features

This feature introduces three new permission constants: CONVERT PARTITION TO PARQUET, CONVERT PARTITION TO NATIVE, and SET PARQUET ENCODING. Authorization checks are wired into all enterprise SecurityContext implementations, including EntSecurityContextBase, AdminSecurityContext, and AbstractReplicaSecurityContext. The PermissionParser was updated to handle multi-word permission names containing SQL keywords (such as TO in CONVERT PARTITION TO PARQUET) by adding isPermissionPrefix() to disambiguate keywords from permission name continuations. All three operations are denied on replica security contexts. Previously, these ALTER TABLE operations piggybacked on authorizeAlterTableAlterColumnType and could not be granted or revoked independently.

Bug Fixes

The WAL uploader could enter an infinite retry loop when it encountered a segment whose upload.pending marker file was never created. This occurred when WalWriter.openNewSegment() advanced minSegmentLocked even when createSegmentDir() threw due to a transient disk error. When the distressed WalWriter was later closed, a segmentClosed event fired for a segment that never had its upload.pending file created, causing the Rust uploader to retry deleting the non-existent file indefinitely. This fix makes clear_segment_pending_file in the Rust uploader treat ENOENT as success, since the desired end state — upload.pending absent — is already achieved. This is safe with respect to the WAL cleaner because phantom segments are harmless: object stores return success when deleting non-existent keys, and the cleaner's compaction watermark has gap protection that prevents it from skipping real segments.
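The "treat ENOENT as success" idea is sketched below in Python for brevity; the actual change is in the Rust uploader, and `clear_pending_file` here is a hypothetical illustration rather than that function.

```python
import os

def clear_pending_file(path: str) -> None:
    """Remove the marker file; a missing file counts as success."""
    try:
        os.remove(path)
    except FileNotFoundError:
        # ENOENT: the file is already absent, which is the desired end state,
        # so there is nothing to retry.
        pass
```

Because the operation is idempotent, calling it for a segment whose marker was never created returns immediately instead of failing and being retried.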
In QuestDB Enterprise, exporting to Parquet via COPY could produce spurious "cancelled by user" errors. The SecurityCheckFactory was inserted as a RecordCursorFactory wrapper for all SELECT queries, even though it only mattered for UPDATE queries where it revalidated column-level permissions. This broke an assumption that the outermost factory for a SELECT is always QueryProgress, causing the code to fail to unwrap the factory chain correctly. QueryProgress then interacted badly with the COPY job, triggering the spurious errors. This fix moves authorization revalidation directly into UpdateOperation, eliminating the need for the SecurityCheckFactory wrapper.
After a server restart, collect_missed_segments walks WAL directories to find segments with upload.pending files left over from a previous run. Because the walk is processed asynchronously, it could encounter WAL directories created after the bounce and queue them as "closed" with a stale last_txn set to the sequencer's startup transaction. This caused the upload.pending clearing loop to remove the file prematurely, after which WalPurgeJob deleted the unprotected segment. The uploader, still needing the segment, entered an infinite retry loop and the table's replication became permanently stuck. This fix passes current_wal_id into collect_missed_segments and skips any segment where wal_id > current_wal_id. Since current_wal_id is read from _wal_index.d before ingestion starts, it reliably represents the last pre-bounce WAL ID. Post-bounce segments are handled through the normal SegmentClosed event flow and do not need recovery via collect_missed_segments.
Without this fix, revoking a user's ALTER or UPDATE permission did not take effect for already-cached prepared statements in the PostgreSQL Wire Protocol, allowing the user to continue executing those operations until the connection was closed. Before the PostgreSQL Wire Protocol layer rewrite, ALTER statements were not cached, so authorizing them only at compile time was sufficient. Since the rewritten layer can now cache these commands, this fix ensures that authorization also happens at execution time.

QuestDB 9.3.4 delivers dynamic windows in WINDOW JOIN, Parquet row group pruning with bloom filters, new array functions, and significant performance improvements across ORDER BY, joins, and Parquet I/O.

Breaking Changes

This change improves query execution performance by folding constant sub-expressions at compile time rather than re-evaluating them on every row. For example, x < '2026-01-01'::timestamp previously kept the cast as an unevaluated function that executed per row, but now evaluates it once at compile time. Additionally, constant reassociation regroups constants of the same associative operator into a single subtree (e.g., col + 1 + 4 becomes col + 5). As a breaking change, float and double Infinity and -Infinity values are now treated as NULL at compile time. Previously, expressions like cast('Infinity' as float) were left unevaluated until runtime, where they returned Float.POSITIVE_INFINITY; with constant folding they are now evaluated at compile time through FloatConstant.newInstance() / DoubleConstant.newInstance(), which collapses Infinity, -Infinity, and NaN to NULL per QuestDB's convention. CASE expressions can no longer branch on Infinity or -Infinity as distinct float/double values.
This change aligns STRING and SYMBOL ordering with the SQL standard byte-order collation and enables range filter pushdown for string columns in Parquet row group pruning. Any code relying on the previous UTF-16 code unit comparison order for string comparisons may observe different ordering results.
This change improves varchar decoding performance from parquet files by introducing a new internal column type: varchar slice. VarcharSlice aux entries store (length, pointer) pairs pointing directly into mmapped Parquet pages or per-page decompression buffers, eliminating byte copies on the read path. The default encoding for Varchar has been changed from Delta Length Byte Array to RLE Dictionary. Benchmarks show decode-only performance improving from ~4.25ms to ~0.75ms for short strings (8 bytes) at 500K rows with cardinality 256.

New Features

This feature adds minTimestamp and maxTimestamp TIMESTAMP columns to sys.telemetry_wal to capture the data timestamp range per WAL transaction event. WAL telemetry is now enabled by default regardless of the main telemetry setting, reducing reliance on logs when investigating the shape of data writes. WalWriter commit log messages have been downgraded from info to debug level unless the commit has a replace range. Schema migration support was also added: when the column count does not match the expected schema, the table is dropped and recreated, which is safe given the 1-week TTL.
This feature introduces array_sort(DOUBLE[]) and array_reverse(DOUBLE[]) scalar functions that operate on double arrays of any dimensionality. array_sort sorts each innermost-dimension slice independently, preserving the array's shape, and accepts optional boolean arguments for descending order and nulls-first placement. array_reverse reverses each innermost-dimension slice. Both functions handle NULL arrays, empty arrays, NaN values, and multidimensional inputs, and support both contiguous unit-stride and non-vanilla array layouts via separate code paths. The internal sort buffer grows on demand and stays at peak size for the cursor's lifetime to avoid allocation churn on the hot path.
This feature introduces four new SQL functions that operate element-wise across DOUBLE[] arrays. Each function works in two modes: variadic (two or more array arguments, per-row) and aggregate (single array column, GROUP BY / SAMPLE BY). The functions support full N-dimensional arrays with automatic shape broadcasting, where the output shape is the per-dimension maximum of all inputs. NULL arrays and NaN elements are skipped, and positions that receive no finite values yield null. array_elem_sum() and array_elem_avg() use Kahan compensated summation for floating-point accuracy. The GROUP BY average variant uses a uniform/variable dual-mode count tracker to avoid per-element count allocation in the common case. Parallel GROUP BY is supported via merge().
Queries on Parquet partitions can now skip entire row groups that contain no matching rows. Row group pruning uses three strategies: min/max statistics (row groups whose per-column value range does not overlap the filter values are skipped), bloom filters (row groups whose bloom filter reports no match are skipped, opt-in via the bloom_filter_columns option), and null count statistics (for IS NULL and IS NOT NULL filters). Pruning applies to all Parquet read paths including forward scan, backward scan, read_parquet(), and parallel page frame execution. Supported filter operations include equality, IN list, comparison operators (<, <=, >, >=), BETWEEN, IS NULL, IS NOT NULL, and OR-connected equalities on the same column. Supported column types include BYTE, SHORT, CHAR, INT, LONG, FLOAT, DOUBLE, TIMESTAMP, DATE, IPv4, UUID, LONG128, STRING, SYMBOL, VARCHAR, and all DECIMAL widths. Bloom filter columns and false positive probability (FPP) can be specified via ALTER TABLE ... CONVERT PARTITION TO PARQUET WITH (bloom_filter_columns = 'col1,col2', fpp = 0.01), COPY TO export, or the /exp HTTP endpoint. New configuration properties include cairo.sql.parquet.row.group.pruning.enabled (default true), cairo.partition.encoder.parquet.bloom.filter.fpp (default 0.01), cairo.parquet.export.bloom.filter.fpp (default 0.01), and cairo.parquet.export.statistics.enabled (default true).
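The min/max strategy can be sketched as a range overlap test. `RowGroupStats` and the helper names are hypothetical; the sketch covers only a closed range filter on one integer column and ignores bloom filters and null counts.

```python
from dataclasses import dataclass

@dataclass
class RowGroupStats:
    """Per-column statistics for one row group (hypothetical shape)."""
    min_value: int
    max_value: int

def may_contain_range(stats: RowGroupStats, lo: int, hi: int) -> bool:
    """True when the filter range [lo, hi] overlaps [min_value, max_value]."""
    return stats.max_value >= lo and stats.min_value <= hi

def row_groups_to_scan(groups: list, lo: int, hi: int) -> list:
    """Indexes of row groups that cannot be ruled out by statistics alone."""
    return [i for i, g in enumerate(groups) if may_contain_range(g, lo, hi)]
```

An equality filter is the special case lo == hi. Statistics can only rule row groups out; a group that passes the test still has to be decoded and filtered row by row.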
This feature adds arg_max() and arg_min() function variants that accept a CHAR type for the value argument. Supported signatures include arg_max(char, timestamp), arg_max(char, long), arg_max(char, double), and the corresponding arg_min() variants. Null keys are ignored during aggregation, while null values are returned when the corresponding key is the max/min. All variants support parallel execution. Example usage: SELECT arg_max(status, created_at), arg_min(status, created_at) FROM events;
This feature allows WINDOW JOIN to accept column references and expressions as RANGE BETWEEN boundaries, in addition to static constants, enabling each left-hand-side row to define its own window size based on its data. For example: SELECT t.ts, sum(d.val) AS agg FROM fx_trades t WINDOW JOIN market_data d RANGE BETWEEN t.lookback minutes PRECEDING AND t.lookahead minutes FOLLOWING. Either or both of the lo/hi boundaries can be dynamic while the other remains a static constant. Boundary expressions must evaluate to an integer and must only reference left-hand-side table columns. Negative values are clamped to zero (equivalent to CURRENT ROW), and NULL values produce NULL aggregates where the row is skipped. Each boundary can optionally include a time unit suffix (seconds, minutes, hours, etc.), and when present the value is scaled to the table's timestamp resolution at runtime. Dynamic windows disable the fast symbol-keyed and vectorized execution paths; queries with an ON key equality clause fall back to the general path with a join filter instead. This feature also makes parallel HORIZON JOIN and WINDOW JOIN queries more responsive to query cancellation by adding circuit breaker checks at the top of every master-row iteration loop.
O3 commits into Parquet partitions previously replaced row groups in-place, leaving orphaned bytes in the file. Repeated merges caused file sizes to grow to 2-3x their useful data. This improvement adds a rewrite mode that periodically writes all data to a fresh file, eliminating dead space. An unused_bytes counter tracked in Parquet metadata drives the rewrite decision. Rewrite triggers when the file has a single row group, when unused_bytes / file_size exceeds a configurable ratio (default 0.5), or when absolute unused bytes exceeds a threshold (default 1 GB). In rewrite mode, untouched row groups are raw-copied with adjusted thrift offsets without decode/re-encode. A new O3ParquetMergeStrategy class computes merge/copy/split actions up-front using min/max timestamp overlap detection, replacing the old iterative merge loop. Row groups that exceed the configured size get split into multiple output groups. Small row groups (< 4096 rows) adjacent to a gap absorb the gap's O3 data to avoid proliferation of tiny row groups. WAL tables can now convert their last (active) partition to Parquet, with the TableWriter routing all WAL data through the O3 merge path. New configuration properties: cairo.partition.encoder.parquet.o3.rewrite.unused.ratio (default 0.5) and cairo.partition.encoder.parquet.o3.rewrite.unused.max.bytes (default 1g).
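The three rewrite triggers listed above can be combined into one predicate. `should_rewrite` is a hypothetical helper; the defaults mirror the documented 0.5 ratio and 1 GB threshold, and interpreting "1g" as 2^30 bytes is an assumption.

```python
GIB = 1024 ** 3  # assumption: "1g" means 2**30 bytes

def should_rewrite(row_group_count: int, unused_bytes: int, file_size: int,
                   max_unused_ratio: float = 0.5,
                   max_unused_bytes: int = GIB) -> bool:
    """Decide whether an O3 merge should write a fresh file instead of patching in place."""
    if row_group_count == 1:
        return True
    if file_size > 0 and unused_bytes / file_size > max_unused_ratio:
        return True
    return unused_bytes > max_unused_bytes
```

The ratio check keeps small files compact, while the absolute threshold bounds dead space in very large files where the ratio alone might never be reached.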
Users can now specify Parquet encoding and compression on a per-column basis using CREATE TABLE and ALTER TABLE SQL syntax. The syntax is PARQUET(encoding [, compression[(level)]]), where both encoding and compression are optional — use default for the encoding when specifying compression only. When omitted entirely, the column uses the global defaults. For example: CREATE TABLE sensors (ts TIMESTAMP, temperature DOUBLE PARQUET(rle_dictionary, zstd(3)), device_id VARCHAR PARQUET(default, lz4_raw)) TIMESTAMP(ts) PARTITION BY DAY;. Existing tables can be modified with ALTER TABLE sensors ALTER COLUMN temperature SET PARQUET(rle_dictionary, zstd(3)); or reset with ALTER TABLE sensors ALTER COLUMN temperature DROP PARQUET;. Per-column config appears in SHOW CREATE TABLE output. Supported encodings include plain, rle_dictionary, delta_length_byte_array, and delta_binary_packed, with type-specific restrictions. Supported compression codecs include uncompressed, snappy, gzip (0-9), brotli (0-11), zstd (1-22), and lz4_raw. RLE dictionary encoding is now supported for all column types except Boolean and Array. A new cairo.partition.encoder.parquet.min.compression.ratio configuration property (default 1.2) controls whether compressed pages are worth keeping — when a compressed column chunk fails to meet the ratio threshold, the encoder discards the compressed output and stores it uncompressed instead. Varchar dictionary encoding performance was significantly improved by switching from the default hasher to RapidHashMap and storing indices directly in a Vec<u32>, achieving ~87 Melem/s throughput compared to the original ~13 Melem/s.
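The minimum compression ratio check can be sketched with zlib standing in for the configured codec. `encode_chunk` is hypothetical, and defining the ratio as uncompressed size divided by compressed size is an assumption; the note above does not spell out the formula.

```python
import zlib

def encode_chunk(raw: bytes, min_ratio: float = 1.2):
    """Keep the compressed bytes only when they meet the ratio threshold."""
    compressed = zlib.compress(raw)
    # Assumed definition: ratio = uncompressed size / compressed size.
    if len(raw) / len(compressed) >= min_ratio:
        return ("compressed", compressed)
    # Not worth it: store the original bytes and skip decompression on read.
    return ("uncompressed", raw)
```

Highly repetitive data passes the threshold easily, while data with no repeated patterns falls back to uncompressed storage.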

Improvements

WalWriter previously hardcoded POSIX_MADV_RANDOM for memory-mapped column files, which hurts most workloads. This improvement makes the madvise hint configurable via the cairo.wal.writer.madvise.mode property with valid values: none (default, no hint), sequential, and random. The random mode is beneficial when ingesting into many tables with many columns, as it prevents the OS from speculatively reading adjacent pages under memory pressure.
This improvement replaces linear scanning with binary search for initial frame positioning in ASOF JOIN, LT JOIN, and WINDOW JOIN operations. The openSlaveFrame() method in AbstractAsOfJoinFastRecordCursor and the findRowLo()/findRowLoWithPrevailing() methods in WindowJoinTimeFrameHelper now call seekEstimate() on the first lookup to binary-search directly to the target partition instead of linearly scanning all preceding frames. This reduces the initial positioning cost from O(N) in the number of frames to O(log P) where P is the number of partitions, mirroring the optimization already present in HORIZON JOIN.
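The positioning step can be pictured as a binary search over sorted partition start timestamps. This is a simplified sketch: `find_partition` is a hypothetical helper, and the real seekEstimate() may use different inputs and return conventions.

```python
from bisect import bisect_right

def find_partition(partition_starts: list, ts: int) -> int:
    """Index of the last partition whose start timestamp is <= ts, or -1 if none."""
    # bisect_right runs in O(log P) comparisons over the sorted start timestamps.
    return bisect_right(partition_starts, ts) - 1
```

A linear scan would instead visit every preceding frame before reaching the target, which is where the O(N) cost came from.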
This improvement brings significant performance gains to Parquet decoding, with a median improvement of approximately 87% across column type and encoding combinations. Key optimizations include removing unnecessary allocation and zeroing in the decompression path, skipping definition decoding when null_count=0, improved Plain, DeltaBinaryPacked, Rle, and RleDictionary decoder implementations, a lookup table for boolean unpacking, specialized UUID byte swapping, batch decoding for DeltaBinaryPacked, batched nullable bitmap processing, and bulk copying for PlainPrimitiveDecoder. This improvement also fixes a bug with incorrect null sentinel values.
Vectorized execution is now applied to more non-keyed GROUP BY queries, such as SELECT first(price), last(price) FROM trades. Internally, the GroupByFunction#computeBatch() API is used for all queries without a filter. This API was previously used only in WINDOW JOIN.
This improvement introduces an encoded sort path for ORDER BY queries. SortKeyEncoder encodes column values into fixed-width, order-preserving binary keys (8/16/24/32 bytes), enabling comparisons via native uint64 operations instead of per-column type dispatch. The native sort is a three-layer hybrid: vergesort for detecting natural runs (O(n) for pre-sorted time-series data), MSD radix sort (American Flag Sort) with parallel bucket sorting across available cores via atomic work-stealing for large arrays, and pdqsort for small partitions with heapsort fallback guaranteeing O(n log n) worst-case. The entire native sort path is allocation-free with all data structures stack-allocated. It replaces the R-B tree (RecordTreeChain) with flat array collect plus radix sort for the non-random-access path, and guarantees stable sort by falling back to rowId ordering when all sort key columns are equal. Key size is capped at 32 bytes; ORDER BY clauses exceeding this fall back to the existing R-B tree sort. Benchmarks show a 5-column ORDER BY over 3M rows improving from 6.98s to 1.26s, and a 2-column ORDER BY over 1B rows improving from over 15 minutes to 63.6s with parallel sorting.
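Order-preserving key encoding can be illustrated for signed 64-bit integers. This is a generic sketch of the technique and an assumption about how such an encoder might treat integer columns; it is not SortKeyEncoder's actual layout.

```python
import struct

def encode_i64(value: int) -> bytes:
    """Encode a signed 64-bit integer so that byte order matches numeric order."""
    # Adding 2**63 flips the sign bit, so negatives sort before positives;
    # big-endian puts the most significant byte first.
    return struct.pack(">Q", (value + (1 << 63)) & 0xFFFFFFFFFFFFFFFF)

def encode_key(a: int, b: int) -> bytes:
    """Concatenate two encoded columns into one fixed-width 16-byte sort key."""
    return encode_i64(a) + encode_i64(b)
```

Once every row has such a key, a multi-column comparison becomes a plain byte (or unsigned word) comparison, with no per-column type dispatch.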
This improvement replaces string comparison (getSym() + Chars.compare) with rank-based comparison (getInt() + rank map lookup) when sorting SYMBOL columns that have static symbol tables. At cursor open, a rank map maps each symbol key to its alphabetical rank, and the comparator then compares ranks instead of strings. It also eliminates the temporary DirectIntList used when building symbol rank maps by replacing it with in-place permutation inversion after quicksort, reducing extra space from O(N) to O(1) where N is the symbol count, mainly benefiting high-cardinality symbol columns. This optimization applies to all comparator-based sort paths: tree-based sort cursors, top-K cursors, window function tree comparators, and window function internal comparators (rank(), percent_rank()). Non-static symbol columns (e.g. CAST(str AS SYMBOL)) fall back to the original string comparison path.
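Rank-based comparison can be sketched as follows. `build_rank_map` is a hypothetical helper, and for clarity it allocates a separate ranks list rather than performing the in-place permutation inversion described above.

```python
def build_rank_map(symbols: list) -> list:
    """Return ranks where ranks[key] is the sorted position of symbols[key]."""
    # Sort the integer keys by their string values once, at cursor open.
    order = sorted(range(len(symbols)), key=symbols.__getitem__)
    ranks = [0] * len(symbols)
    for rank, key in enumerate(order):
        ranks[key] = rank
    return ranks

def sort_by_symbol(row_keys: list, ranks: list) -> list:
    """Sort rows by symbol using integer rank lookups instead of string compares."""
    return sorted(row_keys, key=ranks.__getitem__)
```

The string comparisons happen once per distinct symbol when the map is built; every subsequent row comparison is an integer lookup.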
This improvement introduces an adaptive scan strategy for HORIZON JOIN's keyed ASOF lookup. The lookup now starts in backward-only mode, which is cheap for low-cardinality key spaces, and adaptively switches to forward scan mode within a frame when backward scan cost becomes excessive — for example, with high-cardinality symbols or rare/infrequent keys that cause deep backward scans. The switch uses two criteria: a relative threshold where backward scan cost at a position must exceed 8x the gap to trigger (with a minimum gap of 1,024 rows), and an absolute threshold of 131,072 rows to handle cross-partition boundaries where the relative check cannot trigger. Once switched, the frame stays in forward mode for its remainder. This also re-enables the ASOF row cache in HorizonJoinTimeFrameHelper and resets bookmarks on toTop() to eliminate jitter from out-of-order frame processing. Performance improvements range from 13-24% for dense and equity scenarios to 96% for sparse data distributions.

Bug Fixes

A LATEST BY ALL query on a large table could fill an OrderedMap until keyCapacity reached 2^30. When rehash() doubled to newKeyCapacity = 1L << 31, the overflow guard allowed it through because MAX_SAFE_INT_POW_2 was incorrectly set to 1L << 31. The subsequent (int) cast produced Integer.MIN_VALUE, and clear() computed ~18.4 EB for native memset, causing a SIGSEGV. This fix corrects Numbers.MAX_SAFE_INT_POW_2 from 1L << 31 to 1L << 30, so the guard rejects the overflow and throws a clean CairoException("map capacity overflow") instead of crashing the JVM. The constant was also deduplicated from Unordered4Map, Unordered8Map, and UnorderedVarcharMap, each of which had a private copy with the same bug.
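The overflow can be reproduced by emulating Java's narrowing cast. The helpers below are hypothetical, and the exact form of the guard comparison is an assumption; only the constants 1L << 30 and 1L << 31 come from the entry above.

```python
def to_java_int(value: int) -> int:
    """Emulate Java's (int) narrowing cast of a long."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value

def next_capacity(key_capacity: int, max_safe_pow_2: int) -> int:
    """Double the capacity, rejecting values the guard considers unsafe."""
    new_capacity = key_capacity << 1  # computed as a long in Java
    if new_capacity > max_safe_pow_2:
        raise OverflowError("map capacity overflow")
    return to_java_int(new_capacity)
```

With the old limit of 1 << 31, doubling 1 << 30 passes the guard and the cast yields Integer.MIN_VALUE; with the corrected limit of 1 << 30, the same call raises a clean error instead.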
The cairo.partition.encoder.parquet.statistics.enabled configuration allows users to disable Parquet statistics, but the read path (ParquetTimestampFinder, TableWriter) and the O3 merge path (O3PartitionJob.processParquetPartition) hard-depended on timestamp column statistics. When statistics were absent, getMinValueLong would hit an assertion crash with -ea or read garbage memory causing silent data corruption without -ea. This fix adds rowGroupMinTimestamp and rowGroupMaxTimestamp methods to PartitionDecoder that try Parquet column statistics first at zero cost, then fall back to decoding the first/last row from actual data pages when statistics are absent. findRowGroupByTimestamp also falls back to decoding instead of reading garbage memory. O3PartitionJob, ParquetTimestampFinder, and TableWriter have been migrated to use the new methods.
This fix resolves a crash that occurred when using SAMPLE BY with FILL(NULL), FILL(value), or FILL(PREV) in queries containing array column aggregates such as last(arr). The fill record types did not implement getArray(), and the null/value fill factories did not handle array column types when constructing fill constants. The fix adds getArray() overrides to SampleByFillRecord and FillRangeRecordCursorFactory.FillRangeRecord, and updates SampleByFillNullRecordCursorFactory and SampleByFillValueRecordCursorFactory to handle array types by yielding NullConstant.NULL, since arrays cannot be filled with scalar values.
When a JOIN query had a WHERE clause containing both a column-referencing condition and a non-column constant expression (e.g., NOW() = NOW()), QuestDB crashed with an internal AssertionError. The SQL optimizer splits the WHERE clause into separate buckets — one for column-referencing conditions and one for constant expressions. The code generator then applied each bucket as a separate FilteredRecordCursorFactory wrapper, but nesting these factories violated an internal assertion. This fix detects when the base factory is already a FilteredRecordCursorFactory, extracts the existing filter, and combines both filters with AND into a single factory instead of nesting them.
ALTER TABLE ... ALTER COLUMN and ALTER MATERIALIZED VIEW ... ALTER COLUMN failed when the column name was quoted (e.g., "MY_COL"). The token from the lexer was used as-is for the metadata lookup, so the quoted form was looked up instead of the unquoted name, resulting in a "column does not exist" error. This fix wraps the token with unquote() before the column name lookup in both compileAlterTable and compileAlterMatView.
The non-vectorized fast cursor (WindowJoinWithPrevailingFastRecordCursor) compared the first window match against the window end instead of the window start, causing the prevailing row to effectively never be included when the window already contained matches. The vectorized variant already used the correct comparison. This fix aligns the non-vectorized path to compare against the window start timestamp. The bug only affected queries using non-vectorizable aggregates (e.g., max(concat(...))) with symbol-keyed WINDOW JOIN and INCLUDE PREVAILING.
The read_parquet() function crashed with SIGSEGV when reading parquet files containing SYMBOL columns encoded by QuestDB's PartitionEncoder. The canProjectMetadata() method passed the actual column type (SYMBOL) to the Rust decoder instead of the expected type (VARCHAR), causing the Rust decoder to write INT32 symbol keys that Java then read as VARCHAR pointers. This fix passes the expected type (VARCHAR) for SYMBOL-to-VARCHAR conversions so the Rust decoder resolves dictionary entries to UTF-8 strings.
This fix addresses two bugs in WINDOW JOIN with INCLUDE PREVAILING. The first bug affected four sync cursor variants that call findRowLo(lo, hi, true): WindowJoinWithPrevailingAndJoinFilterRecordCursor, WindowJoinWithPrevailingAndJoinFilterFastRecordCursor, WindowJoinWithPrevailingFastRecordCursor, and WindowJoinFastVectRecordCursor. When the bookmarked-frame optimization kicked in and the prevailing candidate sat in a prior partition, the bookmarked path skipped initialization of prevailingFrameIndex/prevailingRowIndex, leaving them at -1/Long.MIN_VALUE. Downstream backward scans saw -1 and returned immediately, silently omitting the prevailing row from results. The fix seeds prevailingFrameIndex/prevailingRowIndex when entering via the bookmark path. The second bug was a comparison target error in WindowJoinWithPrevailingFastRecordCursor where the condition that triggers prevailing inclusion checked whether the first row's timestamp exceeded the window high boundary (masterTimestampHi) instead of the window low boundary (slaveTimestampLo).
This fix addresses several correctness, resource leak, and crash issues in the SQL engine. A missing break after the DECIMAL inner switch in generateCastFunctions caused fall-through to the BINARY case, adding a spurious BinColumn to the cast function list and shifting subsequent column indices in multi-column UNION queries, resulting in UnsupportedOperationException at runtime. The moveClauses method never incremented its position counter, so when swapJoinOrder0 needed to move multiple join clauses, only the first was moved — the rest stayed in the wrong join context, creating circular dependencies that broke topological sort on 3+ table cross joins with multi-column WHERE conditions. The hasGroupByFunc/hasOrderedGroupByFunc methods skipped children of non-aggregate functions, making nested aggregates like abs(sum(x)) invisible and causing PIVOT to reject valid aggregate expressions. Resource leaks were fixed in generateJoins where const filters were leaked on error paths, and in generateSampleBy where timezoneNameFunc, offsetFunc, sampleFromFunc, and sampleToFunc were never freed on error paths. A NullPointerException in CREATE MATERIALIZED VIEW was fixed — when WITH BASE specifies a nonexistent table not referenced in the query, getTableTokenIfExists() returned null and .isView() threw NPE instead of a descriptive SqlException. Additional latent bugs were fixed in generateFill (always reading the first fill value instead of iterating) and in GeoHash-to-VARCHAR cast functions (using wrong column type parameter).
This fix corrects a data corruption issue where set_memory_vanilla_vec() delegated to run_vec_bulk, which used TVec::size() as the loop increment. For Vec8uq this returned 8 (the number of uint64_t lanes), but one 512-bit store only covers 4 long_128bit elements or 2 long_256bit elements. The mismatch caused the bulk path to advance past unwritten elements, leaving holes filled with stale data. The fix replaces the broken run_vec_bulk call with a dedicated loop that computes the correct per-store element count from sizeof(TVec) / sizeof(T), guarded by a static_assert.
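The stride mismatch can be illustrated with a small simulation. This is a sketch only, not the native code: `bulk_fill` and `elements_per_store` are hypothetical names, and the loop models one 64-byte store per iteration under the assumption that the buggy increment was always the lane count (8).

```python
STORE_BYTES = 64  # one 512-bit vector store


def elements_per_store(element_bytes: int) -> int:
    """Elements covered by one store: sizeof(TVec) / sizeof(T)."""
    # Mirrors the static_assert: the element size must divide the store width.
    assert STORE_BYTES % element_bytes == 0
    return STORE_BYTES // element_bytes


def bulk_fill(count: int, element_bytes: int, increment: int) -> list:
    """Simulate the bulk loop and return which elements were written."""
    written = [False] * count
    per_store = elements_per_store(element_bytes)
    i = 0
    while i < count:
        # One store writes `per_store` elements starting at index i.
        for j in range(i, min(i + per_store, count)):
            written[j] = True
        i += increment  # the bug: the increment was the lane count, not per_store
    return written


# 16-byte elements (long_128bit): a 64-byte store covers 4 of them, not 8.
buggy = bulk_fill(16, 16, increment=8)
fixed = bulk_fill(16, 16, increment=elements_per_store(16))
holes = [i for i, w in enumerate(buggy) if not w]
```

In this model the buggy stride leaves every second group of four elements unwritten, while the per-store stride covers the whole range.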
This fix defaults cairo.mat.view.parallel.sql.enabled to false on machines with fewer than 4 available processors. All other parallel SQL features (filter, group by, top-k, horizon join, window join, Parquet read) remain enabled regardless of core count. Users can still explicitly enable materialized view parallel SQL via configuration on low-core machines.
This fix resolves an issue where read_parquet failed with "encoding not supported" errors when reading Parquet files whose embedded QuestDB metadata schema had a different number of columns than the actual Parquet data. This happens when an external tool (DuckDB, Spark, PyArrow) rewrites a QuestDB-exported Parquet file — for example, dropping partition columns for hive-style directory layouts — but preserves the original key-value metadata. The decoder now compares the QuestDB metadata schema length against the Parquet column count and discards the metadata when they differ, falling back to physical type inference.

March 3, 2026

This release introduces automatic object store WAL cleanup on the primary node, TLS certificate expiration metrics for Prometheus alerting, and new array_sort() and array_reverse() functions. Performance improvements include faster ASOF and WINDOW JOINs through binary search–based frame positioning, along with a configurable WAL writer madvise mode. Several critical bug fixes address crashes in LATEST BY ALL queries, Parquet reads with missing statistics, and backup restore edge cases, while the ACL permission system has been expanded to support up to 256 permissions.

New Features

This feature introduces a WAL cleaner that runs on the primary node and automatically deletes replicated WAL data from object storage once it is no longer needed by any replica or backup. It determines what is safe to delete by consulting two sources of cleanup history — enterprise backup manifests and checkpoint history records — and always retains enough data to support the most recent N backups or checkpoints. The cleaner is conservative by default: it won't delete anything until sufficient history exists, and it picks the most conservative boundary when multiple sources or cluster nodes are involved. Key components include a checkpoint history tracker that records per-table transaction state to the shared replication object store on each CHECKPOINT RELEASE, a backup instance name registry for coordinating cleanup boundaries across multiple nodes, rate limiting and throttling for object store delete operations with auto-tuned defaults per cloud provider (S3, GCS, Azure Blob, R2, etc.), and crash recovery with periodic progress persistence so cleanup resumes where it left off after a restart. Dropped tables are cleaned up after a cooloff period (default 1h) to guard against clock skew. Key configuration properties include replication.primary.cleaner.enabled (default true), replication.primary.cleaner.interval (default 10m), replication.primary.cleaner.backup.window.count (default 5), replication.primary.cleaner.delete.concurrency (auto-tuned 4–12), replication.primary.cleaner.max.requests.per.second (service-dependent), and checkpoint.history.enabled (default true when replication is enabled).
This feature adds Prometheus gauge metrics for TLS certificate time-to-live (TTL) across all four TLS-enabled endpoints: questdb_tls_cert_ttl_seconds_http, questdb_tls_cert_ttl_seconds_http_min, questdb_tls_cert_ttl_seconds_line, and questdb_tls_cert_ttl_seconds_pg. Each gauge reports seconds until the active certificate expires. Values greater than 0 indicate seconds remaining, 0 means expired, and -1 means the certificate has not been loaded or could not be parsed. Gauges are only registered for endpoints where TLS is enabled. The TTL is computed from the certificate's notAfter field, which is extracted via a JNI call into a minimal DER/X.509 parser on the Rust side. The expiration epoch is cached and updated on reload_tls(), so the metric always reflects the active in-memory certificate, not the one on disk.
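The gauge semantics described above can be sketched as a small function. This is an illustration only: `cert_ttl_seconds` is a hypothetical name, and representing "not loaded or not parseable" as `None` is an assumption of the sketch.

```python
from typing import Optional


def cert_ttl_seconds(not_after_epoch_s: Optional[int], now_epoch_s: int) -> int:
    """Return the gauge value for a certificate's cached notAfter epoch."""
    if not_after_epoch_s is None:
        # Certificate not loaded, or its expiry could not be parsed.
        return -1
    # Clamp at 0 so that an expired certificate reports exactly 0.
    return max(0, not_after_epoch_s - now_epoch_s)
```

An alert that fires when the gauge drops below a chosen threshold should treat -1 separately, since it signals a missing certificate rather than an imminent expiry.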
This feature adds array_sort(DOUBLE[]) and array_reverse(DOUBLE[]) scalar functions that operate on double arrays of any dimensionality. array_sort() sorts each innermost-dimension slice independently, preserving the array's shape, and accepts optional boolean arguments for descending order and nulls-first placement. array_reverse() reverses each innermost-dimension slice. Both functions handle NULL arrays, empty arrays, NaN values, and multidimensional inputs. They support both contiguous unit-stride and non-vanilla array layouts via separate code paths. The sort buffer grows on demand and stays at peak size for the cursor's lifetime to avoid allocation churn on the hot path.
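The per-slice behavior can be modeled with nested Python lists. This is a rough sketch, not the engine's implementation: it treats NaN as the null element, and the default values chosen for the two optional flags are assumptions.

```python
import math


def _is_null(v) -> bool:
    return v is None or math.isnan(v)


def _sort_slice(values, descending, nulls_first):
    # Keep nulls out of the comparison sort and place them explicitly.
    nulls = [v for v in values if _is_null(v)]
    rest = sorted((v for v in values if not _is_null(v)), reverse=descending)
    return nulls + rest if nulls_first else rest + nulls


def array_sort(arr, descending=False, nulls_first=False):
    """Sort each innermost slice independently, preserving the shape."""
    if arr is None:
        return None
    if arr and isinstance(arr[0], list):
        return [array_sort(s, descending, nulls_first) for s in arr]
    return _sort_slice(arr, descending, nulls_first)


def array_reverse(arr):
    """Reverse each innermost slice, preserving the shape."""
    if arr is None:
        return None
    if arr and isinstance(arr[0], list):
        return [array_reverse(s) for s in arr]
    return arr[::-1]
```

For a two-dimensional input such as `[[3.0, 1.0, 2.0], [9.0, 7.0, 8.0]]`, each row is sorted on its own and the outer order of rows is left unchanged.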
This feature adds minTimestamp and maxTimestamp TIMESTAMP columns to sys.telemetry_wal to capture the data timestamp range of each WAL transaction event. WAL telemetry is now enabled by default regardless of the main telemetry setting, reducing reliance on logs when investigating the shape of ingested data. The WalWriter commit log message has been downgraded from info to debug level unless the commit has a replace range. Schema migration support has been added: when the column count does not match the expected schema, the table is dropped and recreated, which is safe given the 1-week TTL.

Improvements

This improvement speeds up the initial frame positioning of ASOF, LT, and WINDOW JOINs by mirroring the optimization already present in HORIZON JOIN. Previously, the first lookup linearly scanned through all slave time frames preceding the master's first timestamp, which is O(N) in the number of frames. With the seekEstimate() optimization, the initial positioning is O(log P), where P is the number of partitions. Specifically, AbstractAsOfJoinFastRecordCursor.openSlaveFrame() now calls seekEstimate() on the first slave frame lookup to binary-search directly to the target partition instead of linearly scanning all preceding frames, benefiting all ASOF JOIN and LT JOIN fast-path factories. WindowJoinTimeFrameHelper.findRowLo() and findRowLoWithPrevailing() also now call seekEstimate() on the first lookup with the same partition-skipping behavior, benefiting both sync and async WINDOW JOIN factories.
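The difference between the two positioning strategies can be sketched as follows. This is an illustration with hypothetical function names; it models frames only by their sorted start timestamps and does not reflect the actual seekEstimate() signature.

```python
from bisect import bisect_right


def first_frame_linear(frame_start_ts, master_ts):
    """Scan frames in order; return (index of last frame starting <= ts, steps taken)."""
    idx, steps = -1, 0
    for i, start in enumerate(frame_start_ts):
        steps += 1
        if start > master_ts:
            break
        idx = i
    return idx, steps


def first_frame_binary(frame_start_ts, master_ts):
    """Binary-search for the same frame in O(log n) comparisons."""
    return bisect_right(frame_start_ts, master_ts) - 1


frames = list(range(0, 1000, 10))  # 100 frames starting at 0, 10, 20, ...
linear_idx, linear_steps = first_frame_linear(frames, 955)
binary_idx = first_frame_binary(frames, 955)
```

Both functions land on the same frame; the linear version visits nearly every frame before the target, which is the cost the binary search avoids.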
WalWriter previously hardcoded POSIX_MADV_RANDOM for memory-mapped column files, which hurts most workloads. This improvement makes the madvise hint opt-in via a new configuration property cairo.wal.writer.madvise.mode with valid values: none (default, no hint), sequential, and random. The random mode is beneficial when ingesting into many tables with many columns, as it prevents the OS from speculatively reading adjacent pages under memory pressure.

Example configuration:
```
cairo.wal.writer.madvise.mode=random
```
This improvement refactors the ACL permission system to support more than 64 permissions by migrating from 64-bit bitmasks to an exponent-based representation with 256-bit aggregate masks. Permission constants changed from long bitmasks to int exponents, and a new PermissionMask class provides 256-bit storage (4 longs) for aggregate permission sets. The permission column type in the database schema changed from long to short (storing exponents instead of bitmasks), reducing storage overhead while supporting up to 256 distinct permissions. PermissionMask.ZERO is now immutable and throws on mutation attempts, and the sentinel value handling for ALL_PERMISSIONS is properly supported across has, set, and clear operations.
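A minimal model of the exponent-based representation is shown below. It is a sketch under stated assumptions: this Python class only borrows the PermissionMask name, its method bodies are illustrative, and the ALL_PERMISSIONS sentinel and ZERO immutability are deliberately left out.

```python
class PermissionMask:
    """256-bit permission set stored as four 64-bit words."""

    WORD_BITS = 64
    WORD_COUNT = 4

    def __init__(self):
        self.words = [0] * self.WORD_COUNT

    def _locate(self, exponent: int):
        # A permission is identified by its exponent (bit position), 0..255.
        if not 0 <= exponent < self.WORD_BITS * self.WORD_COUNT:
            raise ValueError("permission exponent out of range")
        return exponent // self.WORD_BITS, 1 << (exponent % self.WORD_BITS)

    def set(self, exponent: int) -> None:
        word, bit = self._locate(exponent)
        self.words[word] |= bit

    def clear(self, exponent: int) -> None:
        word, bit = self._locate(exponent)
        self.words[word] &= ~bit

    def has(self, exponent: int) -> bool:
        word, bit = self._locate(exponent)
        return (self.words[word] & bit) != 0
```

Storing the exponent instead of the bitmask is what lets a single permission fit in a short: exponent 200 is a small integer, whereas the corresponding bitmask would not fit in a 64-bit long.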

Bug Fixes

This fix corrects Numbers.MAX_SAFE_INT_POW_2 from 1L << 31 to 1L << 30. The old value (2^31) does not fit in a signed 32-bit int, so the rehash overflow guard let exactly 2^31 through. The subsequent (int) cast produced Integer.MIN_VALUE, and clear() fed approximately 18 EB to native memset, causing a SIGSEGV. The crash chain occurred when a LATEST BY ALL query on a large table filled an OrderedMap until keyCapacity reached 2^30, then rehash() doubled to newKeyCapacity = 1L << 31, which truncated to a negative value and passed an enormous size to native memset. The fix makes the guard reject newKeyCapacity = 2^31, throwing a clean CairoException("map capacity overflow") instead of crashing the JVM. The constant was also deduplicated from Unordered4Map, Unordered8Map, and UnorderedVarcharMap, each of which had a private copy with the same bug.
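The overflow chain can be reproduced with plain integer arithmetic. This sketch emulates Java's narrowing cast in Python; `rehash_capacity` is a hypothetical stand-in, and the exact form of the guard comparison is an assumption inferred from the description.

```python
def to_int32(value: int) -> int:
    """Mimic Java's narrowing (int) cast of a long."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def as_size_t(value: int) -> int:
    """Reinterpret a signed 64-bit value as an unsigned 64-bit size."""
    return value & 0xFFFFFFFFFFFFFFFF


OLD_MAX_SAFE_INT_POW_2 = 1 << 31  # does not fit in a signed 32-bit int
NEW_MAX_SAFE_INT_POW_2 = 1 << 30


def rehash_capacity(key_capacity: int, max_safe: int) -> int:
    new_capacity = key_capacity * 2
    if new_capacity > max_safe:  # assumed shape of the overflow guard
        raise OverflowError("map capacity overflow")
    return to_int32(new_capacity)


# With the old constant, doubling 2^30 slips through and wraps negative.
wrapped = rehash_capacity(1 << 30, OLD_MAX_SAFE_INT_POW_2)
# Sign-extended and read as an unsigned size, that is roughly 1.8e19 bytes.
huge = as_size_t(wrapped)
```

With the corrected constant the same call raises instead of returning a negative capacity, which corresponds to the clean exception described above.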
The cairo.partition.encoder.parquet.statistics.enabled configuration allows users to disable Parquet statistics, but the read path (ParquetTimestampFinder, TableWriter) and the O3 merge path (O3PartitionJob.processParquetPartition) hard-depended on timestamp column statistics. When statistics were absent, getMinValueLong would hit an assertion crash with -ea or read garbage memory causing silent data corruption without -ea. This fix removes that hard dependency by adding rowGroupMinTimestamp and rowGroupMaxTimestamp methods to PartitionDecoder that try Parquet column statistics first at zero cost, then fall back to decoding the first/last row from actual data pages when statistics are absent. The findRowGroupByTimestamp method also falls back to decoding instead of reading garbage memory, and O3PartitionJob, ParquetTimestampFinder, and TableWriter have been migrated to use the new methods.
During backup, partitions with row_count=0 may not produce meta.msgpack. The restore process previously always attempted to download this file when hash verification was enabled, causing restores to fail with "no partition metadata found" for empty partitions. This fix skips downloading meta.msgpack for empty partitions during restore and skips hash verification in that case, while still requiring metadata for non-empty partitions.
A primary instance restored from a backup could encounter a race condition between the WalPurgeJob and dropped table request processing, where the uploader could not open the _txnlog because it had already been deleted. This fix ensures that the WalPurgeJob does not delete state that the uploader still needs. The bug was specific to restored backups, since backups do not include the .pending files used to control this workflow. At start-up, the WalUploader now recreates the appropriate .pending file when it is missing, which also repairs instances that were already restored with older versions of QuestDB Enterprise.
This fix introduces three new ACL permissions: ALTER SYMBOL CAPACITY (column-level), SET REFRESH LIMIT (table-level), and SET REFRESH TYPE (table-level). ALTER SYMBOL CAPACITY replaces the incorrect reuse of ALTER COLUMN TYPE for symbol capacity changes. A startup migration automatically grants ALTER SYMBOL CAPACITY to every entity that previously had ALTER COLUMN TYPE, at the same scope (database, table, or column level), preserving grant options. SET REFRESH LIMIT and SET REFRESH TYPE gate the previously unprotected ALTER MATERIALIZED VIEW ... SET REFRESH LIMIT/IMMEDIATE/MANUAL/EVERY/PERIOD operations. This fix also wires in the previously commented-out authorization check for ALTER TABLE SET PARAM and adds a retry loop with recompile in access list reloading to handle table reference out-of-date exceptions during ACL reload.

February 25, 2026

QuestDB 9.3.3 is a feature-rich release introducing HORIZON JOIN for markout analysis, a new twap() aggregate, SQL-standard WINDOW definitions, JIT compilation on ARM64, and file-based secrets for Kubernetes deployments. It also brings significant performance improvements across Parquet I/O, parallel GROUP BY, UNION queries, and ORDER BY on computed expressions.

New Features

Sensitive configuration options can now be loaded from files using the _FILE suffix convention, enabling seamless integration with Kubernetes Secrets, Docker Secrets, HashiCorp Vault, and other file-based secrets management systems. For example, setting QDB_PG_PASSWORD_FILE=/run/secrets/pg_password as an environment variable or pg.password.file=/run/secrets/pg_password in server.conf will load the password from the specified file. This feature works with all sensitive properties (pg.password, pg.readonly.password, http.password). File contents are automatically trimmed of whitespace. SHOW PARAMETERS displays value_source = 'file' for secrets loaded from files, and secrets are reloaded when file contents change via SELECT reload_config().
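The lookup convention can be sketched as below. This is a simplified model, not the server's logic: `resolve_secret` is a hypothetical helper, the precedence of the _FILE variant over a directly set value is an assumption, and the "env" and "default" labels are illustrative.

```python
import os
import tempfile


def resolve_secret(env: dict, name: str):
    """Return (value, value_source) for a sensitive property."""
    file_path = env.get(name + "_FILE")
    if file_path is not None:
        with open(file_path, encoding="utf-8") as fh:
            # File contents are trimmed of surrounding whitespace.
            return fh.read().strip(), "file"
    if name in env:
        return env[name], "env"
    return None, "default"


# Demo with a temporary file standing in for /run/secrets/pg_password.
with tempfile.TemporaryDirectory() as tmp:
    secret_path = os.path.join(tmp, "pg_password")
    with open(secret_path, "w", encoding="utf-8") as fh:
        fh.write("  s3cret\n")
    from_file = resolve_secret({"QDB_PG_PASSWORD_FILE": secret_path}, "QDB_PG_PASSWORD")
```

Trimming matters in practice because secret files written by tools often end with a trailing newline.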
This feature allows underscores as number separators in TICK date expressions (e.g., $now-10_000T) and duration suffixes (e.g., ;1_500T), consistent with Numbers.parseInt() which already supports this. Error position reporting in date expression evaluation has also been improved to point at the specific offending location instead of the start of the expression. Invalid underscore placement (leading, trailing, consecutive) is properly detected and reported.
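The placement rules can be expressed as a short validation routine. This is a sketch of the rule set only; `parse_int_with_underscores` is a hypothetical name and the error messages are illustrative.

```python
def parse_int_with_underscores(text: str) -> int:
    """Parse digits with '_' separators, rejecting misplaced underscores."""
    if not text:
        raise ValueError("empty number")
    if text[0] == "_" or text[-1] == "_":
        raise ValueError("leading or trailing underscore")
    if "__" in text:
        raise ValueError("consecutive underscores")
    digits = text.replace("_", "")
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("not a number: " + text)
    return int(digits)
```

Under these rules `10_000` and `1_500` parse as 10000 and 1500, while `_1`, `1_`, and `1__0` are rejected.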
This feature enables QuestDB's JIT code generation to run natively on ARM64 systems in addition to x86. The changes introduce ARM64-specific implementations for register handling, caches, and code generation, while maintaining full compatibility with the existing x86 logic. Conditional compilation is used throughout to select the appropriate architecture at build time. ARM64-specific versions of Function and CountOnlyFunction structs were implemented, including methods for code generation, register setup, and loop logic that mirror the x86 implementations but use ARM64 instructions and register types.
This feature introduces the WINDOW clause, allowing users to define named window specifications that can be reused across multiple window functions within a single query. Window inheritance following the SQL standard is also supported, where a named window can reference another named window as its base, inheriting PARTITION BY, ORDER BY, and frame clauses. Chained inheritance is supported (e.g., w3 references w2, which references w1). Merge rules follow the SQL standard: PARTITION BY is always inherited from the base (child cannot specify its own), ORDER BY in the child takes precedence if specified, and frame clauses in the child take precedence if non-default. Validation ensures base windows must be defined earlier in the WINDOW clause, PARTITION BY in a child window referencing a base is rejected, and circular and self-references are prevented.
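The merge and validation rules can be modeled in a few lines. This is a sketch: the `WindowSpec` type and `resolve` function are hypothetical, a frame of `None` stands in for the default frame, and the dictionary's insertion order represents the order of definitions in the WINDOW clause.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WindowSpec:
    base: Optional[str] = None
    partition_by: Optional[list] = None
    order_by: Optional[list] = None
    frame: Optional[str] = None  # None means the default frame


def resolve(name: str, windows: dict) -> WindowSpec:
    """Flatten a named window by merging it with its chain of bases."""
    names = list(windows)
    spec = windows[name]
    if spec.base is None:
        return spec
    # Requiring an earlier definition also rules out self and circular references.
    if spec.base not in windows or names.index(spec.base) >= names.index(name):
        raise ValueError("base window must be defined earlier")
    if spec.partition_by is not None:
        raise ValueError("PARTITION BY is not allowed with a base window")
    base = resolve(spec.base, windows)
    return WindowSpec(
        base=None,
        partition_by=base.partition_by,  # always inherited
        order_by=spec.order_by if spec.order_by is not None else base.order_by,
        frame=spec.frame if spec.frame is not None else base.frame,
    )


windows = {
    "w1": WindowSpec(partition_by=["sym"], order_by=["ts"]),
    "w2": WindowSpec(base="w1", frame="ROWS BETWEEN 1 PRECEDING AND CURRENT ROW"),
    "w3": WindowSpec(base="w2", order_by=["price"]),
}
w3 = resolve("w3", windows)
```

Here `w3` ends up with the PARTITION BY from `w1`, the frame from `w2`, and its own ORDER BY.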
This feature introduces the twap(price, timestamp) aggregate function that computes the time-weighted average price using step-function integration, where each price is held constant until the next observation and the TWAP is the area under this step function divided by the total time span. The function supports parallel GROUP BY execution via per-worker native buffers that are merge-sorted during the merge phase, since workers process non-adjacent page frames via work-stealing and an incremental weighted-sum approach would incorrectly bridge gaps between frames. It falls back to a simple arithmetic mean when all observations share the same timestamp. SAMPLE BY with FILL modes is also supported.
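The step-function integration can be sketched as follows. This is a single-threaded illustration, not the parallel implementation: the `twap` function below takes (timestamp, price) pairs, and giving the final observation zero weight is an assumption that follows from dividing by the span between the first and last timestamps.

```python
def twap(observations):
    """Time-weighted average price over (timestamp, price) pairs."""
    obs = sorted(observations, key=lambda o: o[0])
    if not obs:
        return None
    span = obs[-1][0] - obs[0][0]
    if span == 0:
        # All observations share one timestamp: plain arithmetic mean.
        return sum(price for _, price in obs) / len(obs)
    # Each price is held constant until the next observation.
    area = sum(p0 * (t1 - t0) for (t0, p0), (t1, _) in zip(obs, obs[1:]))
    return area / span
```

For observations (0, 10.0), (10, 20.0), (30, 0.0) the area is 10·10 + 20·20 = 500 over a span of 30, so the result is about 16.67. Sorting first is also why partial per-worker sums cannot simply be added: intervals between non-adjacent frames would be bridged incorrectly.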
This feature adds support for all six decimal types (DECIMAL8, DECIMAL16, DECIMAL32, DECIMAL64, DECIMAL128, and DECIMAL256) in the Parquet format. Write support stores decimals as fixed-length byte arrays in big-endian format per the Parquet specification. Read support includes a WordSwapDecimalColumnSink for DECIMAL128 and DECIMAL256 that correctly converts from Parquet's big-endian format by reversing each 8-byte word independently. The supported Parquet physical types are INT32 for small decimals (DECIMAL8/16/32), INT64 for DECIMAL64, and FixedLenByteArray for all decimal types.
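The per-word byte reversal can be demonstrated on a 128-bit value. This is a sketch under an assumed layout: it supposes the in-memory form is the high 64-bit word followed by the low word, each little-endian, which is what reversing every 8-byte word of a big-endian value yields. The function names are hypothetical.

```python
def word_swap(be_bytes: bytes) -> bytes:
    """Reverse the bytes of each 8-byte word independently."""
    assert len(be_bytes) % 8 == 0
    return b"".join(be_bytes[i:i + 8][::-1] for i in range(0, len(be_bytes), 8))


def decode_decimal128(be_bytes: bytes) -> int:
    """Recover the unscaled value from a 16-byte big-endian Parquet decimal."""
    swapped = word_swap(be_bytes)
    high = int.from_bytes(swapped[:8], "little", signed=True)
    low = int.from_bytes(swapped[8:], "little", signed=False)
    return high * (1 << 64) + low


unscaled = -12345678901234567890123
encoded = unscaled.to_bytes(16, "big", signed=True)  # Parquet FIXED_LEN_BYTE_ARRAY form
decoded = decode_decimal128(encoded)
```

Reversing the whole 16-byte buffer at once would also swap the word order, which is why the words are handled one at a time in this model.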
This feature introduces array_build(nDims, size, filler1, ...) that creates DOUBLE[] or DOUBLE[][] arrays with controlled shape and fill values. The size parameter accepts a scalar integer or a DOUBLE[] (using its cardinality). Each filler can be a scalar (repeated for all elements) or a DOUBLE[] (copied element-by-element, NaN-padded if shorter, truncated if longer).
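The size and filler rules for a single row can be sketched as below. This is an illustration of the fill semantics only: `fill_row` is a hypothetical helper and does not model how nDims maps fillers onto the output array.

```python
import math

NAN = float("nan")


def fill_row(size, filler):
    """Build one row of doubles from a size and a scalar or array filler."""
    # The size is either a scalar integer or an array whose cardinality is used.
    n = len(size) if isinstance(size, list) else int(size)
    if isinstance(filler, list):
        # Copy element by element: NaN-pad if shorter, truncate if longer.
        return (filler + [NAN] * n)[:n]
    # A scalar filler is repeated for every element.
    return [float(filler)] * n
```

For example, `fill_row(4, [1.0, 2.0])` yields two values followed by two NaNs, and `fill_row(2, [1.0, 2.0, 3.0])` drops the third value.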
This feature introduces HORIZON JOIN, a specialized time-series join designed for markout analysis — a common financial analytics pattern where you need to analyze how prices or metrics evolve at specific time offsets relative to events such as trades or orders. For each row in the left-hand table and each offset in the horizon, the join computes left_timestamp + offset and performs an ASOF match against the right-hand table. When join keys are provided via ON, only right-hand rows matching the keys are considered. The horizon is defined using either a RANGE FROM <from> TO <to> STEP <step> clause for uniform offsets, or a LIST (<offset>, ...) clause for explicit non-uniform offsets, both aliased with AS. The pseudo-table exposes .offset (LONG) and .timestamp (TIMESTAMP) columns. Both RANGE and LIST use the same interval expression syntax as SAMPLE BY with supported units: U (microseconds), T (milliseconds), s (seconds), m (minutes), h (hours), d (days). Use cases include markout P&L analysis, event impact studies, and time-series correlation at different lags. Current limitations include no combination with other joins at the same query level, no right-hand side WHERE filters, both tables requiring a designated timestamp, positive STEP with FROM ≤ TO for RANGE, and monotonically increasing offsets for LIST.
```
SELECT h.offset / 1000000 AS horizon_sec, t.sym, avg(m.mid) AS avg_mid
FROM trades AS t
HORIZON JOIN mid_prices AS m ON (t.sym = m.sym)
RANGE FROM 1s TO 60s STEP 1s AS h
ORDER BY t.sym, horizon_sec
```
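The matching semantics can be modeled outside the database as follows. This is a conceptual sketch, not the engine's algorithm: `horizon_join` is a hypothetical function, rows are plain tuples, and the right-hand rows are assumed to be sorted by timestamp.

```python
from bisect import bisect_right


def horizon_join(left_rows, right_rows, offsets):
    """left_rows: (ts, key); right_rows: (ts, key, value), sorted by ts."""
    by_key = {}
    for ts, key, value in right_rows:
        times, values = by_key.setdefault(key, ([], []))
        times.append(ts)
        values.append(value)
    out = []
    for ts, key in left_rows:
        times, values = by_key.get(key, ([], []))
        for offset in offsets:
            # ASOF match: latest right-hand row at or before ts + offset.
            i = bisect_right(times, ts + offset) - 1
            out.append((ts, key, offset, values[i] if i >= 0 else None))
    return out


trades = [(100, "A")]
mids = [(90, "A", 1.0), (105, "A", 2.0), (106, "B", 9.0), (120, "A", 3.0)]
result = horizon_join(trades, mids, offsets=[0, 10, 20])
```

Each left-hand row produces one output row per offset, so the result size is the number of left rows multiplied by the number of offsets, before any aggregation.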
This feature adds a Table Details Drawer to the Web Console for tracking the health and details of a table or materialized view. It simplifies the table schema explanation flow by removing flow-specific schema and adding streaming support, along with a new flow for table health issues. An adaptive polling mechanism checks request latency proactively and sets the interval dynamically. WAL tables polling has been removed except for the suspension dialog. The feature also includes monitoring support in the AI Assistant docs, updated tooltip behavior and styling throughout the application, query truncation support for LiteEditor, history and navigation among right-hand-side bar drawers, and a mechanism to copy table and column names using Ctrl/Cmd+C.
This feature adds import and export support for editor tabs, makes tabs scrollable with an increased minimum width, adds a rename button that appears on tab hover, and updates the editor tab styling.

Improvements

This improvement introduces three optimizations for Parquet partition reads. Late materialization decodes only filter columns first to identify matching rows, then decodes remaining columns only for those rows, significantly reducing decoding overhead for low-selectivity queries. Zero-copy mmap page reading uses a new SlicePageReader that reads Parquet pages directly from the memory-mapped byte slice, bypassing the previous approach of copying page data into an intermediate buffer. Raw array encoding is now enabled by default for partition-to-Parquet conversion, avoiding the overhead of Parquet's nested LIST decoding. In benchmarks, an OHLC aggregation query on an 8-day Parquet partition improved from 600ms to 250ms.
This improvement speeds up Parquet export through two optimizations. SIMD-accelerated encoding uses portable SIMD intrinsics for nullable INT/LONG columns, processing 64 values per iteration for vectorized definition level encoding, yielding up to 16.5% higher throughput and 20% less CPU usage with LZ4_RAW compression. A new streaming mode for TableReader applies MADV_SEQUENTIAL on mmap for aggressive read-ahead prefetching and MADV_DONTNEED before munmap for immediate page cache release. Streaming mode bypasses the mmap cache so each partition mapping is independent and fully releasable. Parquet exporters automatically enable streaming mode. Under memory pressure, the combination of both madvise hints recovers 94% of baseline performance (387 MB/s vs 82 MB/s without hints).
This improvement pushes designated-timestamp filters from outside a UNION / UNION ALL / EXCEPT / INTERSECT subquery into each branch of the set operation, enabling per-branch partition pruning on time-series tables. The optimizer only pushes filters that exclusively reference the designated timestamp column, deliberately excluding non-timestamp filters to avoid type mismatches and column resolution issues across branches with different schemas. The outer filter always stays on the wrapper model, so correctness is unchanged. For composite filters like WHERE ts IN '2025-12-01T01;2h' AND x > 5, the optimizer splits the top-level AND into separate conjuncts and processes each independently: it pushes the timestamp conjunct into branches, while x > 5 stays at the parent level only. Safety mechanisms include per-branch semantic guards that skip branches with LATEST BY, LIMIT, or non-pushable SAMPLE BY, column existence checks, and deep cloning to prevent cross-branch mutation.
This improvement adds an unordered mode to PageFrameSequence that uses SOUnboundedCountDownLatch instead of ordered collection, eliminating head-of-line blocking for reducers that don't need ordered results. Keyed GROUP BY, non-keyed GROUP BY, and Top K factories now use this unordered mode, replacing boilerplate ordered collect loops with a single dispatchAllAndAwait() call. Error field copying in storeError() was also fixed to use a dedicated CairoException instance, preventing thread-local exception recycling from corrupting stored error data. At concurrency 8, per-iteration spread (a direct proxy for head-of-line blocking) was reduced by 53–75% across query types, tail latency (p99) improved by 30–44%, and latency predictability (p99/p50 ratio) improved from 2.5–3.6x down to 1.4–1.8x. Throughput at concurrency 8 improved by 6–9% for most query types, and throughput now scales monotonically from concurrency 1 to 2 instead of regressing as it did in the previous implementation.
CASE expressions on symbol columns with static symbol tables now resolve string constants to integer symbol keys at initialization time and compare by integer at runtime, avoiding per-row string comparisons. Three picker specializations are used based on branch count: single-branch (one equality check), dual-branch (two equality comparisons), and multi-branch (integer-object hash map lookup). When a WHEN value is not found in the symbol table, the branch is silently skipped. Non-static symbol tables (e.g., from casts) fall back to the existing character sequence-keyed comparison.
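The initialization-time resolution can be sketched as below. This is a simplified model: `compile_case` is a hypothetical function, the symbol table is a plain dictionary from string to integer key, and only the hash-map picker is shown.

```python
def compile_case(symbol_table: dict, branches: list, default):
    """Resolve WHEN string constants to symbol keys once, up front."""
    key_to_result = {}
    for value, result in branches:
        key = symbol_table.get(value)
        if key is None:
            # Value absent from the symbol table: the branch can never match.
            continue
        key_to_result.setdefault(key, result)  # the first matching WHEN wins

    def pick(symbol_key: int):
        # Runtime path: one integer lookup, no string comparison per row.
        return key_to_result.get(symbol_key, default)

    return pick


symbols = {"BTC-USD": 0, "ETH-USD": 1, "SOL-USD": 2}
pick = compile_case(
    symbols,
    [("BTC-USD", "btc"), ("DOGE-USD", "doge"), ("ETH-USD", "eth")],
    default="other",
)
```

The `DOGE-USD` branch is dropped during compilation because the symbol table has no such entry, so rows are never compared against it.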
This improvement replaces the Dragon4 double-to-string algorithm with Ryu (Ulf Adams, 2018), which produces minimal-length, correctly-rounded decimal representations with better performance. A direct double-to-decimal conversion path (Numbers.doubleToDecimal) bypasses string formatting entirely, removing the double→string→decimal roundtrip in CastDoubleToDecimalFunctionFactory. The dependency on jdk.internal.math.FDBigInteger has been eliminated, removing the --add-exports java.base/jdk.internal.math compiler flag.
When sorting by computed expressions (e.g., ORDER BY a + b * c), QuestDB's SortedLightRecordCursor previously re-evaluated the expression on every comparison during tree insertion and output, meaning sort key functions were called O(N log N) times during sorting plus O(N) during output. This improvement adds a sort key materialization layer that pre-computes expensive sort key values into off-heap buffers before sorting, reducing function evaluations from O(N log N) to O(N). Each function gets a complexity score that propagates through the expression tree. If any sort key column exceeds the configured complexity threshold (default 3, configurable via cairo.sql.sort.key.materialization.threshold), that column's values are pre-computed and stored in per-column memory buffers. A materializing record intercepts reads for materialized columns from the buffer and delegates to the base record for everything else. The feature supports two-pass operation where subsequent passes after toTop() look up existing ordinals without re-computing values. This optimization applies to fixed-size types only (no string/varchar/binary materialization) and operates on the general sort path only (no Top-K modification).
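The effect on the number of key evaluations can be shown with a counting key function. This is a sketch of the general idea only: it uses Python's built-in sort instead of the tree-based cursor, and all names are hypothetical.

```python
from functools import cmp_to_key


class CountingKey:
    """Sort key a + b * c that counts how often it is evaluated."""

    def __init__(self):
        self.calls = 0

    def __call__(self, row):
        self.calls += 1
        a, b, c = row
        return a + b * c


def sort_reevaluating(rows, key_fn):
    # Re-evaluates the expression for both operands on every comparison.
    def compare(x, y):
        kx, ky = key_fn(x), key_fn(y)
        return (kx > ky) - (kx < ky)
    return sorted(rows, key=cmp_to_key(compare))


def sort_materialized(rows, key_fn):
    # Pre-compute each key once into a buffer, then sort row ordinals.
    keys = [key_fn(row) for row in rows]
    order = sorted(range(len(rows)), key=keys.__getitem__)
    return [rows[i] for i in order]


rows = [((i * 37) % 101, i % 7, i % 5) for i in range(100)]
naive_key, materialized_key = CountingKey(), CountingKey()
naive_sorted = sort_reevaluating(rows, naive_key)
materialized_sorted = sort_materialized(rows, materialized_key)
```

Both functions return the same ordering, but the materialized variant evaluates the expression exactly once per row.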
This improvement replaces the digit-by-digit loop in doubleToDecimal() with a single wide multiply via a new ofDigitsAndPower(long digits, int power) method on the Decimal interface. The old code extracted up to 17 individual digits from the Ryu significand via long division, then performed 17 wide additions (128-bit or 256-bit). The new code multiplies the significand by a single power of 10. Benchmarks show Decimal64 is 1.8x faster, Decimal128 is 2.2x faster, and Decimal256 is 2.8x faster. The three types converge to roughly the same cost since the single multiply dominates and the word-width difference matters less for one operation than for 17.
This improvement introduces a hybrid export mode for Parquet export (COPY ... TO and HTTP /exp) that passes through raw page-frame-backed columns zero-copy and materializes only computed columns into native buffers, row by row per frame. Queries with no page-frame backing (e.g., cross joins) now use a cursor-based export that materializes all columns row by row, also avoiding the temporary table detour. Only queries requiring re-partitioning (PARTITION BY override) or containing a computed BINARY column still use the temporary table path. The export mode is determined by inspecting the compiled RecordCursorFactory before constructing any temporary table: DIRECT_PAGE_FRAME for factories supporting PageFrameCursor directly (zero-copy), PAGE_FRAME_BACKED for virtual record cursor factories whose base supports page frames with no computed BINARY columns, CURSOR_BASED for no page-frame backing or descending ORDER BY with computed columns, and TEMP_TABLE for PARTITION BY overrides or computed BINARY columns. The HybridColumnMaterializer handles both PAGE_FRAME_BACKED and CURSOR_BASED modes, converting computed SYMBOL columns to STRING in adjusted metadata while dispatching via source type for correct record accessor usage. A buffer pool recycles computed-column native buffers rather than freeing them after each row group flush. This improvement also fixes several resource lifecycle issues on error paths including page-frame cursor leaks, temp directory leaks on moveFile() failure, and use-after-free on error path cleanup. The exporter hierarchy was refactored so that BaseParquetExporter holds shared state while HTTPSerialParquetExporter and SQLSerialParquetExporter extend it independently, removing the prior dependency where the SQL exporter extended the HTTP exporter.

Bug Fixes

When QuestDB fails to start due to configuration errors, a race condition between the JVM shutdown hook and the startup script could produce confusing cat and rm error messages for the hello.txt file. This fix redirects stderr to /dev/null for those commands and adds proper quoting around file paths for robustness with paths containing spaces. The actual startup failure details remain available in the stdout-*.txt log files.
Concurrent Parquet exports could produce corrupted or truncated files when multiple clients exported simultaneously through the same HTTP worker. The issue occurred because ExportQueryProcessor stored per-connection state in per-processor fields. When a worker parked one connection via PeerIsSlowToReadException and began serving another, these fields were overwritten, causing the onWrite() callback to write Parquet chunks to the wrong response on resume. This fix moves the affected fields to per-connection ExportQueryProcessorState.
This fix corrects greedy nanosecond parsing (single N followed by non-digit) to be consistent with milliseconds and microseconds. Previously, .SSSUUUN with input .1234567 produced .123456007Z (7ns) instead of the correct .123456700Z (700ns).
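The corrected behavior amounts to right-padding the fractional digits so that each digit keeps its decimal place value. The sketch below is a minimal illustration in Python, not the parser's actual code; the function name fraction_to_nanos is hypothetical.

```python
def fraction_to_nanos(digits: str) -> int:
    """Convert the digits after the decimal point into nanoseconds."""
    if not (digits.isascii() and digits.isdigit()) or len(digits) > 9:
        raise ValueError(f"expected 1 to 9 ASCII digits, got {digits!r}")
    # Right-pad to nine digits so a trailing "7" in the seventh
    # position means 700 ns rather than 7 ns.
    return int(digits.ljust(9, "0"))


print(fraction_to_nanos("1234567"))  # 123456700
```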
This fix improves the reliability of memory-mapped file handling for columns on Windows platforms. A fallback mechanism for mapping files on Windows has been introduced, which gracefully handles transient permission errors ("Access Denied" even after successful file open) by falling back to an anonymous memory map populated by reading the file contents directly. On non-Windows platforms, the original mapping logic is retained.
This fix addresses an issue where DynamicPropServerConfiguration.reload() accumulated stale entries in changedKeys across reload cycles when only a secret file changed while the properties file remained unchanged. This caused watchers to be spuriously notified about properties that did not actually change. The root cause was that changedKeys.clear() only lived inside updateSupportedProperties(), which is skipped when the properties file hasn't changed. The fix moves changedKeys.clear() to the top of reload(), inside the synchronized block, so every reload cycle starts fresh.
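The pattern behind this fix can be sketched as follows. This is a simplified Python model, not the Java implementation: the class ReloadableConfig and its fields are hypothetical stand-ins, and the only point illustrated is that the changed-key set is cleared at the top of every reload, under the lock.

```python
import threading
from typing import Dict, Set


class ReloadableConfig:
    """Hypothetical model of a config that reloads properties and secrets."""

    def __init__(self, props: Dict[str, str], secrets: Dict[str, str]):
        self._lock = threading.Lock()
        self._props = dict(props)
        self._secrets = dict(secrets)
        self.changed_keys: Set[str] = set()

    def reload(self, props: Dict[str, str], secrets: Dict[str, str]) -> Set[str]:
        with self._lock:
            # Clear first, so a cycle that skips the properties diff
            # cannot carry over keys from an earlier cycle.
            self.changed_keys.clear()
            if props != self._props:
                self._diff(self._props, props)
                self._props = dict(props)
            if secrets != self._secrets:
                self._diff(self._secrets, secrets)
                self._secrets = dict(secrets)
            return set(self.changed_keys)

    def _diff(self, old: Dict[str, str], new: Dict[str, str]) -> None:
        for key in old.keys() | new.keys():
            if old.get(key) != new.get(key):
                self.changed_keys.add(key)
```

With this ordering, a reload in which only a secret changed reports only that secret's key.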
This fix resolves an issue where the ALTER TABLE t ADD COLUMN IF NOT EXISTS col <type> path did not fully resolve parameterized or compound column types before comparing with existing metadata. For DECIMAL and GEOHASH, the path used ColumnType.typeOf() which returns the base type constant, but metadata stores the fully encoded type (with precision/scale/bits). For array types (e.g., DOUBLE[]), the path did not parse array dimensionality brackets, causing type mismatch errors and unconsumed tokens. Unsupported array element types (e.g., INT[]) now correctly report "unsupported array element type" instead of falling through to a misleading error, and unmatched ] brackets after a type name are now detected with a helpful error pointing at the bracket.
This fix adds post-compilation validation in the SQL compiler to reject unconsumed tokens after a valid statement. Previously, trailing content was silently ignored, which could mask user errors. PostgreSQL Wire Protocol compatibility no-op handlers (RESET, CLOSE, UNLISTEN, DISCARD) now require their expected arguments instead of silently accepting bare keywords. BEGIN, COMMIT, and ROLLBACK now properly consume the optional TRANSACTION keyword. The SET statement syntax is now validated to conform to SET [SESSION | LOCAL] name { = | TO } value [, value]*, rejecting malformed forms such as missing name, invalid operator, missing value, and dangling commas.
This fix resolves two related bugs in SqlOptimiser.propagateTopDownColumns0() that caused AssertionError in SqlCodeGenerator.checkIfSetCastIsRequired when UNION sibling models ended up with different topDownColumns counts. The first bug occurred when WHERE clause and timestamp column literals were emitted to union model branches by name resolution; since UNION matches columns by position rather than name, the same alias could resolve to different column indices in different branches, causing one branch to receive extra top-down columns. The fix removes the name-based emission loops in favor of the existing index-based propagation. The second bug occurred when a GROUP BY or SAMPLE BY model as the first UNION member added non-aggregate key columns to its own topDownColumns without propagating them to union siblings. The fix ensures these columns are propagated to all union siblings whose topDownColumns are already populated.
This fix addresses performance degradation when exporting tables with 1M+ distinct symbols to Parquet. Previously, the large default batch size of 1M rows from the general CREATE TABLE AS SELECT setting prevented frequent batch commits, causing symbol index re-scaling to be deferred and degrading performance as the symbol table grew without capacity adjustments. A new configuration property cairo.parquet.export.batch.size with a default of 100K rows is now used specifically for Parquet exports.
When CREATE TABLE ... AS (SELECT ...) fails during data population with a non-Cairo exception, the partially created table and its name lock were not cleaned up. This also affects COPY parquet export, which uses temp tables internally. This fix adds fallback cleanup in COPY parquet export to match the existing HTTP export pattern.
This fix reads the len byte in DecimalBinaryFormatParser as unsigned (& 0xFF) instead of signed. A malformed Influx Line Protocol message with a high len byte (e.g. 0x80) was interpreted as negative, which skipped the VALUES parsing state and left unscaledValues empty. The subsequent load() call then hit an ArrayIndexOutOfBoundsException accessing index 0 of the empty list.
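The signed versus unsigned difference is easy to reproduce outside Java. The sketch below emulates a Java byte read in Python; the helper names are hypothetical and the buffer is an illustrative one-byte message, not a real ILP payload.

```python
def read_len_signed(buf: bytes, offset: int) -> int:
    # Mimics a Java `byte` read: values 0x80..0xFF come back negative.
    return int.from_bytes(buf[offset:offset + 1], "big", signed=True)


def read_len_unsigned(buf: bytes, offset: int) -> int:
    # Masking with 0xFF recovers the intended 0..255 range.
    return read_len_signed(buf, offset) & 0xFF


malformed = bytes([0x80])
print(read_len_signed(malformed, 0))    # -128
print(read_len_unsigned(malformed, 0))  # 128
```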
When a view name is quoted (single or double quotes), SqlParser passed the quoted token directly to getTableTokenIfExists(), which failed to find the view. This affected three code paths: SELECT ... FROM 'my_view', SELECT ... JOIN 'my_view' ON ..., and COMPILE VIEW 'my_view'. This fix adds unquote() to all three getTableTokenIfExists() calls so the view is recognized regardless of quoting.
When a CASE/WHEN expression contains a window function (e.g., lag() OVER (...)), column references outside the window function (in THEN/ELSE branches) were not emitted to the translating model. This caused "invalid column" errors when using the original column name alongside a SELECT alias (e.g., price AS p then THEN price). This fix introduces replaceWindowFunctionOrLiteral() that chains window function replacement with literal emission in a single tree traversal pass, so all column references are propagated through the model chain.
This fix introduces CompiledTickExpression, which pre-parses tick expressions containing date variables ($now, $today, $yesterday, $tomorrow) into a single long[] intermediate representation at compile time. Runtime evaluation performs only long arithmetic with no string parsing or allocations. A new DateVariableExpr encodes and evaluates $variable ± offset expressions supporting all time units plus business days. Expressions containing $ variables now produce a dynamic CompiledTickExpression instead of re-parsing the string on every query execution. The evaluation algorithm walks elements to emit intervals, applies day filter bitmasks in local time, handles timezone conversion (numeric subtraction or DST-aware toUTC), applies exchange schedule filtering with optional duration, and sorts and merges overlapping intervals. Additionally, Character.isWhitespace and Character.isDigit calls were replaced with Chars.isAsciiWhitespace and Chars.isAsciiDigit throughout IntervalUtils to avoid locale-dependent behavior.
This fix moves the circuit breaker check in DatabaseCheckpointAgent.checkpointCreate() from after ff.sync() to before it. The POSIX sync() system call flushes all dirty filesystem buffers system-wide, not just QuestDB's files. On busy hosts with heavy I/O from other processes, sync() can block for well over the 60-second query timeout. Previously, the circuit breaker was only checked after sync() returned, so a completed checkpoint was discarded when the timeout had been exceeded during the blocking call. With this fix, if the timeout is already exceeded from the table loop or lock acquisition, the operation fails fast without entering a potentially long-blocking sync(). If the timeout has not been exceeded, sync() runs to completion and the checkpoint succeeds regardless of how long sync() takes.
FILL(LINEAR) was disregarding FROM, meaning it would return calendar-aligned results instead of FROM-aligned results. This fix ensures FROM anchors the initial timestamp and TO applies as a bound. There is no intent for this to support fill expansion (i.e., filling before and after the dataset); it is unclear at this stage how pre- and post-filling should operate with linear fills.
Tick expressions containing date variables ($now, $today, etc.) used as filter predicates — e.g., WHERE now() IN '$now-100s..$now' — resolved $now once at compile time and cached the resulting interval as static timestamps. On repeated execution of a cached SQL statement, the interval never updated, causing the filter to return zero rows once now() moved past the originally computed range. The interval model path was already fixed to compile such expressions into IR via CompiledTickExpression, but the filter function path in InTimestampTimestampFunctionFactory was not updated and still used the static function. This fix detects date variables in constant tick expression strings and routes them through CompiledTickExpression, which re-evaluates the interval from IR on each execution via init().
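The difference between caching a resolved interval and re-evaluating it on each execution can be sketched as below. This is a toy model in Python: CompiledNowRange is a hypothetical stand-in for the compiled IR and handles only a range of second offsets relative to $now.

```python
from typing import NamedTuple, Tuple


class CompiledNowRange(NamedTuple):
    """Hypothetical IR for '$now+lo..$now+hi', with offsets in seconds."""
    lo_offset_s: int
    hi_offset_s: int

    def evaluate(self, now_s: int) -> Tuple[int, int]:
        # Only integer arithmetic at run time; no string parsing.
        return (now_s + self.lo_offset_s, now_s + self.hi_offset_s)


compiled = CompiledNowRange(-100, 0)   # models '$now-100s..$now'
cached = compiled.evaluate(1_000)      # old behavior: resolved once and kept
fresh = compiled.evaluate(2_000)       # fixed behavior: resolved per execution
print(cached, fresh)  # (900, 1000) (1900, 2000)
```

A filter that tests the current time against the cached tuple stops matching once the clock passes 1000, which corresponds to the zero-row symptom described above.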
This fix addresses a vulnerability where ChunkedContentParser.parseChunkLength() had no overflow guard on the hex chunk-size parsing loop. A crafted HTTP request with 16+ hex digits in the chunk size would overflow the long chunkSize to a negative value, corrupting the internal buffer pointer. On the next loop iteration, isEol() would dereference the corrupted pointer, causing the JVM to crash with SIGSEGV. This was reachable over the network with an unauthenticated POST request using Transfer-Encoding: chunked. The fix adds an overflow guard before each chunkSize * 16 accumulation step: if chunkSize > Long.MAX_VALUE >>> 4, the next multiply would overflow a positive long, so the input is rejected as a protocol violation. The server logs the violation and disconnects the client cleanly.
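The guard can be illustrated with a short sketch. Python integers do not overflow, so the code below emulates the 64-bit limit explicitly; parse_chunk_length here is a hypothetical re-creation of the idea, not the server's code.

```python
LONG_MAX = (1 << 63) - 1  # Java's Long.MAX_VALUE
HEX_DIGITS = "0123456789abcdefABCDEF"


def parse_chunk_length(hex_digits: str) -> int:
    """Parse a hex chunk size, rejecting values that would overflow a long."""
    if not hex_digits:
        raise ValueError("empty chunk size")
    chunk_size = 0
    for ch in hex_digits:
        if ch not in HEX_DIGITS:
            raise ValueError(f"invalid hex digit {ch!r}")
        # If chunk_size exceeds LONG_MAX >> 4, multiplying by 16 would
        # leave the positive 64-bit range, so reject before accumulating.
        if chunk_size > LONG_MAX >> 4:
            raise ValueError("chunk size overflow")
        chunk_size = chunk_size * 16 + int(ch, 16)
    return chunk_size
```

Under these assumptions the largest accepted input is 7fffffffffffffff; any 16-digit value with the top bit set, and any longer non-zero-prefixed value, is rejected.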
This fix resolves a crash where WINDOW JOIN and HORIZON JOIN would throw a NullPointerException when the right-hand side (slave) is a subquery that applies a timestamp interval filter or reorders columns. The root cause was that SelectedRecordCursorFactory and ExtraNullColumnCursorFactory did not implement newTimeFrameCursor(), which is required by the concurrent time frame cursor infrastructure used in these joins. Additionally, SelectedPageFrameCursor did not implement TablePageFrameCursor, causing a ClassCastException. After fixing those, a subtler double column remapping bug remained: both SelectedConcurrentTimeFrameCursor and SelectedPageFrameCursor were independently remapping column indices through columnCrossIndex, causing symbol table lookups to hit wrong columns and the timestamp column to be read from the wrong position in the address cache. This fix extracts ConcurrentTimeFrameCursor into an interface, adds newTimeFrameCursor() to the affected factories, and simplifies SelectedConcurrentTimeFrameCursor to delegate column remapping entirely to SelectedPageFrameCursor, avoiding double remapping.

February 4, 2026

This release introduces Kubernetes secrets file support for enhanced security in containerized deployments, adds TICK exchange calendar support for financial market applications, and delivers significant Parquet read performance improvements through advanced optimization techniques. The update also addresses several critical bug fixes including a rare WAL file error on Windows replicas, backup log message accumulation issues, and Parquet export corruption under concurrent connections.

New Features

This feature enables reading sensitive configuration values from files using the _FILE suffix convention, allowing native Kubernetes secret file mounts without requiring shell scripts or init containers. It works with both environment variables (QDB_ACL_ADMIN_PASSWORD_FILE) and properties (acl.admin.password.file). All properties marked as sensitive=true are supported, including acl.admin.password, acl.oidc.tls.keystore.password, replication.object.store, cold.storage.object.store, backup.object.store*, and authentication passwords. The _FILE variant takes precedence over direct values, with fallback to direct values when no _FILE variant exists.
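The precedence rule can be sketched as a small lookup. This is an illustrative Python model only: resolve_secret is a hypothetical helper, and stripping trailing whitespace from the file contents is an assumption rather than documented behavior.

```python
from typing import Mapping, Optional


def resolve_secret(env: Mapping[str, str], name: str) -> Optional[str]:
    """Return the value for `name`, preferring the `<name>_FILE` variant."""
    file_path = env.get(name + "_FILE")
    if file_path is not None:
        with open(file_path, encoding="utf-8") as f:
            # Assumption: a trailing newline in the mounted file is ignored.
            return f.read().strip()
    # No _FILE variant: fall back to the direct value, if any.
    return env.get(name)
```

For example, with both QDB_ACL_ADMIN_PASSWORD and QDB_ACL_ADMIN_PASSWORD_FILE set, the sketch returns the contents of the file.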
This feature adds enterprise exchange calendar support, enabling TICK expressions to filter time intervals by stock exchange trading schedules using syntax like 2025-01-24#XNYS for NYSE trading hours. The implementation includes an EntExchangeCalendarService that loads trading schedules for 67 global exchanges from a bundled Parquet file, supports custom schedule overrides via a _exchange_calendars_custom table, and provides lazy-loading with concurrent access protection. It includes a reload_exchange_calendars() function for managing custom schedules and an exchange_calendars() table function for querying effective calendar data. The feature handles multi-session trading days, lunch breaks, and case-insensitive exchange codes.
This feature implements late materialization optimization for Parquet partitions in parallel query execution with selective filters. Instead of decoding all columns upfront, it first decodes only filter columns to identify matching rows, then decodes other columns only for rows that passed the filter, significantly reducing decoding overhead for low-selectivity queries. The feature introduces SlicePageReader for zero-copy memory-mapped page reading, bypassing intermediate Vec<u8> copies. Additionally, it defaults partitionEncoderParquetRawArrayEncoding to true for better performance by using raw array encoding instead of nested LIST decoding.

Bug Fixes

This fix resolves several race condition issues related to replication. Previously, WAL downloader tasks could leak when cancelled, causing conflicts with WalPurgeJob over file locks. The fix replaces JoinSet::shutdown with a CancellationToken to properly notify and wait for all tasks to complete, preventing file handle leaks on Windows. Additionally, this fix replaces the file-based lock mechanism with a centralized semaphore-based system using WALSegmentLockManager to coordinate access between WalPurgeJob, WalWriter, and the WAL downloader.
This fix addresses a race condition where QuestDB startup failures (such as invalid configuration) would cause confusing cat and rm error messages when the hello.txt file is removed by deleteOnExit() before print-hello.sh can process it. The fix redirects stderr to /dev/null for these commands and adds quotes around $HELLO_FILE for robustness with paths containing spaces. The actual startup failure details remain available in stdout-*.txt logs.
This fix resolves an issue where concurrent Parquet exports could produce corrupted or truncated files when multiple clients exported simultaneously through the same HTTP worker. The problem occurred because the ExportQueryProcessor stored per-connection state in per-processor fields, causing connection contexts to be overwritten when workers switched between connections. This resulted in Parquet chunks being written to the wrong response, producing invalid data that failed to open in downstream tools.

January 29, 2026

January 28, 2026

This release introduces significant SQL enhancements including new aggregate functions (arg_min, arg_max, bool_and, bool_or, bit operations, geomean), window functions (VWEMA, EMA, percent_rank), and geospatial functions (within_box, within_radius). The update also features the new TICK (Temporal Interval Calendar Kit) for interval literals and improved timestamp predicate handling. Additionally, the release includes important bug fixes for ILP writes, window joins, parquet operations, and performance optimizations for concurrent queries and parquet decoding.

New Features

This feature adds Volume-Weighted Exponential Moving Average (VWEMA) as a new window function with the signature avg(price, kind, param, volume). It supports three smoothing modes: 'alpha' for direct smoothing factor (0 < α ≤ 1), 'period' for EMA-style period where α = 2/(period+1), and time units ('second', 'minute', 'hour', 'day') for time-weighted decay with tau parameter. The VWEMA formula calculates numerator = α × price × volume + (1-α) × prev_numerator, denominator = α × volume + (1-α) × prev_denominator, and VWEMA = numerator / denominator. For time-weighted mode, α = 1 - exp(-Δt / τ).
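The recurrence above can be written out directly. The sketch below covers only the 'alpha' mode and is not QuestDB's implementation; seeding the state from the first row and requiring positive volumes are assumptions made for the example.

```python
from typing import Iterable, List, Tuple


def vwema(rows: Iterable[Tuple[float, float]], alpha: float) -> List[float]:
    """Running VWEMA over (price, volume) rows for a fixed smoothing factor."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must satisfy 0 < alpha <= 1")
    out: List[float] = []
    num = den = None
    for price, volume in rows:
        if num is None:
            # Assumption: the first row seeds numerator and denominator.
            num, den = price * volume, volume
        else:
            num = alpha * price * volume + (1 - alpha) * num
            den = alpha * volume + (1 - alpha) * den
        out.append(num / den)
    return out


print(vwema([(10.0, 1.0), (20.0, 3.0)], alpha=0.5))  # [10.0, 17.5]
```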
This feature implements arg_min(value, key) and arg_max(value, key) aggregate functions that return the value of the first argument at the minimum/maximum value of the second argument. These functions support 18 type combinations including double, timestamp, long, and uuid types for both value and key parameters. The functions include full parallel execution support with proper merge logic, correct null handling for both keys and values, and UUID comparison using unsigned long comparison. Null keys are ignored, and the functions return null if the value at the min/max key is null.
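The null-handling rules can be illustrated with a small reference model. This is a sketch in Python, with None standing in for SQL NULL; it is not the engine's implementation, and tie-breaking between equal keys (first seen wins here) is an assumption.

```python
from typing import Any, Iterable, Optional, Tuple


def arg_max(pairs: Iterable[Tuple[Any, Any]]) -> Optional[Any]:
    """Return the value paired with the largest non-null key."""
    best_key = None
    best_value = None
    for value, key in pairs:
        if key is None:
            continue  # null keys are ignored
        if best_key is None or key > best_key:
            best_key, best_value = key, value
    # If the value at the winning key is null, the result is null.
    return best_value


print(arg_max([("a", 1), ("b", 3), ("c", None)]))  # b
```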
This feature implements Exponential Moving Average (EMA) as a window function accessible via avg(value, kind, param) syntax. It supports three modes: 'alpha' for direct smoothing factor (0 < alpha ≤ 1), 'period' for N-period EMA where alpha = 2 / (N + 1), and time units ('second', 'minute', 'hour', 'day', 'week') for time-weighted decay using alpha = 1 - exp(-Δt / τ). The function works with both microsecond and nanosecond timestamp precision via TimestampDriver and supports PARTITION BY for computing separate EMAs per group. NULL values are skipped while preserving the previous EMA value.
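The three ways of obtaining the smoothing factor, and the NULL-skipping rule, can be sketched as below. This is an illustrative Python model using the standard EMA recurrence; emitting the previous EMA for a NULL input row and seeding with the first non-null value are assumptions.

```python
import math
from typing import List, Optional, Sequence


def alpha_from_period(n: float) -> float:
    return 2.0 / (n + 1.0)


def alpha_from_elapsed(dt: float, tau: float) -> float:
    # Time-weighted decay: dt and tau must use the same time unit.
    return 1.0 - math.exp(-dt / tau)


def ema(values: Sequence[Optional[float]], alpha: float) -> List[Optional[float]]:
    out: List[Optional[float]] = []
    prev: Optional[float] = None
    for v in values:
        if v is not None:
            prev = v if prev is None else alpha * v + (1 - alpha) * prev
        out.append(prev)  # a None input leaves the previous EMA in place
    return out


print(ema([10.0, None, 20.0], alpha_from_period(3)))  # [10.0, 10.0, 15.0]
```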
This feature adds the bool_and(T) aggregate function, which returns true if all values are true, and the bool_or(T) aggregate function, which returns true if any value is true. Both functions support parallel execution via a merge method, work with GROUP BY and SAMPLE BY clauses, and accept boolean expressions as arguments. They include a constant folding optimization that returns results directly, without row-by-row computation, when the argument is constant.
These functions perform bitwise operations on all non-null values in a column, supporting byte, short, int, and long data types. The BIT_AND() function returns the bitwise AND of all values, BIT_OR() returns the bitwise OR, and BIT_XOR() returns the bitwise XOR. All functions include proper null handling (nulls are skipped), constant folding optimization, and parallel execution support. They return null for empty tables or all-null inputs and can be used with GROUP BY clauses.
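The semantics described above (nulls skipped, null for empty or all-null input) can be modeled in a few lines. This is a reference sketch in Python with None standing in for NULL; the helper names mirror the SQL functions but are not QuestDB code.

```python
import operator
from functools import reduce
from typing import Callable, Iterable, Optional


def _bit_aggregate(op: Callable[[int, int], int],
                   values: Iterable[Optional[int]]) -> Optional[int]:
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None  # empty or all-null input yields null
    return reduce(op, non_null)


def bit_and(values): return _bit_aggregate(operator.and_, values)
def bit_or(values): return _bit_aggregate(operator.or_, values)
def bit_xor(values): return _bit_aggregate(operator.xor, values)


print(bit_and([0b1100, None, 0b1010]))  # 8
```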
This function computes the geometric mean of positive numbers using the formula exp(avg(ln(x))) to avoid overflow with large products. It accepts double values (other numeric types convert implicitly) and includes full parallel execution support. The function returns null for negative values, zero values, or empty groups, following DuckDB semantics. Constant folding optimization is included where geomean(c) = c for any positive constant, avoiding aggregate machinery overhead.
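The formula and the null rules can be expressed compactly. The sketch below is a Python reference model rather than the engine's code; skipping None inputs is an assumption, since the entry does not say how NULL values are treated.

```python
import math
from typing import Iterable, Optional


def geomean(values: Iterable[Optional[float]]) -> Optional[float]:
    """Geometric mean via exp(avg(ln(x))), which avoids a huge product."""
    log_sum = 0.0
    count = 0
    for v in values:
        if v is None:
            continue  # assumption: NULL inputs are skipped
        if v <= 0:
            return None  # zero or negative input yields null
        log_sum += math.log(v)
        count += 1
    return math.exp(log_sum / count) if count else None


print(geomean([1.0, -1.0]))  # None
```

Summing logarithms keeps the intermediate value small even when the plain product of the inputs would overflow a double.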
This feature enables the query optimizer to recognize DATEADD() calls in time intrinsics and transform them for better performance. When a query contains WHERE dateadd('m', 15, timestamp) = '2022-03-08T18:30:00.000Z', the optimizer transforms it to WHERE timestamp = dateadd('m', -15, '2022-03-08T18:30:00.000Z'), allowing existing interval-based partition pruning to work effectively. The implementation adds AST rewriting to WhereClauseParser and handles comparison operators and BETWEEN expressions.
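The rewrite is valid because shifting both sides of the equality by the same amount preserves it. The Python sketch below checks that equivalence on sample timestamps; it models only the arithmetic, not the optimizer, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

TARGET = datetime(2022, 3, 8, 18, 30, tzinfo=timezone.utc)
SHIFT = timedelta(minutes=15)


def original_predicate(ts: datetime) -> bool:
    # dateadd('m', 15, timestamp) = TARGET: computed per row.
    return ts + SHIFT == TARGET


def rewritten_predicate(ts: datetime) -> bool:
    # timestamp = dateadd('m', -15, TARGET): right side is a constant,
    # so it can be matched against partition boundaries up front.
    return ts == TARGET - SHIFT


rows = [TARGET - timedelta(minutes=m) for m in (0, 15, 30)]
print([original_predicate(ts) == rewritten_predicate(ts) for ts in rows])
# [True, True, True]
```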
This feature removes the restriction that prevented window functions from being used as arguments to other functions, enabling queries like SELECT abs(row_number() OVER ()) FROM t and nested window functions such as SELECT sum(row_number() OVER ()) OVER () FROM t. The implementation includes O(1) hash-based deduplication that computes identical window functions only once, significantly improving performance for queries with duplicate window function calls. This feature adds a nesting depth limit of 8 levels to prevent excessive recursion and provides clear error messages when window functions are incorrectly used in PARTITION BY or ORDER BY clauses of window specifications.
This feature adds four new SQL functions optimized for spatial filtering, particularly useful for indexing lidar scans and local area queries. The within_box() function checks if a point is within a rectangular bounding box using Cartesian coordinates, while within_radius() performs circular radius checks. For geographic coordinates, geo_within_radius_latlon() and geo_distance_meters() use equirectangular projection for fast local distance calculations with ~0.1% accuracy for distances under 100km. The implementation includes branchless bit manipulation for optimal vectorized execution, constant folding optimization for compile-time precomputation when reference points are constants, and comprehensive coordinate validation for latitude (-90 to 90) and longitude (-180 to 180) ranges.
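An equirectangular distance calculation can be sketched as follows. This is a generic textbook formulation in Python, not QuestDB's implementation: the Earth radius constant, the use of the mean latitude, and the function names with the _sketch suffix are all assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumption: mean Earth radius in meters


def geo_distance_meters_sketch(lat1: float, lon1: float,
                               lat2: float, lon2: float) -> float:
    """Approximate local distance using an equirectangular projection."""
    lat1_r, lat2_r = math.radians(lat1), math.radians(lat2)
    # Scale the longitude difference by the cosine of the mean latitude.
    x = math.radians(lon2 - lon1) * math.cos((lat1_r + lat2_r) / 2)
    y = lat2_r - lat1_r
    return EARTH_RADIUS_M * math.hypot(x, y)


def geo_within_radius_sketch(lat: float, lon: float, center_lat: float,
                             center_lon: float, radius_m: float) -> bool:
    return geo_distance_meters_sketch(lat, lon, center_lat, center_lon) <= radius_m
```

The projection treats the area around the two points as flat, which is why it is fast and why its accuracy degrades over long distances.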
This feature enables the query optimizer to recognize OR combinations of timestamp IN predicates as intrinsic interval filters, allowing efficient interval forward scans instead of falling back to row-by-row JIT filtering. Previously, queries like WHERE timestamp IN '2018-01-01' OR timestamp IN '2018-01-02' would use slow JIT filters, but now they utilize the same efficient interval scanning as single IN predicates. The implementation extends the WhereClauseParser to extract timestamp intrinsics from OR trees and adds interval union functionality to combine multiple date ranges into optimized scan operations.
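The interval union step can be sketched as a sort followed by a single merge pass. This is a generic Python illustration with inclusive integer bounds; it merges overlapping intervals only, and whether the engine also coalesces adjacent ones is not stated in the entry.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # inclusive (lo, hi) bounds


def union_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    merged: List[Interval] = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            # Overlaps the previous interval: extend its upper bound.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


print(union_intervals([(5, 8), (1, 3), (2, 4)]))  # [(1, 4), (5, 8)]
```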
This feature adds the LENGTH_BYTES() SQL function that returns the number of bytes in a varchar argument. The implementation includes performance optimizations for MIN(varchar) and MAX(varchar) group-by functions by using prefix-first comparison with 6-byte prefixes stored in aux memory, and 8-bytes-at-a-time comparison using longAt() instead of byte-by-byte access when prefixes match.
This feature introduces TICK, a powerful syntax for expressing complex temporal intervals in QuestDB. It enables concise specification of multiple disjoint time intervals with timezone awareness in a single expression. Key capabilities include bracket expansion for dates and times, range expansion with inclusive ranges, timezone support with DST awareness, day-of-week filtering, multi-unit duration specifications, ISO week date format support, dynamic date variables with arithmetic, and Cartesian product combinations. TICK supports features like workday filtering, per-element timezones, business day arithmetic, and complex scheduling patterns.
This feature enables pushing timestamp predicates through virtual models where the timestamp column is derived from a dateadd() function, allowing partition pruning to work even when queries transform the timestamp using dateadd. The implementation detects dateadd(unit, constant, timestamp) patterns in SELECT clauses and annotates models with offset info, wraps pushed predicates in and_offset(predicate, unit, offset) during optimization, and supports chained dateadd through multiple nested subquery levels. Auto-detection of timestamp columns from dateadd expressions works without requiring explicit timestamp(ts) clauses.
This feature implements the SQL standard PERCENT_RANK() window function that returns the relative rank of the current row calculated as (rank - 1) / (total_rows - 1). The function returns 0 if there is only one row in the partition. The implementation includes three specialized classes to handle different use cases: no ORDER BY clause (returns 0 for all rows), ORDER BY without PARTITION BY (uses two passes to compute ranks), and both PARTITION BY and ORDER BY (tracks per-partition row counts).
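The formula can be checked with a small reference model over a single partition. The Python sketch below assumes an ascending ORDER BY on the values themselves and uses standard SQL rank semantics, where tied rows share the rank of their first peer.

```python
from typing import List, Sequence


def percent_rank(values: Sequence[float]) -> List[float]:
    """percent_rank for each row of one partition, ordered ascending."""
    n = len(values)
    if n <= 1:
        return [0.0] * n  # a single-row partition yields 0
    first_position = {}
    for i, v in enumerate(sorted(values)):
        first_position.setdefault(v, i + 1)  # rank = position of first peer
    return [(first_position[v] - 1) / (n - 1) for v in values]


print(percent_rank([10, 30]))  # [0.0, 1.0]
```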
This feature extends the TICK (Temporal Interval Calendar Kit) date variable arithmetic to support additional time units beyond days and business days, including years (y), months (M), weeks (w), hours (h), minutes (m), seconds (s), milliseconds (T), microseconds (u), and nanoseconds (n). Date variable expressions can now be written with or without brackets, making syntax more flexible. Calendar-aware units properly handle varying month lengths and leap years, while sub-day units preserve full microsecond/nanosecond precision when used with $now. This enables precise timestamp arithmetic in interval expressions for high-frequency data analysis.

Improvements

This improvement reduces contention across concurrent queries by avoiding unnecessary partition closures when partitions are in standby for lazy open and by performing mmap operations outside of synchronized blocks in MmapCache. These changes eliminate bottlenecks that could impact query performance under high concurrency scenarios.
This improvement optimizes queries like SELECT dateadd('m', -15, max(ts)) FROM t, where the MAX() aggregate function is used as a function argument. The optimization moves rewriteSingleFirstLastGroupBy after rewriteSelectClause to enable better query execution plans for timestamp aggregation functions.
This improvement optimizes Parquet partition read performance through four key enhancements: reduces memory allocations during page decoding by eliminating intermediate copies for fixed-width columns and batching memory operations; switches default compression from ZSTD(level=9) to LZ4_RAW for faster decompression at the cost of slightly larger files; uses REQUIRED repetition for non-null Symbol columns to skip definition-level decoding; and skips redundant decode operations on ParquetBuffers cache hits which particularly benefits ASOF JOIN scenarios. Combined, these changes deliver approximately 6x read performance improvement compared to the previous implementation, with query times improving from ~17 seconds to ~3.56 seconds in benchmark tests.
This improvement adds streaming output support and unifies the generate/fix/chat/schema flows in a single interface. It includes a retry mechanism for failed requests, fixes re-rendering issues in the chat window, adds abort signals to model calls, and provides a get_table_details tool for investigating tables() query results. It also upgrades Monaco Editor components and resolves content display issues in LiteEditor.

Bug Fixes

This fix resolves an issue where Influx Line Protocol data would be written to the wrong table when a table was renamed while keeping the connection open. The problem occurred because table references were cached in the state of each HTTP connection. The fix introduces a global version counter for renames that triggers cache flushing, WAL rollback, and error return for retry when a rename is detected, preserving request atomicity.
This fix resolves a crash when using window joins with large slave tables. The issue occurred because AsyncWindowJoinRecordCursor.close() was freeing slaveTimeFrameAddressCache before waiting for worker threads to finish, causing workers to access freed memory and trigger segmentation faults. The fix reorders operations to await workers before freeing shared resources and wraps cleanup in try-finally blocks to ensure resources are always freed.
This fix prevents segmentation faults that could occur when a timeout happened during vectorized parallel GROUP BY queries. The issue was caused by frame memory pools used by worker threads being released without properly waiting for all tasks to be processed, creating a race condition where workers could access freed memory concurrently with the aggregation process.
This fix resolves an issue where SQL line comments containing unbalanced single quotes (e.g., -- magic ') would cause the lexer to incorrectly enter string-parsing mode and consume tokens past the newline, resulting in subsequent SQL tokens being lost from the query. The fix improves the line comment parsing logic to properly detect newlines within tokens that span multiple lines and repositions the lexer correctly to reset its state.
This fix resolves failures in GROUP BY queries with many aggregate functions and includes significant performance optimizations for SQL compilation. It adds SimpleGroupByFunctionUpdater as a loop-based fallback that avoids JVM bytecode limits when the aggregate function count exceeds 32. Alias generation is optimized from O(n²) to O(1) amortized complexity by using LowerCaseCharSequenceIntHashMap to track sequence numbers. Duplicate aggregate detection is improved from O(n²) to O(n) using hash-based comparison with ExpressionNode.deepHashCode(). Compilation time for queries with 6K aggregate functions improved from 4 seconds to around 300ms.
This fix resolves an issue where symbols defined with DECLARE statements were not properly recognized when used as function arguments during SQL parsing.
This fix resolves Parquet read failures that occurred in window chain join queries. The issue was caused by chained window joins incorrectly using GenericRecordMetadata.copyOf() instead of the required GenericRecordMetadata.deepCopyOf() method.
This fix resolves a bug where WINDOW JOIN queries using aliased columns in expressions with aggregates would fail with an 'Invalid column' error. The issue occurred when the same column was both aliased and used in an expression with aggregates. The fix ensures proper column resolution by passing the correct model to aggregate processing and handling the special case where translating and group-by models are the same in window joins.
This fix resolves an issue where executing multiple PostgreSQL Wire Protocol commands with the same SQL but different parameters caused the second command to return corrupted data. The problem occurred because msgBindCopySelectFormatCodes() incorrectly used factory != null to determine if format codes should be set, but when a prepared statement was reused via copyIfExecuted(), the new entry had factory=null, causing format codes to be skipped and clients to misinterpret binary/text format. The fix replaces the factory check with hasResultSet() which uses sqlType for proper determination.
This fix resolves a bug where ORDER BY <position> failed with "Invalid column" error when used with a CTE containing aggregation and a window function expression. The issue occurred because the position-to-name resolution logic in SqlOptimiser.rewriteOrderByPosition() only checked for GROUP_BY models when determining which model's columns to use, but with window functions the model structure includes a WINDOW model between VIRTUAL and GROUP_BY models. The fix extends the condition to also check for SELECT_MODEL_WINDOW type.
This fix resolves Parquet export compatibility issues with strict readers like PyArrow and Trino for SYMBOL columns spanning multiple row groups. The fix generates a single dictionary page per column chunk instead of incorrectly writing multiple pages, and corrects RLE/bitpack encoding padding to ensure proper format compliance.

January 19, 2026

January 14, 2026

QuestDB 9.3.1 follows the major 9.3.0 release, focusing on stability, correctness, and performance refinements based on early feedback and production usage. This release delivers important fixes across joins, views, and checkpointing, alongside continued performance improvements on hot SQL execution paths.

New Features

This feature implements KSUM() as a window function using the Kahan summation algorithm for improved floating-point precision. This complements the existing KSUM() aggregate function by enabling its use in window contexts and supports all standard window frame types including ROWS, RANGE, partitioned, unbounded, and sliding windows. The function can be used for whole partitions, cumulative sums, sliding windows, and range-based windows.
This feature enables window functions to participate in arithmetic expressions and other operations. Previously, window functions could only appear as standalone SELECT columns. Window function parsing has been moved from SqlParser to ExpressionParser, allowing window functions to be parsed as part of the expression tree. The implementation maintains zero-GC compliance and includes proper nested window function detection to reject invalid contexts while allowing valid use cases like CASE expressions.
This feature exposes the actual data timestamp range of each table through new table_min_timestamp and table_max_timestamp columns in the tables() function. These values are updated when WAL transactions are merged into tables and during startup hydration. Additionally, the column previously named table_max_timestamp was renamed to table_last_write_timestamp for clarity, and dedup_row_count_since_start was renamed to wal_dedup_row_count_since_start for consistency.

Improvements

This improvement implements streaming Parquet export for HTTP queries, eliminating temporary tables and files and improving performance. The export streams directly from PageFrames when supportsPageFrameCursor() is true, and falls back to creating a temporary table and then streaming the export when a PageFrameCursor is not supported. Currently available for the HTTP /exp endpoint only, this change provides modest performance improvements and enables future parallelization opportunities for row group generation.
This improvement significantly reduces GC pressure and jitter in parallel queries through several key optimizations. The PageFrameAddressCache now uses flat DirectLongList structures instead of nested objects, reducing ~20K object allocations per 1000 frames to just 4 contiguous off-heap lists. Decimal flyweights in map values are lazily initialized to avoid allocations when unused. The GroupByAllocator replaces LongLongHashMap with off-heap DirectLongLongHashMap to prevent per-worker allocations. Page frame size calculations are improved to avoid tiny trailing frames, and work stealing strategy is tuned with a new cairo.sql.parallel.work.stealing.spin.timeout configuration (default 50µs). Benchmarks show 67% fewer GC events, 57% less memory usage, and significantly reduced GC pause times.
This improvement materializes functions within the same-level queryModel that are referenced by other projection columns, eliminating redundant expression evaluations and improving query performance when columns depend on other computed columns.

Bug Fixes

This fix resolves the "left side of time series join doesn't have ASC timestamp order" error that occurred when ORDER BY timestamp DESC was used in a WINDOW JOIN query.
This fix properly handles tables created after a checkpoint was taken during checkpoint restore. Previously, table directories created after the checkpoint would remain on disk, potentially causing inconsistencies or orphaned data. The solution iterates over all table directories in the database root after recovering checkpoint contents, compares each directory against the checkpoint contents, and removes any table directories that do not exist in the checkpoint.
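The cleanup step described above can be sketched roughly as follows. This is a simplified Python illustration, not QuestDB code: the function name is hypothetical, and it assumes every directory directly under the database root is a table directory.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List, Set


def remove_dirs_missing_from_checkpoint(db_root: Path, checkpoint_tables: Set[str]) -> List[str]:
    """Delete directories under db_root that the checkpoint does not contain."""
    removed = []
    for entry in sorted(db_root.iterdir()):
        # Assumption: every directory directly under db_root is a table directory.
        if entry.is_dir() and entry.name not in checkpoint_tables:
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("trades", "quotes", "created_after_checkpoint"):
        (root / name).mkdir()
    # Only "trades" and "quotes" existed when the checkpoint was taken.
    removed = remove_dirs_missing_from_checkpoint(root, {"trades", "quotes"})
    remaining = sorted(p.name for p in root.iterdir())
    print(removed, remaining)
```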
This fix resolves two race conditions in view state management. The first issue occurred when multiple ViewCompilerJob tasks ran concurrently, allowing a task with an older timestamp to overwrite newer, correct view state. A timestamp check was added to updateViewState() to skip updates when the update timestamp is older than the current view state timestamp. The second issue involved views being created with empty metadata that was hydrated before asynchronous compilation completed, causing queries to see empty designatedTimestamp in tables(). This fix passes compiled RecordMetadata directly to createViewState() so proper column information is available immediately.
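The timestamp guard for the first race can be pictured with a small Python sketch. The class and field names below are assumptions for illustration; only the rule "skip an update that is older than the stored state" comes from the description above.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ViewState:
    updated_at: int  # timestamp of the compile task that produced this state
    invalid: bool


class ViewStateStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._states: Dict[str, ViewState] = {}

    def update_view_state(self, view: str, updated_at: int, invalid: bool) -> bool:
        """Apply the update unless a newer state is already stored."""
        with self._lock:
            current = self._states.get(view)
            if current is not None and updated_at < current.updated_at:
                return False  # stale result from an older compile task
            self._states[view] = ViewState(updated_at, invalid)
            return True

    def get(self, view: str) -> Optional[ViewState]:
        with self._lock:
            return self._states.get(view)


store = ViewStateStore()
store.update_view_state("v_trades", updated_at=200, invalid=False)  # newer task finishes first
applied = store.update_view_state("v_trades", updated_at=100, invalid=True)  # older task arrives late
print(applied, store.get("v_trades"))
```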
This fix resolves an issue where views became suspended after being altered due to missing metadata. The fix ensures metadata is properly restored after view alterations, preventing the suspended state that would make views unusable.
This fix addresses an edge case in materialized view column merging where fixed columns could be corrupted during merge operations. The fix ensures data is copied to the correct memory location (end of old data + column tops) rather than to the end of newly calculated data size, preventing potential data corruption in materialized views.
This fix resolves an issue where the dedup logic would leave behind unused partition directories from O3 merge preparation when appending data directly to existing partitions. The Partition Purge job would then incorrectly count these orphaned directories as the next valid partition version, leading to incorrect partition version tracking and subsequent file not found errors.
This fix resolves UnsupportedOperationException crashes in ASOF JOIN queries when using the Light join factory with multiple join keys including a symbol column. The single-symbol optimization was incorrectly applied when there were multiple join keys, creating a map that only supported INT keys while the record copier attempted to write all column types. In some cases this could also silently return incorrect results.
This fix resolves a race condition where jemalloc background threads were failing to start reliably (~20% failure rate) when QuestDB was launched via questdb.sh, preventing memory from being released back to the OS properly. The solution switches to QuestDB's own jemalloc fork and updates to the latest dev branch commit. Additionally, LD_PRELOAD is now used to load jemalloc only in the Java process rather than the bash script.
This fix removes unwanted scrolling behavior on wheel events from code blocks and assistant suggestions, adds "Open in editor" action to all overflown LiteEditor instances, prevents unnecessary rerenders in assistant response markdown preview, and resolves a bug with context badge clicks where highlighting would fail on removed or modified context queries.
This fix resolves a flickering issue when hovering over news thumbnail images in the right panel. The zoom magnification was rapidly appearing and disappearing due to hover state issues caused by the hoverTimeout variable resetting on every re-render. The fix uses useRef for hover timeout persistence across re-renders and tracks imageToZoom state with a ref to prevent dismissal when zoom is already visible. Additionally, click-to-dismiss functionality was added on both the overlay and zoomed image for better user experience.
This fix addresses a regression where HTML entities like &nbsp;, &lt;, and &gt; displayed as literal text instead of their corresponding characters in grid cells. An unescapeHtml function was added to decode HTML entities to their actual characters before setting textContent, restoring proper entity rendering while maintaining XSS protection that was implemented in the previous security fix.

January 9, 2026

QuestDB 9.3.0 is now available, bringing a new wave of query expressiveness and usability improvements across the engine and the Web Console. This release introduces window joins for precise time-based analytics, database views for cleaner query composition, AI-assisted workflows in the web console, and the new PIVOT keyword for effortless wide-schema aggregations.

New Features

This feature introduces a new WINDOW JOIN syntax that enables time-based aggregations by joining each row from the left table with matching rows from the right table within a specified time window. The syntax supports both INCLUDE PREVAILING (default) and EXCLUDE PREVAILING clauses, where the former includes the latest right-hand table row before the interval. The implementation uses SIMD instructions for symbol-based joins and supports multi-threaded execution for large datasets. Window joins can be chained in the same query but cannot be mixed with other join types at the same query level.
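To make the semantics concrete, here is a rough Python model of a window join that sums right-hand values per left-hand row. It is a sketch under stated assumptions, not the engine's algorithm: the window is taken as the inclusive range [t - lo, t + hi], the prevailing row is assumed to be the last right-hand row strictly before the window's lower bound, and symbol matching, SIMD, and multi-threading are left out.

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence, Tuple


def window_join_sum(
    left_ts: Sequence[int],
    right_rows: Sequence[Tuple[int, int]],  # (timestamp, value), sorted by timestamp
    lo: int,
    hi: int,
    include_prevailing: bool = True,
) -> List[int]:
    """For each left timestamp t, sum right values with timestamps in [t - lo, t + hi]."""
    right_ts = [ts for ts, _ in right_rows]
    sums = []
    for t in left_ts:
        start = bisect_left(right_ts, t - lo)   # first right row inside the window
        end = bisect_right(right_ts, t + hi)    # one past the last right row inside it
        if include_prevailing and start > 0:
            start -= 1  # assumption: also take the latest row before the window
        sums.append(sum(value for _, value in right_rows[start:end]))
    return sums


right = [(1, 10), (5, 20), (9, 30), (20, 40)]
print(window_join_sum([10], right, lo=2, hi=0))                            # [50]
print(window_join_sum([10], right, lo=2, hi=0, include_prevailing=False))  # [30]
```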
This feature refactors the glob() function to use native glob matching instead of Java regex, providing compatibility with DuckDB's glob syntax. The implementation supports standard glob wildcards including * for any number of characters, ** for subdirectories, ? for single characters, [abc] for character sets, and [a-z] for character ranges.
This feature improves table drop performance by closing idle WAL writers faster, which enables quicker deletion of WAL and table files from disk. Previously, dropped tables would generate repeated logging about pending WAL segments, but now the cleanup process completes more efficiently.
This feature enables using VARCHAR array bind variables with symbol, varchar, and string columns in IN expressions through PostgreSQL Wire Protocol. This allows reusing prepared statements with different value lists, reducing query compilation overhead for applications that frequently execute similar queries with varying parameter sets.
This feature pushes column projection down to the Parquet decoder, reading only required columns instead of all columns. The optimization also shares Parquet metadata across threads for multi-threaded reads, parsing metadata once per file. The performance improvement is dramatic: ClickBench Q1 execution time dropped from 712 seconds to 787ms with multithreading enabled.
This feature adds real-time table write statistics tracking through new columns in the tables() SQL function. New columns include rowCount (approximate row count), pendingRowCount (WAL rows not yet applied), dedupeRowCount (cumulative deduplicated rows), lastWriteTimestamp (last commit timestamp), writerTxn and sequencerTxn (transaction numbers), memoryPressureLevel (0-2 pressure indicator), and various histogram statistics for transaction sizes, write amplification, and merge throughput. The feature uses a new RecentWriteTracker with lock-free reads and bounded memory, supporting both WAL and non-WAL tables with different column semantics. Statistics are approximations updated when writers return to pool, with WAL tracking being real-time and LRU eviction maintaining memory bounds.
This feature introduces the PIVOT keyword, which is a specialized GROUP BY query that helps group a selection of rows into columns, essentially pivoting from a narrow schema to a wide schema. The implementation supports single and multiple columns, per-column aliasing, FOR-IN expressions with single or multiple expressions, GROUP BY, ORDER BY, LIMIT, and dynamic IN lists using subqueries. The feature also includes UNPIVOT functionality to reverse simple pivot operations.
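The narrow-to-wide reshaping that PIVOT performs can be modelled in a few lines of Python. This sketch only mirrors the idea of a grouped sum spread across columns; the function name, the column names, and the choice to drop rows whose pivot value is not in the list are assumptions for the example, not a description of QuestDB's behaviour.

```python
from typing import Any, Dict, List, Sequence


def pivot_sum(
    rows: Sequence[Dict[str, Any]],
    group_col: str,
    pivot_col: str,
    value_col: str,
    pivot_values: Sequence[str],
) -> List[Dict[str, Any]]:
    """Group by group_col and emit one summed column per entry in pivot_values."""
    wide: Dict[Any, Dict[str, Any]] = {}
    for row in rows:
        if row[pivot_col] not in pivot_values:
            continue  # assumption: values outside the IN list are ignored
        out = wide.setdefault(row[group_col], {p: None for p in pivot_values})
        current = out[row[pivot_col]]
        out[row[pivot_col]] = row[value_col] if current is None else current + row[value_col]
    return [{group_col: key, **cols} for key, cols in sorted(wide.items())]


trades = [
    {"side": "buy", "symbol": "BTC", "amount": 1},
    {"side": "buy", "symbol": "ETH", "amount": 2},
    {"side": "sell", "symbol": "BTC", "amount": 3},
    {"side": "buy", "symbol": "BTC", "amount": 4},
]
wide_rows = pivot_sum(trades, "side", "symbol", "amount", ["BTC", "ETH"])
print(wide_rows)
```

Each distinct value listed for the pivot column becomes its own output column, and a group with no matching rows for a column gets None, standing in for SQL NULL.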
This feature introduces support for database views, which are virtual tables defined by a SELECT statement. Views do not store data themselves; instead, their defining query is executed as a sub-query whenever the view is referenced. The implementation includes CREATE VIEW, DROP VIEW, ALTER VIEW, CREATE OR REPLACE VIEW, COMPILE VIEW, SHOW CREATE VIEW, and views() commands. Views automatically recompile when operations occur on their dependencies and support metadata persistence, state management, and integration with the Web Console.
This feature introduces a comprehensive AI Assistant with conversational interface in the right sidebar, enabling users to interact with AI models (Anthropic Claude & OpenAI GPT) for SQL assistance. The assistant provides query explanations, automatic error fixing, schema explanations, and SQL generation from natural language. It includes Monaco Editor integration with glyph margin icons, inline diff view for reviewing AI suggestions, and persistent chat history stored in IndexedDB. The feature supports multiple AI providers with configurable models and includes token usage tracking and conversation compaction for context management.

Improvements

This improvement prevents LimitRecordCursorFactory.toPlan() from calling the expensive calculateSize() method during EXPLAIN queries, which previously executed the complete query. The optimization also avoids unnecessary calculateSize() calls in hasNext() when LIMIT bounds are non-negative and in size() when the size isn't cheaply available. Additionally, this improvement clarifies LIMIT semantics to align with documentation and fixes related bugs in cursor classes.
This improvement introduces comprehensive optimizations to JIT-compiled filter execution including short-circuit evaluation for scalar AND/OR predicate chains, automatic predicate reordering by estimated selectivity, and multiple caching mechanisms to reduce redundant memory loads. The optimization includes SIMD scatter short-circuiting that skips expensive operations when no matches are found in a batch. Performance improvements range from 23-69% faster execution depending on the query pattern, with the biggest gains seen in highly selective filters.
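Why reordering by selectivity pays off with short-circuit evaluation can be shown with a small Python sketch. The selectivity estimates and helper names are made up for the example; the JIT compiler's own estimation logic is not described here.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]
Predicate = Tuple[float, Callable[[Row], bool]]  # (estimated selectivity, test)


def order_for_and(predicates: List[Predicate]) -> List[Predicate]:
    """Put the predicate expected to pass the fewest rows first."""
    return sorted(predicates, key=lambda p: p[0])


def row_matches(row: Row, predicates: List[Predicate]) -> bool:
    # all() stops at the first False, so later predicates are skipped.
    return all(test(row) for _, test in predicates)


calls = {"rare": 0, "common": 0}


def is_rare(row: Row) -> bool:
    calls["rare"] += 1
    return row["status"] == 7


def is_common(row: Row) -> bool:
    calls["common"] += 1
    return row["price"] > 0


rows = [{"status": i % 100, "price": i + 1} for i in range(1000)]
ordered = order_for_and([(0.99, is_common), (0.01, is_rare)])
matched = sum(row_matches(row, ordered) for row in rows)
print(matched, calls)
```

With the selective predicate first, the common predicate runs only for the rows that survive it, which is where the gain on highly selective filters comes from.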
This improvement implements a fast-path for count-only queries that avoids materializing matching rows by incrementing a counter instead. The optimization eliminates row ID scatter logic, memory writes, and conditional branches in the hot loop, reducing loop instructions by ~70%. Performance improvements range from 60-92% for queries with many matching rows, with the biggest gains for NEQ predicates that match large result sets.
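The difference between the two paths can be sketched in Python. This is only a conceptual model with hypothetical function names; the actual gain comes from the generated machine code, which the sketch does not reproduce.

```python
from typing import Callable, List, Sequence


def matching_row_ids(values: Sequence[int], predicate: Callable[[int], bool]) -> List[int]:
    """General path: materialize the ID of every matching row."""
    row_ids = []
    for row_id, value in enumerate(values):
        if predicate(value):
            row_ids.append(row_id)  # one memory write per match
    return row_ids


def count_matching(values: Sequence[int], predicate: Callable[[int], bool]) -> int:
    """Count-only path: no row IDs are stored, only a counter is bumped."""
    count = 0
    for value in values:
        count += bool(predicate(value))  # adds 1 for a match, 0 otherwise
    return count


values = list(range(10))
not_three = lambda v: v != 3
print(len(matching_row_ids(values, not_three)), count_matching(values, not_three))
```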
This improvement enables parallel processing of the top-K selection phase when GROUP BY results are sharded due to high cardinality. The optimization processes each shard in parallel using worker threads, then merges results. New configuration properties cairo.sql.parallel.groupby.topk.threshold (default 5M) and cairo.sql.parallel.groupby.topk.queue.capacity control when parallel execution is enabled. ClickBench queries show speedups ranging from 1.3x to 11.2x for the ORDER BY + LIMIT phase.
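The per-shard top-K followed by a merge can be sketched as below. This is an illustrative Python model that assumes each group key lives in exactly one shard; the function names are hypothetical and thread pool sizing is left at the library default.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

KeyCount = Tuple[str, int]


def shard_top_k(shard: Dict[str, int], k: int) -> List[KeyCount]:
    """Top-k (key, aggregate) pairs of one shard, largest aggregate first."""
    return heapq.nlargest(k, shard.items(), key=lambda kv: kv[1])


def parallel_top_k(shards: List[Dict[str, int]], k: int) -> List[KeyCount]:
    """Select top-k per shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda shard: shard_top_k(shard, k), shards))
    candidates = [kv for partial in partials for kv in partial]
    return heapq.nlargest(k, candidates, key=lambda kv: kv[1])


shards = [{"a": 5, "b": 1}, {"c": 9, "d": 7}, {"e": 3}]
print(parallel_top_k(shards, 2))
```

Because keys do not repeat across shards, any row in the global top K must also be in its own shard's top K, so merging the partial lists loses nothing.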
This improvement optimizes non-keyed GROUP BY queries with MIN() or MAX() functions over designated timestamp columns by using O(1) operations instead of SIMD native functions. The optimization takes the first/last value directly, resulting in significant performance gains for queries like SELECT MIN(EventTime), MAX(EventTime) FROM table.

Bug Fixes

This fix addresses a crash that occurred when calling touch() on a table with newly added columns. When a new column is added to a table, it is initially empty and uninitialised, causing frame.getPageAddress to return a null pointer address. The touch() function did not check for this possibility and would attempt to use the null pointer, resulting in a segmentation fault. This fix adds proper null pointer checks for both columns and indexes to prevent crashes.
This fix adds support for the SHOW default_transaction_read_only metadata query, which is required by npgsql when building data sources with multiple hosts. Without this query support, connection sources would fail to initialise when using PostgreSQL Wire Protocol clients that depend on this metadata.
This fix resolves an issue where the tables() function would display an incorrect designated timestamp column when a column type change occurred on any column positioned before the designated timestamp column in the table schema. The function now correctly identifies and displays the designated timestamp column regardless of schema modifications.
This fix resolves a limitation that prevented copying data between tables with thousands of columns (6000+) due to Java bytecode method size restrictions. The solution introduces multiple copy strategies: a looping variant for very wide tables, and a chunked variant that splits large copy operations into smaller sub-methods to maintain optimal performance. The chunked approach allows each sub-method to be individually optimized by the C2 compiler, significantly improving performance for wide table operations while maintaining compatibility with extremely wide schemas.
This fix resolves an issue where nested SAMPLE BY queries with an ORDER BY clause incorrectly failed with "ASC order over TIMESTAMP column is required but not provided". The problem occurred when SAMPLE BY pushed timestampRequired=true onto a context stack, which propagated to nested models even when an ORDER BY clause would re-sort the data anyway. This fix ensures that when an ORDER BY clause is present, timestampRequired=false is pushed to signal that nested models don't need to maintain timestamp ordering.
This fix resolves an issue where EXPLAIN UPDATE on WAL tables was throwing UnsupportedOperationException because the anonymous RecordCursorFactory created during WAL serialization did not implement getCursor(). The fix replaces the anonymous factory with EmptyTableRecordCursorFactory and improves the explain plan output to show table names when available.
This fix resolves issues when using quoted column names containing dots in JOIN operations, ensuring proper metadata handling for such column references.
This fix resolves a logic bug that caused each inserted row during Parquet export processes to be committed individually as single-row commits, significantly impacting performance during data copying operations.
This fix improves native library loading by auto-detecting native libraries in jlink-ed runtime images. The improvement uses java.home to locate native library directories and only activates when running from jrt: protocol (jlink runtime), eliminating the need for -Dquestdb.libs.dir in start scripts.
This fix addresses an overflow issue when incrementing quotient digits during decimal division. The increment loop in DecimalKnuthDivider.java was updated to use the actual length of the q array, preventing possible misses when incrementing quotient digits during rounding and improving the reliability and correctness of the decimal division logic.
This fix resolves an issue where VARCHAR columns larger than the PostgreSQL Wire Protocol send buffer would fail, making behavior consistent with array columns which already supported fragmented sending. Additionally, this fix addresses a latent bug where the partial-send array offset wasn't being reset when a row couldn't fit in the buffer and had to be abandoned via resetIncompleteRecord(), which would create malformed packets on retry.
This fix corrects how the Web Console parses queries that contain DECLARE statements. Query keys have the form ${queryText}@${startOffset}-${endOffset}; with DECLARE statements, the Web Console was parsing queries incorrectly, causing the AI chat window to open with wrong information and the editor to shift query keys incorrectly.
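For readers unfamiliar with the key format, the following Python sketch builds and splits a key of the form ${queryText}@${startOffset}-${endOffset}. The helper names are hypothetical and this is not the Web Console's code; it only shows why the offsets have to be taken from the end of the key, since the query text itself may contain "@".

```python
from typing import Tuple


def make_query_key(query_text: str, start_offset: int, end_offset: int) -> str:
    """Build a key in the form queryText@startOffset-endOffset."""
    return f"{query_text}@{start_offset}-{end_offset}"


def parse_query_key(key: str) -> Tuple[str, int, int]:
    """Split a key back into its parts, using the last '@' as the separator."""
    query_text, _, offsets = key.rpartition("@")
    start, _, end = offsets.partition("-")
    return query_text, int(start), int(end)


key = make_query_key("SELECT 'a@b' FROM t", 10, 29)
print(key)
print(parse_query_key(key))
```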
This fix prevents execution of foreign code inside the Web Console through malicious queries containing HTML iframe elements with embedded scripts.
This fix adds the ability to abort a query in progress while the query body is being consumed. When running a single query or all queries in a tab, the Web Console aborts the ongoing operation and starts the new one if the new operation arrives before the old one has completed. It also prevents the editor from inferring '\t' as a valid query start character by treating it the same as ' '.
This fix adds exponential backoff retry with 5 attempts and delays of 1s, 2s, 4s, 8s, 16s (capped at 30s) when POSTing telemetry to fara.questdb.io. The hourly telemetry loop now continues even after max retries are exhausted, and if the /config endpoint fails during startup, the loop still starts and retries the next hour. Previously, a single failed POST would permanently stop the telemetry loop until the user refreshed the page, causing telemetry gaps during transient network issues or brief outages.
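The retry schedule can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and it assumes one initial attempt followed by up to five retries, which is one reading of "5 attempts" above.

```python
import time
from typing import Callable, List


def backoff_delays(retries: int = 5, base: float = 1.0, cap: float = 30.0) -> List[float]:
    """Delays that double on each retry and never exceed the cap."""
    return [min(base * 2 ** i, cap) for i in range(retries)]


def post_with_retry(
    post: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
    retries: int = 5,
) -> bool:
    """Try once, then retry with exponential backoff; report whether any attempt succeeded."""
    if post():
        return True
    for delay in backoff_delays(retries):
        sleep(delay)
        if post():
            return True
    return False  # the caller keeps its hourly loop running and tries again next hour


print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Passing the sleep function as a parameter keeps the schedule testable without real waiting; the 30s cap only takes effect if more than five retries are configured.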

December 18, 2025

This release focuses primarily on bug fixes and performance improvements for replication functionality. Key fixes include resolving suspended table issues on replicas, optimizing segment collection at startup, and addressing metadata file corruption after materialized view recreation. Additional improvements were made to SQL query processing and session management stability.

Improvements

This improvement enhances startup performance by modifying the directory walking logic to skip partition traversal and moving the missed segment collection to run asynchronously as part of each table uploader's logic. The change eliminates startup blocking and enables parallel directory traversal across multiple table uploaders.

Bug Fixes

This fix resolves a race condition where the _event file was copied to the segment directory before the _event.i file during replication. When WalEventReader attempted to map the _event.i file using the max transactions value from the already-updated _event file, it could result in reading beyond the file size via mmap, causing a SIGBUS signal and subsequent java.lang.InternalError. The fix ensures that _event.i file is copied before the _event file to maintain proper synchronization.
This fix ensures that outer LIMIT clauses are properly applied even when the subquery contains both a filter condition and its own LIMIT clause, preventing incorrect query result truncation.
This fix resolves use-after-free bugs in HttpSessionStore caused by OidcUserInfoHolder containing CharSequences backed by parser buffers with limited lifetime. The fix uses immutable String copies from CachedUserInfo instead of passing parser buffer references directly, preventing memory corruption when parser state is reused.

QuestDB Enterprise 3.3.3
Based on QuestDB OSS 9.4.3

QuestDB Enterprise 3.3.1
Based on QuestDB OSS 9.4.2

QuestDB Enterprise 3.3.0
Based on QuestDB OSS 9.4.1

QuestDB Enterprise 3.2.4
Based on QuestDB OSS 9.3.4

QuestDB Enterprise 3.2.3
Based on QuestDB OSS 9.3.3

QuestDB Enterprise 3.2.2
Based on QuestDB OSS 9.3.2

QuestDB Enterprise 3.2.1
Based on QuestDB OSS 9.3.2

QuestDB Enterprise 3.2.0
Based on QuestDB OSS 9.3.1

QuestDB Enterprise 3.1.2
Based on QuestDB OSS 9.2.3