QuestDB Enterprise 3.3.3 is a stability-focused patch release on the QuestDB 9.4.x engine that concentrates on reliability and correctness: hardening replication and backup around snapshot restore and role changes, and pulling in a large batch of Parquet, posting/covering-index, storage and SQL-correctness fixes from the OSS engine. Alongside the fixes it adds two Enterprise capabilities — a hot in-place primary/replica role switch and finer-grained replication control (replication.disabled.tables plus rebase-aware replication).
New Features
This feature introduces two replication capabilities. First,
replication.disabled.tables provides a reloadable, comma-separated list of tables to exclude from replication; both the uploader and downloader skip listed tables. Second, replication-aware handling of ALTER TABLE REBASE WAL ensures a rebased table is re-baselined correctly on replicas instead of being rebuilt from an empty state. The uploader detects a rebased table via a permanent _rebase_new marker, skips its empty seed transaction, and uploads from seqTxn 2, recording first_txn=2 in index.msgpack. A replica refuses to build such a table from the object store when it is missing the baseline below first_txn, prompting an operator to copy the table and resume. The previously separate replication status checks are consolidated into a single getReplicationStatus(dirName) callback returning ACTIVE, SUSPENDED, or DISABLED. An additional reloadable switch, replication.primary.pause.upload.on.suspended.tables (default false), allows pausing uploads of WAL-apply-suspended tables. For permissions, ALTER TABLE … SUSPEND WAL and RESUME WAL are both authorized by the single RESUME WAL permission, while REBASE WAL requires SYSTEM ADMIN privilege due to its destructive nature.
This feature adds controls for managing hard-suspended WAL tables and introduces
ALTER TABLE <t> REBASE WAL to rebuild a table under a fresh sequencer. The cairo.wal.apply.suspended.tables configuration provides a reloadable, comma-separated list of table names that ApplyWal2TableJob skips, preventing WAL transaction application. When cairo.wal.apply.suspended.write.denied is set to true, writes to a hard-suspended table are rejected instead of being queued. Non-structural ALTER statements and FORCE DROP PARTITION on suspended tables are routed through a WAL-bypass path and applied directly via TableWriter, while structural changes remain denied. REBASE WAL clones the applied table into a new directory via hard links into a .rebase/ staging directory, resets _txn and _meta with a new tableId, seeds two empty transactions, swaps the name registry, and drops the old directory. Preconditions require the table to be a WAL table and hard-suspended, with cairo.wal.apply.suspended.write.denied=true. Rebasing a base table invalidates dependent materialized views (recoverable with REFRESH MATERIALIZED VIEW <v> FULL), while rebasing a materialized view itself re-registers it for a full refresh. Pending unapplied WAL transactions are discarded by a rebase. For permissions, SUSPEND WAL and RESUME WAL share the same authorization path, while REBASE WAL requires system-admin privilege due to its destructive nature.
This feature introduces opt-in memory limits that cap how much native memory a single bounded workload may allocate, throwing at the offending allocation site when the cap is crossed so a runaway is stopped at its source while unrelated workloads keep running. Three independent, dynamically reloadable configuration properties control the limits:
cairo.query.memory.limit.bytes for user SQL queries, cairo.mat.view.refresh.memory.limit.bytes for materialized view refreshes, and cairo.wal.apply.memory.limit.bytes for WAL apply. All default to 0 (unlimited). On breach, the engine throws a CairoException with a distinct message identifying the workload type, query ID, limit, and memory tag. A MemoryTracker wraps a 16-byte native block shared with Rust, acquired per workload via QueryRegistry.register/unregister. Coverage includes the map family, sort/tree chains, hash-join chains, FastGroupByAllocator and function state, LATEST BY rowid lists and maps, set-operation maps, encoded sort, secondary joins, window/horizon-join aggregations, SAMPLE BY fill, and Parquet decode buffers. The vectorized (Rosti) keyed GROUP BY hash tables remain on the global RSS counter only. The query_activity view gains memory_used and memory_limit columns. Tracker-aware pooled memory classes release their native backing on cursor close and re-allocate on next use to ensure each malloc and its matching free are charged to the same tracker.
This feature introduces statistical aggregate functions for computing kurtosis and skewness. The implementation uses Pébay's one-pass online algorithm (an extension of Welford's) for numerical stability, maintaining running mean and central moments (M2, M3, M4) per group. Partial aggregates merge with pairwise-combine formulas, enabling parallel
GROUP BY execution. Sample skewness returns null for fewer than 3 values; sample kurtosis returns null for fewer than 4. The sample variants apply Fisher's bias correction. Population variants return null for empty groups. All variants return null when every observation is equal (zero variance). Example usage:
SELECT skewness(price), kurtosis(price) FROM trades WHERE symbol = 'BTC-USD' SAMPLE BY 1h;
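The one-pass update and the pairwise merge described above can be sketched in a few lines of Python. This is an illustrative sketch of the general technique, not the engine's implementation: the names Moments, merge, sample_skewness and sample_excess_kurtosis are hypothetical, and reporting kurtosis as excess kurtosis is an assumption that the text above does not confirm.

```python
import math
from dataclasses import dataclass


@dataclass
class Moments:
    """Running count, mean and central moment sums for one group (hypothetical name)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0
    m3: float = 0.0
    m4: float = 0.0

    def add(self, x: float) -> None:
        # One-pass update; m4 and m3 must be updated before m2.
        n1 = self.n
        self.n += 1
        n = self.n
        delta = x - self.mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        self.m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
                    + 6 * delta_n2 * self.m2 - 4 * delta_n * self.m3)
        self.m3 += term1 * delta_n * (n - 2) - 3 * delta_n * self.m2
        self.m2 += term1


def merge(a: Moments, b: Moments) -> Moments:
    """Pairwise-combine two partial aggregates, as a parallel GROUP BY would."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    d = b.mean - a.mean
    d2, d3, d4 = d * d, d ** 3, d ** 4
    m2 = a.m2 + b.m2 + d2 * a.n * b.n / n
    m3 = (a.m3 + b.m3 + d3 * a.n * b.n * (a.n - b.n) / n ** 2
          + 3 * d * (a.n * b.m2 - b.n * a.m2) / n)
    m4 = (a.m4 + b.m4
          + d4 * a.n * b.n * (a.n ** 2 - a.n * b.n + b.n ** 2) / n ** 3
          + 6 * d2 * (a.n ** 2 * b.m2 + b.n ** 2 * a.m2) / n ** 2
          + 4 * d * (a.n * b.m3 - b.n * a.m3) / n)
    return Moments(n, a.mean + d * b.n / n, m2, m3, m4)


def sample_skewness(m: Moments):
    # None stands in for SQL null: fewer than 3 values or zero variance.
    if m.n < 3 or m.m2 == 0.0:
        return None
    g1 = math.sqrt(m.n) * m.m3 / m.m2 ** 1.5
    return g1 * math.sqrt(m.n * (m.n - 1)) / (m.n - 2)


def sample_excess_kurtosis(m: Moments):
    # Assumption: excess kurtosis with the usual bias correction.
    if m.n < 4 or m.m2 == 0.0:
        return None
    g2 = m.n * m.m4 / (m.m2 * m.m2) - 3.0
    return (m.n - 1) / ((m.n - 2) * (m.n - 3)) * ((m.n + 1) * g2 + 6.0)
```

Because merge only needs the five fields of each partial aggregate, each worker can accumulate its own Moments over a slice of rows and the results can be combined in any order.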
Improvements
This improvement introduces an internal continuation runtime built on
jdk.internal.vm.Continuation that allows SQL functions to yield their carrier worker thread while waiting, so the number of concurrent calls is no longer bounded by the worker pool size. The wait_wal_table(table_name [, seq_txn]) function blocks the current query until the WAL writer has applied transactions up to a target sequencer transaction number, returning true on success. It respects the SQL circuit breaker for query timeout, explicit CANCEL QUERY, and broken client connections. The sleep(seconds) function pauses the current query for a given duration (up to 24 hours) and returns a single TIMESTAMP row. Both functions release their worker carrier while parked. JDK ThreadLocal is replaced with a carrier-keyed CarrierLocal along continuation-critical paths to prevent stale thread identity across yield/resume boundaries. Two new configuration properties are available: cairo.timer.shards controls the number of daemon threads driving deadline-based wakeups, and griffin.query.continuation.wake.interval sets the millisecond interval for circuit breaker probing during park.
INSERT INTO trades VALUES (now(), 'AAPL', 100, 150.0);
SELECT wait_wal_table('trades');
This improvement extends the encoded radix sort path to every column type that
ORDER BY accepts and to keys of any width, across both full-sort and serial/parallel top-K paths. Previously, only fixed-width columns fitting within 32 bytes used the fast radix sort; everything else fell back to a red-black tree sort with O(log n) comparator calls per row. The SortKeyEncoder now normalizes each sort column into a byte-comparable segment: VARCHAR uses shifted UTF-8 bytes with a terminator, STRING and non-static SYMBOL use UTF-16BE with escaping, and UUID/LONG256 use unsigned big-endian with an order-preserving null remap. Variable-width keys use a 16-byte inline prefix with overflow into a key heap, and the native kernel finishes tied partitions with pdqsort. Top-K builds entries in an EncodedTopKBuffer that rejects rows up front when their leading word is beyond the kept boundary, skipping full key encoding. Benchmarks on the ClickBench hits table show improvements from 52ms to 37ms at LIMIT 10 and from 100ms to 85ms at LIMIT 100000. The key heap is bounded by cairo.sql.sort.key.max.bytes, and the tree path remains available via cairo.sql.orderby.sort.enabled=false. This change also fixes a pre-existing bug where generateCastFunctions had no LONG128 case, causing UNION queries with both a SYMBOL column requiring casting and a LONG128 column to throw UnsupportedOperationException.
This improvement addresses persistent native memory consumption from cached parallel
GROUP BY, top-K, and ASOF/window join factories. Previously, AsyncFilterContext.clear() freed page-frame memory pools but left per-worker DirectLongList row-id buffers at their peak size (up to 8 MB each at the default 1M row page frame), only freeing them on query-cache eviction. Now clear() calls resetCapacity() on the owner and per-worker lists, returning them to the 256-entry initial capacity. The tradeoff is one reallocation when a cached factory is reused for another large scan. This improvement also includes a safety fix for a native memory leak: DirectLongList.resetCapacity() could re-malloc a closed list when clear() was called after close(), which occurred in the four horizon-join factories on a failed getCursor(). The fix adds a guard to skip closed lists and reorders cursor/frameSequence cleanup in horizon-join factories to match the GROUP BY/TopK factories.
This improvement eliminates redundant per-row evaluation of subexpressions whose value is invariant across all rows of a cursor but not known at compile time. A typical example is a time-window threshold like
dateadd('d', -30, to_timezone(now(), 'Asia/Kolkata')) inside a CASE branch or filter predicate, where now() is already cached but the surrounding function calls were re-evaluated on every row. A new RuntimeConstFunction wrapper evaluates the subtree once during init(), caches the result in a primitive field, and serves it from every getter. FunctionParser wraps only the maximal runtime-constant subtree at function boundaries, avoiding double-wrapping. Trivial runtime-constant leaves such as bind variables and now() that already cache their values are skipped to avoid unnecessary indirection. Only fixed-width scalar types (timestamp, long, int, double, date, boolean, ipv4, geo, uuid, etc.) are folded; variable-length types are left for future work. The wrapper delegates toPlan() to its argument, so EXPLAIN output is unchanged.
This improvement reduces redundant I/O during checkpoint and snapshot restore by collapsing per-partition Parquet metadata sidecar (
_pm) processing from up to two maps and three CRC verifications down to a single map and single CRC per partition, all executed in parallel. Previously, the serial validation pass dominated restore startup on cold, object, or network storage. The recovery pool now submits one worker per thread, each pulling partitions from a shared cursor with dynamic load balancing and reusing native scratch objects across partitions. The committed-size truncation check on data.parquet runs inside the worker, removing the O(N) serial ff.length() loop. Additionally, the non-Parquet bitmap index rebuild now parallelizes across individual (partition, column) work items rather than whole partitions, so a non-partitioned table with several indexed symbol columns no longer serializes all index rebuilds onto one thread. A behavioral tradeoff is that fail-fast detection of truncated captures now occurs at the parallel drain point rather than strictly before sibling work starts, though the first failing worker trips a shared abort latch that short-circuits all other workers.
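The shared-cursor and abort-latch pattern described above can be illustrated with a small Python sketch. It is a generic model under stated assumptions, not the engine's recovery pool: restore_partitions, the process callback and the scratch buffer are hypothetical names, and Python threads stand in for the native worker pool.

```python
import threading


def restore_partitions(partitions, process, worker_count=4):
    """Run process(partition, scratch) over all partitions in parallel.

    Workers pull the next index from a shared cursor, so faster workers
    naturally take more partitions. The first failure trips an abort latch
    that stops the remaining workers, and is re-raised at the drain point.
    """
    lock = threading.Lock()
    abort = threading.Event()   # shared abort latch
    cursor = [0]                # shared cursor, guarded by lock
    done, errors = [], []

    def worker():
        scratch = bytearray(64)  # per-worker scratch, reused across partitions
        while not abort.is_set():
            with lock:
                i = cursor[0]
                cursor[0] += 1
            if i >= len(partitions):
                return
            try:
                process(partitions[i], scratch)
                with lock:
                    done.append(partitions[i])
            except Exception as exc:
                with lock:
                    errors.append(exc)
                abort.set()      # short-circuit sibling workers
                return

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                 # parallel drain point
    if errors:
        raise errors[0]
    return done
```

In this sketch a failure surfaces only after the join, which mirrors the tradeoff noted above: sibling workers may already have started before the first failure is observed.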
Bug Fixes
This fix addresses a bug where a primary snapshot could capture a Parquet partition whose on-disk
data.parquet was a later generation than the committed Parquet size in _txn. An in-place O3 rewrite (partitionMutates=false) grows the file before _txn advances. When the replica restored that torn pair and replayed the O3 merge over the partition, the stale _pm caused the merge to decode a column chunk past the committed size, resulting in a "File out of specification: Column chunk range exceeds data length" error that suspended the replica table. This fix regenerates a partition's _pm during snapshot/checkpoint restore when the restored data.parquet is longer than the committed size.
This fix addresses an out-of-memory condition in the posting index seal and rollback paths on memory-constrained instances ingesting skewed symbol columns. The
sealIncremental buffers were sized at DENSE_STRIDE (256) * preAllocPerKey, resulting in up to 256x over-allocation on skewed columns. This fix right-sizes the buffers to the actual dirty-stride aggregate, adds RSS pre-flights to both the seal and rollback paths, and streams every rollback cover shape per-key instead of decoding the whole index into one buffer. The fix also closes several latent safety issues including native out-of-bounds writes in unbounded posting decoders on corrupt or torn input, and a seal-purge reuse-race that could delete a live file. The tradeoff is that the incremental-seal pre-flight is conservative and may defer to the slower full seal when the incremental path would have just fit, and the streaming rollback re-derives each surviving key rather than holding the whole index, trading per-key CPU on the rare rollback path for a peak-memory bound.
A snapshot could capture a Parquet partition whose on-disk
data.parquet was a later generation than the committed Parquet size recorded in _txn. The restore path trusted the partition's existing _pm sidecar in that state and left it stale, causing the first read or merge after restore to fail with a "Column chunk range exceeds data length" error. On a replica replaying the WAL over the restored partition, this suspended the table and broke replication. The root cause was that the validation only checked whether the sidecar could resolve a footer at the committed size, but a _pm from the later generation still resolved successfully despite being stale. This fix now only trusts an existing _pm when data.parquet is exactly the committed size. When the file is longer (indicating the snapshot captured the partition mid in-place rewrite), the _pm is regenerated from the committed size, which produces a correct compact sidecar for the committed footer.
This fix addresses a production out-of-memory incident on a memory-constrained instance ingesting a skewed symbol column. The incremental seal path (
sealIncremental) sized its per-stride merge and trial buffers as if all 256 keys in a stride held the single hottest key's row count, inflating allocation by up to ~256x beyond actual data. The buffers are now sized to the actual aggregate of dirty strides computed from per-key counts, and a pre-flight defers to the full seal (which streams per key) when the correctly sized buffers would still breach the RSS limit. The rollback path previously decoded the entire partition index into one buffer before filtering, which was the exact memory-heavy case that triggered the incident. Rollback now streams per key for every cutoff, bounded by the largest single key rather than the whole index. All covering index shapes (fixed-size, var-size, and addr-based) now stream their sidecar rebuilds. Error handling was hardened across seal and rollback paths: failures after the value-file switch poison the writer so all mutating entry points reject further operations, while failures before the switch clean up staged files and restore writer state. Several latent safety bugs were also fixed, including unbounded native writes in posting decoders on corrupt input (now rejected with a clean error) and a seal-purge reuse race that could delete a live file after a sealTxn was freed and republished.This fix addresses a remaining out-of-memory path in the posting index incremental seal that was not covered by the prior streaming-fallback RSS pre-flight. The incremental seal snapshots each covered column's entire sealed sidecar into native memory before the stride-merge pre-flight runs, so a skewed symbol with a covered (
INCLUDE) column could still trigger an unguarded multi-gigabyte malloc that exceeded the global RSS limit during a WAL fast-lag commit. The fix adds the same headroom pre-flight before the snapshot copy, re-reading RSS usage per cover column to account for snapshots already taken. When the copy would not fit the live headroom, the incremental path is abandoned and the full seal is used instead, which rebuilds every sidecar by streaming per-key from the column files with peak memory bounded by the largest single key rather than the entire sidecar.
This fix resolves an intermittent
NullPointerException in CoveringIndexRecordCursorFactory.getCursor that occurred when multiple queries ran concurrently, such as all panels of a Grafana dashboard firing at once over PostgreSQL Wire Protocol. The factory stored a direct reference to the compiler's pooled keyValueFuncs list from IntrinsicModel. Since SQL compilers are pooled and shared across threads, a second thread borrowing the same compiler would clear that shared list (via IntrinsicModel.clear()), nulling out slots while the first thread's factory was still reading them during getCursor(). The fix creates an owned copy of the list in the constructor, preserving the same Function instances which the factory already owns and frees on close. This restores the contract already honored by FilterOnValuesRecordCursorFactory, whose key-value list parameter is consumed at construction rather than retained.
Equality and
IN filters on a renamed column of a Parquet partition could return too few rows (often zero), silently dropping valid data. The defect was in Parquet row-group bloom-filter pushdown, which resolved the filtered column against the Parquet file's column names. Those names are frozen when the partition is converted to Parquet, so a later RENAME COLUMN left them stale. When another column already carried the query's current name (common after a chain of renames), the pushdown landed on the wrong Parquet column, checked that column's bloom filter, got a false negative, and skipped the entire row group. This fix makes the native-table pushdown resolve columns by stable column id, the same way the data-decode path already does. read_parquet() cursors retain name-based resolution, which is the correct semantics for arbitrary external files without QuestDB column ids. Pruning effectiveness is unchanged for the common non-renamed case, and the bloom filter format, statistics, and decode paths are untouched.
Many query factories only consulted the execution context's circuit breaker inside their row-processing loop, which never runs for queries that produce no rows (empty tables, no-match filters, empty joins/aggregates), for instant single-row results, or for parallel paths that dispatch nothing when the input is empty. This fix ensures every affected factory consults the breaker at the point its work starts: at the top of
hasNext() before the empty/advance guard, before build/aggregate loops, at cursor open for shared empty-cursor singletons, and before dispatchAndAwait() on parallel paths. A new time-throttled breaker variant always tests cancellation and timeout (both cheap, no syscall) and throttles only the heavy connection probe syscall by elapsed wall-clock time, defaulting to a 100 ms window. The throttle state lives on the breaker shared across the whole query, so a CROSS JOIN that re-scans the slave once per master row issues at most one probe per window for the entire query. Every query now pays for one real breaker check on its first consultation, including a single connection-probe syscall. Queries that previously ran to completion under an aggressive timeout can now abort as expected, including catalogue and admin listings.
A
SAMPLE BY query using first() or last() aggregation with an indexed SYMBOL column filter and ALIGN TO FIRST OBSERVATION could return a bucket timestamp where an aggregated value was expected. The wrong results appeared only when the designated timestamp column was omitted from the SELECT list, and only for buckets in the middle of the result set; the first and last buckets were always correct. The root cause was that SampleByFirstLastRecordCursorFactory emitted middle rows through a data record that special-cased the bucket-timestamp column using the base-table timestamp index, while the column it was handed is a projection index. The two indexes coincide only when the timestamp is projected at the same position it occupies in the base scan. When the timestamp was not projected, the base index pointed at an unrelated projected column and overwrote it with the bucket timestamp. This fix makes the data record use the same projection index as the boundary record, so when the timestamp is not projected, no column matches the special case and every aggregate reports its stored value.
Subtracting a
CHAR or STRING literal from a timestamp, such as SELECT * FROM t WHERE timestamp > now() - '1' LIMIT 1, failed with an internal error. The literal was implicitly bound to the LONG operand slot of the timestamp subtraction operator, but the factory read the right operand with getTimestamp() instead of getLong(). Calling getTimestamp() on a CharFunction throws UnsupportedOperationException, which surfaced as an internal error. This fix changes SubTimestampFunctionFactory to read the right operand with getLong(), mirroring the symmetric AddLongToTimestampFunctionFactory which already did this correctly. As a result, - and + now produce identical, well-typed cast errors for non-numeric literals. For example, now() - '3 day' now reports inconvertible value: '3 day' [STRING -> LONG] instead of the previous [STRING -> TIMESTAMP].
Two malformed expressions involving
CASE ... END crashed with internal errors instead of returning syntax errors. A binary operator with a missing right operand directly after CASE ... END (e.g., select sum(case when true then 1 else 0 end & )) caused a NullPointerException because the expression parser's END keyword handler left a stale depth counter that defeated the arity guard, allowing a tree node with a null left-hand side to be built. This fix resets the depth counter to 0 after the flush loop so the arity guard fires correctly, producing the error too few arguments for '&' [found=1,expected=2]. A dot directly after CASE ... END (e.g., select case when true then 1 else 0 end.foo) caused either a NullPointerException or ClassCastException because the dot handler assumed a gluable literal token on the operator stack. This fix extends the token only when the stack top is a literal; otherwise it raises the syntax error '.' is unexpected here. Valid CASE expressions and qualified table.column references are unaffected by both fixes.
This fix addresses a storage-engine corruption that occurred when a WAL
REPLACE_RANGE commit in O3 mode appended brand-new partitions above the table's previous last partition. The root cause was in TableWriter.processO3Block, where replace mode derived partitionTimestampHi from o3TimestampMax (the replace-range high boundary) rather than the highest timestamp actually written. For an open-ended range, the boundary is Long.MAX_VALUE - 1, causing getCurrentPartitionMaxTimestamp to overflow. As a result, partitionTimestampHi stayed at the stale pre-commit ceiling, finishO3Commit skipped switching the writer's active column files to the new last partition, and the next commit reused the previous partition's column descriptors, producing rows dated below the new partition's floor or suspending the table. This fix derives partitionTimestampHi from the actual highest written timestamp (replaceMaxTimestamp) in replace mode, leaving the non-replace path unchanged.
This fix resolves a data-ordering bug where a non-WAL writer ingesting out-of-order data could produce out-of-order rows when the in-order prefix crossed a partition boundary. The partition switch sealed the earlier partition in memory but did not persist
_txn. On the next lag commit, o3MoveUncommitted reclaimed the active partition's uncommitted in-order tail into the O3 buffer and reset maxTimestamp to the durable _txn value, which predated the switch. The high-water mark ended up below committed data, causing a rollback to reload it and the next O3 commit to reorder the partition tail. This fix ensures maxTimestamp resets to the max timestamp of the last sealed partition by introducing a new variable that is updated whenever a partition switch happens.
This fix addresses a data-ordering bug in the non-WAL out-of-order (O3) commit path where a lag commit (
TableWriter.ic()) could leave the table's maxTimestamp below the maximum timestamp actually committed to disk. When uncommitted rows of a lag commit spanned more than the active partition, o3MoveUncommitted() pulled only the active partition's rows into the sorted O3 batch and emptied it. The in-order rows left on disk in the previous partition stayed committed but were not part of the sorted O3 batch, so maxTimestamp was recomputed solely from the O3 batch boundary and could end up lower than the true on-disk maximum. A later O3 commit then merged against that stale boundary and left a single timestamp inversion. This fix reads the earlier partition's actual max timestamp via readPartitionMinMaxTimestamps() and folds it into the committed maxTimestamp computation when the active partition is emptied while uncommitted rows remain in an earlier partition.
This fix resolves a non-deterministic metadata reload failure where
MemoryCMRImpl.of() read errno after a cleanup close() call when a file's length could not be read. Since errno is thread-local global state, the intervening close() could overwrite the errno left by the failed length() call. The exception then carried the cleanup call's errno instead of the real one. This mattered because metadata reload classifies failures by errno: TableUtils.handleMetadataLoadException() retries the read only while CairoException.isFileCannotRead() is true. With a clobbered errno, a transient "file does not exist" condition (such as a _meta file briefly unreadable during a concurrent metadata change) was misclassified as fatal, surfacing a "could not get length" error instead of retrying. This fix captures errno immediately after the failed length() call, before close() runs.
This fix encompasses a broad set of query engine hardening changes. Key fixes include:
SAMPLE BY FILL(LINEAR) cleanup no longer crashes with NullPointerException on out-of-memory; covering posting index readers now propagate sidecar I/O errors instead of swallowing them; IntervalBwdPartitionFrameCursor.calculateSize() now returns correct counts for backward interval scans with open lower bounds; DATE-argument window value functions (max, min, first_value, last_value, nth_value, lag, lead) no longer throw UnsupportedOperationException; rank()/dense_rank() no longer crash when pass-through columns of unserializable types (UUID, STRING, VARCHAR, LONG256, arrays) are projected; lead() over high-precision DECIMAL no longer crashes; window RANGE-frame functions now return correct values when the designated timestamp is not in the projection; sum()/avg() over Decimal256 now produce correct results over sliding frames by using scale-agnostic subtraction for eviction; JIT filter now correctly widens nested INT products to 64 bits when a FLOAT operand is present in the predicate; LimitedSizeLongTreeChain cleanup no longer crashes on partial construction failure; keyed ASOF JOIN sink-heap no longer leaks on out-of-memory during cursor open; vectorized rosti GROUP BY no longer over-frees NATIVE_ROSTI memory when wrapUp() resizes the map; stale aggregate tasks in the shared vector-aggregate queue no longer crash later queries after a fault aborts a cursor drain; GROUP BY keys referencing aliases of trivial arithmetic expressions now compile correctly; multi-key index scan no longer leaks posting-index block buffers on out-of-memory; and window functions over untyped null literals now raise a clean SqlException suggesting a concrete cast instead of behaving non-deterministically across platforms.
This fix addresses two distinct faults where per-partition code paths iterating posting columns were guarded only against the absent case (
columnTop == -1), not the row-less case (columnTop >= partition size). In the first fault, restorePostingIndexersToLastPartition() would crash with "index does not exist" when encountering a row-less column with no .pk key file, causing WAL table suspension or INSERT failures on BYPASS WAL tables. The fix discriminates on the key file's actual presence rather than row-less-ness, correctly re-pointing indexers for row-less columns that do have a .pk (created via ADD COLUMN ... INDEX TYPE POSTING on the active partition) while skipping those without one. In the second fault, linkPartitionIndexFiles() would crash with "index files do not exist" when switching a partition to parquet and encountering a row-less column without a key file. The fix aligns the guard with its sibling copyOrRebuildColumnIndexes() to skip row-less columns on historic partitions where a .pk is never built. The row-less-without-.pk state arises from multi-partition, same-transaction out-of-order writes interleaved with ADD COLUMN. Without the first fix, a blanket row-less skip would silently strand indexers, making rows invisible to indexed predicates with no exception or suspension. The accessor was also changed from getColumnTopQuick to getColumnTop to correctly return -1 for absent columns.
This fix resolves a production bug where a query cancellation could be silently dropped, and the affected query's per-query timeout defeated along with it, when the cancel races query startup. The root cause was that
NetworkSqlExecutionCircuitBreaker.cancel() always sets a powerUpTime == Long.MIN_VALUE sentinel but only flips the per-query cancelled flag when that flag is already attached. QueryRegistry.register() publishes the query and fires its listener before binding the per-query flag, so a CancelRequest landing in that window sets only the sentinel. The subsequent register() call then binds a fresh flag with value false, and neither testTimeout() nor testCancelled() consulted the sentinel: testTimeout() would compute an overflowed negative runtime that never trips, and testCancelled() only checked the flag. The query would run to natural completion ignoring both cancel and timeout, pinning workers. The fix makes testCancelled() check the sentinel first so both stateful paths abort a racing cancel, updates getState() to classify the sentinel as STATE_CANCELLED instead of mislabelling it STATE_TIMEOUT, and adds a clearCancelSentinel() call in PGConnectionContext.prepareForNewQuery() to bound the sentinel to a single query and prevent leaking across requests.
This fix addresses several issues where HTTP parser and response objects were not fully resetting their per-request state when connections and contexts were reused across requests. A reused context could hit closed native parsing or compression buffers and crash the server when handling otherwise valid requests. Specifically, the fix reopens the header parser's quoted-value sink when pooled HTTP contexts are reused, clears per-request gzip negotiation state in
HttpResponseSink so clients only receive gzip responses when the current request advertises gzip support, clears charset and mapped cookie state in HttpHeaderParser.clear(), fixes multipart false-boundary replay so body bytes are preserved when a boundary-like sequence is followed by non-boundary suffixes, keeps multipart resume pointers anchored to the receive buffer during retry paths, parses Content-Disposition parameters with quote-aware delimiter handling including semicolons and equals signs inside quoted values, and resets Content-Disposition parameter state consistently after known and unknown parameters. This allows uploaded filenames such as a;b.csv and a=b;c.csv to parse correctly.
This fix addresses numerical-stability bugs in the
corr() function where the Pearson denominator sqrt(sumXX * sumYY) can overflow or underflow while computing the product of two sums of squared deviations. With large-magnitude inputs (values near +/-1e153), each sum is finite (~1e306) but their product overflows to +Infinity, causing the final division to return 0.0 instead of the true correlation. With small-magnitude inputs (values near +/-1e-150), each sum is finite (~1e-300) but their product underflows to 0.0, causing the division to return NaN. The fix prefers the single-rounding sqrt(a * b) denominator when the product is finite and non-zero, preserving existing bit-exact behavior for normal inputs, and falls back to sqrt(a) * sqrt(b) when the product would overflow or underflow while both factors are non-zero. The final Pearson result is clamped to [-1, 1] to absorb small rounding drift in the fallback path. This applies to CorrGroupByFunctionFactory, AbstractBivariateStatWindowFunctionFactory.computeCorr, and AbstractBivariateStatWindowFunctionFactory.computeCorrWelford. Normal-magnitude inputs are unaffected and keep their prior bit-exact results.
This fix addresses a production issue where a materialized view's incremental refresh could be permanently invalidated by transient errors such as reader pool exhaustion (
EntryUnavailableException) or out-of-memory conditions, cascading invalidation up dependent view chains. On transient "table busy" or OOM errors, the refresh now schedules a per-view backoff deadline and returns without invalidating, with MatViewTimerJob re-enqueuing the refresh once the backoff elapses. A consecutive retry counter caps retries at a configurable limit (default 10 via cairo.mat.view.refresh.busy.retry.limit), after which the view is invalidated to bound WAL retention. The materialized_views.view_status column now returns retrying while a view is in a transient-refresh backoff window. A new cairo.mat.view.refresh.block.list configuration option accepts a comma-separated list of materialized view names that the refresh job must never refresh, serving as an escape hatch for views whose refresh crashes or destabilizes the database. Listed views are skipped by all refresh paths without being invalidated. The OOM path no longer calls Os.sleep between step-halving attempts, preventing the single refresh worker from being blocked. Genuine errors such as bad SQL, type mismatches, or dropped base tables still invalidate immediately. Configuration options include cairo.mat.view.refresh.busy.retry.timeout (default 1000ms backoff between retries) and cairo.mat.view.refresh.busy.retry.limit (default 10 consecutive retries before invalidating). A blocked view grows stale and can pin base-table WAL retention until dropped or unblocked.
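The retry behavior described above amounts to a small per-view state machine, which the following Python sketch models. It is illustrative only: ViewState, TransientError and try_refresh are hypothetical names, the clock is passed in as now_ms, and invalidating on the failure after the limit is reached (rather than on the limit itself) is an assumption. The defaults mirror the documented values of 10 retries and a 1000 ms backoff.

```python
from dataclasses import dataclass


class TransientError(Exception):
    """Stands in for 'table busy' or out-of-memory conditions."""


@dataclass
class ViewState:
    status: str = "valid"      # valid | retrying | invalid
    retries: int = 0           # consecutive transient failures
    next_attempt_ms: int = 0   # backoff deadline


def try_refresh(view_name, state, refresh, now_ms,
                retry_limit=10, backoff_ms=1000, block_list=frozenset()):
    """Attempt one refresh and return the resulting outcome string."""
    if view_name in block_list:
        return "skipped"       # never refreshed, never invalidated
    if state.status == "invalid":
        return "invalid"
    if now_ms < state.next_attempt_ms:
        return "backoff"       # deadline not reached; a timer re-enqueues later
    try:
        refresh()
    except TransientError:
        state.retries += 1
        if state.retries > retry_limit:
            state.status = "invalid"   # bound WAL retention
        else:
            state.status = "retrying"
            state.next_attempt_ms = now_ms + backoff_ms
        return state.status
    except Exception:
        state.status = "invalid"       # genuine errors invalidate immediately
        return "invalid"
    # Success clears the consecutive-failure counter and the deadline.
    state.status, state.retries, state.next_attempt_ms = "valid", 0, 0
    return "valid"
```

Returning without sleeping keeps the caller free to service other views, which is the same reason the OOM path avoids blocking the single refresh worker.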