Database replication operations
QuestDB Enterprise supports high availability through primary-replica replication and point-in-time recovery.
This document walks you through setting up database replication.
If the cluster is already running, enabling replication requires minimal steps:
- Create and configure object storage for Write Ahead Log (WAL) files in AWS, Azure, or NFS
- Enable a primary node and upload WAL files to the object storage
- Take a data snapshot of the primary node
- Configure a replica node, then either restore it from the snapshot or let it sync from the WAL files
If the cluster is new and not already running:
- Create and configure object storage for Write Ahead Log (WAL) files in AWS, Azure, or NFS
- Enable a primary node and upload WAL files to the object storage
- Enable one or more replica nodes
Before you begin the setup process, consider reading Replication concepts.
Setup object storage
Choose where you intend to store replication data:
Our goal is to build a string value for the `replication.object.store` key within `server.conf`.
Azure Blob Storage
Set up storage in Azure and retrieve values for:
- `STORE_ACCOUNT`
- `BLOB_CONTAINER`
- `STORE_KEY`
First, follow Azure documentation to create a Storage Account.
There are some important considerations. To strike the right balance of latency, durability, and cost, be sure to:
- Select a geographical location close to the primary QuestDB node to reduce the network latency
- Choose optimal redundancy and performance options according to Microsoft
- Disable blob versioning
Keep your `STORE_ACCOUNT` value.
Next, set up Lifecycle Management for the blobs produced by replication. There are considerations to ensure cost-effective WAL file storage. For further information, see the object store expiration policy section.
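As a sketch of such a policy, the Azure CLI snippet below tiers replication blobs to cool storage after 7 days and deletes them after 30; the rule name, resource group, and retention periods are illustrative placeholders, so align them with your own snapshot schedule before use:

```bash
# Hypothetical lifecycle policy: cool replication blobs after 7 days,
# delete them after 30. Adjust periods to match your snapshot schedule.
cat > lifecycle-policy.json <<'EOF'
{
  "rules": [
    {
      "name": "questdb-wal-expiry",
      "enabled": true,
      "type": "Lifecycle",
      "definition": {
        "filters": { "blobTypes": ["blockBlob"] },
        "actions": {
          "baseBlob": {
            "tierToCool": { "daysAfterModificationGreaterThan": 7 },
            "delete": { "daysAfterModificationGreaterThan": 30 }
          }
        }
      }
    }
  ]
}
EOF

# <your-resource-group> is a placeholder for the resource group of STORE_ACCOUNT
az storage account management-policy create \
  --account-name "${STORE_ACCOUNT}" \
  --resource-group "<your-resource-group>" \
  --policy @lifecycle-policy.json
```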
After that, create a Blob Container to be the root of your replicated data blobs.
It will soon be referenced as the `BLOB_CONTAINER` variable.
Finally, save the Account Key. It will be used to configure the QuestDB primary node as `STORE_KEY`.
In total, from Azure you will have retrieved:
- `STORE_ACCOUNT`
- `BLOB_CONTAINER`
- `STORE_KEY`
The value provided to `replication.object.store` is thus:

```
azblob::endpoint=https://${STORE_ACCOUNT}.blob.core.windows.net;container=${BLOB_CONTAINER};root=${DB_INSTANCE_NAME};account_name=${STORE_ACCOUNT};account_key=${STORE_KEY};
```
The value of `DB_INSTANCE_NAME` can be any unique alphanumeric string, and it may also include dashes (`-`).
Be sure to use the same name across all the primary and replica nodes within the replication cluster.
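For illustration, with a hypothetical storage account `myquestdbstore`, container `questdb-replication`, and instance name `questdb-prod-1`, the resulting `server.conf` entry would look roughly like this (the account key is abbreviated and all values are placeholders):

```ini
# Example only - substitute your own account, container, key and instance name
replication.object.store=azblob::endpoint=https://myquestdbstore.blob.core.windows.net;container=questdb-replication;root=questdb-prod-1;account_name=myquestdbstore;account_key=3Wd...Zg==;
```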
With your values, skip to the Setup database replication section.
Amazon AWS S3
Our goal is to set up AWS S3 storage and retrieve:
- `BUCKET_NAME`
- `AWS_REGION`
- `AWS_ACCESS_KEY` (optional)
- `AWS_SECRET_ACCESS_KEY` (optional)
First, create an S3 bucket as described in the AWS documentation. The bucket's name and region become your `BUCKET_NAME` and `AWS_REGION`. Prepare your `AWS_ACCESS_KEY` and `AWS_SECRET_ACCESS_KEY` if needed, depending on how you manage AWS credentials.
There are some important considerations. To strike the right balance of latency, durability, and cost, be sure to:
- Select a geographical location close to the primary QuestDB node to reduce the network latency
- Choose optimal redundancy and performance options according to Amazon
- Disable bucket versioning
Finally, set up a bucket lifecycle configuration policy to clean up WAL files after a period of time. This keeps WAL file storage cost-effective. For deeper background, see the object store expiration policy section.
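As a sketch, such a lifecycle rule could be applied with the AWS CLI as follows; the rule ID and 30-day expiration are placeholders, and a transition to a cooler storage class can be added as well, subject to that class's minimum storage-duration rules:

```bash
# Hypothetical lifecycle rule: expire replication (WAL) objects 30 days
# after creation. Align the period with your snapshot schedule.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "questdb-wal-expiry",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": 30 }
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
  --bucket "${BUCKET_NAME}" \
  --lifecycle-configuration file://lifecycle.json
```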
We have now prepared the following:
- `BUCKET_NAME`
- `AWS_REGION`
- `AWS_ACCESS_KEY` (optional)
- `AWS_SECRET_ACCESS_KEY` (optional)
And created a value for `replication.object.store`:

```
s3::bucket=${BUCKET_NAME};root=${DB_INSTANCE_NAME};region=${AWS_REGION};access_key_id=${AWS_ACCESS_KEY};secret_access_key=${AWS_SECRET_ACCESS_KEY};
```
The value of `DB_INSTANCE_NAME` can be any unique alphanumeric string, and it may also include dashes (`-`).
Be sure to use the same name across all the primary and replica nodes within the replication cluster.
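For illustration, with a hypothetical bucket `questdb-replication` in `eu-west-1` and instance name `questdb-prod-1`, the resulting `server.conf` entry would look roughly like this (the credentials shown are placeholders; as noted above, they may be omitted if you manage AWS credentials another way):

```ini
# Example only - substitute your own bucket, region, credentials and instance name
replication.object.store=s3::bucket=questdb-replication;root=questdb-prod-1;region=eu-west-1;access_key_id=AKIAXXXXXXXXXXXXXXXX;secret_access_key=xxxxxxxxxxxxxxxxxxxx;
```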
With your values, continue to the Setup database replication section.
NFS
Set up your NFS server and mount the shared file system on the primary and any replicas. Make sure the user starting QuestDB has read and write permissions for the shared mount.
There are some important considerations. To strike the right balance of latency, durability, and cost, be sure to:
- Locate the NFS server geographically close to the primary QuestDB node to reduce network latency
- Choose optimal redundancy and performance options
There are considerations to ensure cost-effective WAL file storage. For further information, see the object store expiration policy section.
Replication via NFS uses two folders: one for the WAL files, and one for temporary (scratch) files. The two folders will be created on the primary's startup if they don't exist. It is important that both folders are under the same NFS mount, as otherwise object writes might get corrupted.
The value provided to `replication.object.store` is thus:

```
fs::root=/mnt/nfs_replication/final;atomic_write_dir=/mnt/nfs_replication/scratch;
```
The example above uses `/mnt/nfs_replication` as the NFS mountpoint. Please adjust it on the primary and any replicas to match your local configuration.
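A sketch of preparing the mount on each node might look like the following; the server name, export path, mount point, and the `questdb` system user are placeholders, and mount options should follow your NFS administrator's guidance:

```bash
# Hypothetical example: mount the shared export on the primary and each replica
sudo mkdir -p /mnt/nfs_replication
sudo mount -t nfs nfs-server.example.com:/exports/questdb /mnt/nfs_replication

# The user running QuestDB needs read/write access to the mount;
# the final/ and scratch/ folders are otherwise created on primary startup.
sudo chown -R questdb:questdb /mnt/nfs_replication
```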
With your values, skip to the Setup database replication section.
Setup database replication
Make the following changes in each node's `server.conf` file:
- Enable a primary node to upload to object storage
- Set replica(s) to download from object storage
Set up a primary node
| Setting | Description |
| --- | --- |
| `replication.role` | Set to `primary`. |
| `replication.object.store` | Created based on provider specifications; the result of the object storage setup sections above. |
| `cairo.snapshot.instance.id` | Unique UUID of the primary node. |
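Taken together, a primary node's `server.conf` might contain a sketch like the following; the object-store string and UUID are placeholders to replace with your own values:

```ini
# Example primary configuration - all values are placeholders
replication.role=primary
replication.object.store=s3::bucket=questdb-replication;root=questdb-prod-1;region=eu-west-1;access_key_id=...;secret_access_key=...;
cairo.snapshot.instance.id=f9b7a1e2-5c3d-4e8a-9b6f-1a2b3c4d5e6f
```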
After the node is configured for replication, restart QuestDB.
At this point, create a database snapshot.
Snapshot frequency affects the effectiveness of your replication strategy.
To help you determine the right schedule, see the snapshot schedule section.
Now that a primary is configured, set up a replica - or two, or three, or more!
Set up replica node(s)
Create a new QuestDB instance.
Set the following `server.conf` properties:
| Setting | Value |
| --- | --- |
| `replication.role` | Set to `replica`. |
| `replication.object.store` | The same string used on the primary node. |
| `cairo.snapshot.instance.id` | Unique UUID of the replica node. |
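For example, a replica's `server.conf` might look like the sketch below; note that the object-store string matches the primary's, while the instance id differs:

```ini
# Example replica configuration - all values are placeholders
replication.role=replica
replication.object.store=s3::bucket=questdb-replication;root=questdb-prod-1;region=eu-west-1;access_key_id=...;secret_access_key=...;
cairo.snapshot.instance.id=0c1d2e3f-4a5b-6c7d-8e9f-0a1b2c3d4e5f
```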
Please do not copy `server.conf` files from the primary node when creating the replica. Pointing two nodes at the same `replication.object.store` stream while both act as primary will break the replication setup.
After the blank replica database is created, restore the `db` directory from the snapshot taken on the primary node. Then start the replica node.
The replica will download changes and catch up with the primary node.
This concludes a walkthrough of basic replication.
For full configuration details, see the next section.
To learn more about the roadmap, architecture and topology types, see the Replication concept page.
Configuration
The following presents all available configuration and tuning options.
All replication configuration is kept in the same `server.conf` file as all other database settings.
These settings can be sensitive - especially the Azure account key.
Consider providing them through environment variables. QuestDB reads any setting from an environment variable named `QDB_` followed by the setting name in upper case, with dots replaced by underscores.
For example, to supply the object store setting via an environment variable:

```bash
export QDB_REPLICATION_OBJECT_STORE="azblob::DefaultEndPointsProtocol..."
```
Once settings are changed, stop and restart the database.
Note that replication is performed by the database process itself.
There is no need to start external agents, register cron jobs or similar.
Replication settings
Read our in-depth Replication tuning Guide for more information.
Some of these settings alter resource usage.
Replication is implemented using a set of worker threads for IO operations.
The defaults should be appropriate in most cases.
| Property | Default | Reloadable | Description |
| --- | --- | --- | --- |
| `replication.role` | `none` | No | Defaults to `none`, which disables replication for the instance. Set to `primary` or `replica` to enable replication. |
| `replication.object.store` | | No | A configuration string that allows connecting to an object store. The format is `scheme::key1=value1;key2=value2;…`. The various keys and values are detailed in a later section. Ignored if replication is disabled. There is no default, as the value is specific to each deployment. |
| `cairo.wal.segment.rollover.size` | `2097152` | No | The size of the WAL segment before it is rolled over. Default is 2 MiB (2097152 bytes). |
| `cairo.writer.command.queue.capacity` | `32` | No | Maximum writer ALTER TABLE and replication command capacity. Shared between all the tables. |
| `replication.primary.throttle.window.duration` | `10000` | No | The millisecond duration of the sliding window used to process replication batches. Default is 10000 (10 seconds). |
| `replication.requests.max.concurrent` | `0` | No | A limit to the number of concurrent object store requests. The default is 0: unlimited. |
| `replication.requests.retry.attempts` | `3` | No | Maximum number of times to retry a failed object store request before logging an error and reattempting later after a delay. Default is 3. |
| `replication.requests.retry.interval` | `200` | No | How long to wait before retrying a failed operation. Default is 200ms. |
| `replication.primary.compression.threads` | calculated | No | Maximum number of threads used to perform file compression operations before uploading to the object store. The default value is calculated as half the number of CPU cores. |
| `replication.primary.compression.level` | `1` | No | Zstd compression level. Defaults to 1. |
| `replication.replica.poll.interval` | `1000` | No | Millisecond polling rate of a replica instance to check for the availability of new changes. |
| `native.async.io.threads` | cpuCount | No | The number of async (network) IO threads used for replication (and, in the future, cold storage). The default should be appropriate for most use cases. |
| `native.max.blocking.threads` | cpuCount * 4 | No | Maximum number of threads for parallel blocking disk IO read/write operations for replication (and other tasks). These threads are ephemeral: they are spawned as needed and shut down after a short duration if no longer in use. They are not CPU-bound, hence the relatively large number. The default should be appropriate for most use cases. |
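As an illustration only, an operator might adjust a few of these values in `server.conf`; the numbers below are a sketch to benchmark against, not recommendations:

```ini
# Illustrative values only - benchmark before adopting.
# On the primary: trade a little CPU for smaller uploads.
replication.primary.compression.level=3
# Tolerate a less reliable connection to the object store.
replication.requests.retry.attempts=5
# On replicas: poll for new changes more frequently.
replication.replica.poll.interval=500
```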
Snapshot schedule and object store expiration policy
Replication files are typically read by replica nodes shortly after upload from the primary node. After initial access, these files are rarely used unless a new replica node starts. To optimize costs, we suggest moving files to cooler storage tiers using expiration policies after 1-7 days. These tiers are more cost-effective for long-term storage of infrequently accessed files.
We recommend:
- Set up periodic primary node snapshots on a 1-7 day interval
- Keep Write Ahead Log (WAL) files in the object store for at least 30 days
Taking snapshots every 7 days and storing WAL files for 30 days guarantees that the database can be restored to any point within the last 23 days (the WAL retention minus the snapshot interval). Extending WAL storage to 60 days increases this window to 53 days.
Ensure snapshot intervals are shorter than WAL expiration for successful data restoration. Shorter intervals also speed up database rebuilding after a failure. For instance, weekly snapshots can take up to 7 times longer to restore than daily ones due to the computational and IO demands of applying WAL files.
For systems with high daily data injection, daily snapshots are recommended. Infrequent snapshots or long snapshot periods, such as 60 days with 30-day WAL expiration, may prevent successful database restoration.
Disaster recovery
QuestDB can fail in a number of ways, some recoverable, and some unrecoverable.
In general, we can group them into a small matrix:
| | recoverable | unrecoverable |
| --- | --- | --- |
| primary | restart primary | promote replica, create new replica |
| replica | restart replica | destroy and recreate replica |
To successfully recover from serious failures, we strongly advise that operators:
- Follow best practices
- Regularly back up data
Network partitions
Temporary network partitions introduce delays between when data is written to the primary, and when it becomes available for read in the replica. A temporary network partition is not necessarily a problem.
For example, data can be ingested into the primary when the object-store is not available. In this case, the replicas will contain stale data, and then catch-up when the primary reconnects and successfully uploads to the object store.
Permanent network partitions are not recoverable, and the emergency primary migration flow should be followed.
Instance crashes
An instance crash may be recoverable or unrecoverable, depending on the specific cause of the crash. If the instance crashes during ingestion, then it is possible for transactions to be corrupted. This will lead to a table suspension on restart.
To recover in this case, you can skip the transaction, and reload any missing data.
In the event that the corruption is severe, or confidence in the underlying instance is removed, you should follow the emergency primary migration flow.
Disk or block storage failure
Disk failures can present in several forms, which may be difficult to detect.
Look for the following symptoms:
- High latency for reads and writes
  - This could be a failing disk, which will need replacing
  - Alternatively, it could be caused by under-provisioned IOPS, and need upgrading
- Disk not available/unmounted
  - This could be a configuration issue between your server and storage
  - Alternatively, this could indicate a complete drive failure
- Data corruption reported by the database (i.e. you see suspended tables)
  - This is usually caused by writes to disk partially or completely failing
  - This can also be caused by running out of disk space
As with an instance crash, the consequences can be far-reaching and not immediately clear in all cases.
To migrate to a new disk, follow the emergency primary migration flow. When you create a new replica, you can populate it with the latest snapshot you have taken, and then recover the rest using replicated WALs in the object store.
Flows
Planned primary migration
Use this flow when you want to change your primary to another instance, but the primary has not failed.
The database can be started in a mode which disallows further ingestion, but allows replication. With this method, you can ensure that all outstanding data has been replicated before you start ingesting into a new primary instance.
- Ensure primary instance is still capable of replicating data to the object store
- Stop primary instance
- Restart primary instance with `replication.role=primary-catchup-uploads`
- Wait for the instance to complete its uploads and exit with code `0`
- Then follow the emergency primary migration flow
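A sketch of this flow on the outgoing primary, assuming a systemd-managed service named `questdb` with its configuration under `/var/lib/questdb/conf` (both placeholders; adapt to your deployment):

```bash
# Stop the current primary (placeholder service name)
sudo systemctl stop questdb

# Switch the instance into catch-up mode
sudo sed -i 's/^replication.role=primary$/replication.role=primary-catchup-uploads/' \
  /var/lib/questdb/conf/server.conf

# Start it again: in this mode it only uploads outstanding WAL data, then exits.
sudo systemctl start questdb
sudo journalctl -u questdb -f   # watch for the shutdown message

# An exit code of 0 means all committed data reached the object store
systemctl show -p ExecMainStatus questdb
```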
Emergency primary migration
Use this flow when you wish to discard a failed primary instance and move to a new one.
- Stop primary instance, and ensure it cannot restart
- Stop the replica instance
- Set `replication.role=primary` on the replica
- Ensure other primary-related settings are configured appropriately
  - for example, snapshotting policies
- Create an empty `_migrate_primary` file in your database installation directory (i.e. the parent of `conf` and `db`)
- Start the replica instance, which is now the new primary
- Create a new replica instance to replace the promoted replica
Any data committed to the primary, but not yet replicated, will be lost. If the primary has not completely failed, you can follow the planned primary migration flow to ensure that all remaining data has been replicated before switching primary.
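On the replica being promoted, the steps above might look like this sketch; the service name and the `/var/lib/questdb` installation directory (the parent of `conf` and `db`) are placeholders:

```bash
# Stop the replica that is being promoted (placeholder service name)
sudo systemctl stop questdb

# Promote it: replica -> primary
sudo sed -i 's/^replication.role=replica$/replication.role=primary/' \
  /var/lib/questdb/conf/server.conf

# Create the empty marker file in the installation directory
sudo touch /var/lib/questdb/_migrate_primary

# Start the instance again - it now acts as the new primary
sudo systemctl start questdb
```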
When could migration fail?
Two primaries started within the same `replication.primary.keepalive.interval` window (default `10s`) may still break replication.
It is important not to migrate the primary without stopping the first primary if it is still within this interval.
This setting can be configured in the range of 1 to 300 seconds.
Point-in-time recovery
Create a QuestDB instance matching a specific historical point in time.
This builds a new instance from a recent snapshot plus WAL data in the object store.
It can also be used if you wish to remove the latest transactions from the database, or if you encounter corrupted transactions (though replicating a corrupt transaction has never been observed).
Flow
- (Recommended) Locate a recent primary instance snapshot that predates your intended recovery timestamp.
  - A snapshot taken after your intended recovery timestamp will not work.
- Create the new primary instance, ideally from a snapshot, and ensure it is not running.
- Touch a `_recover_point_in_time` file.
- Inside this file, add a `replication.object.store` setting pointing to the object store you wish to load transactions from.
- Also add a `replication.recovery.timestamp` setting with the UTC time to which you would like to recover.
  - The format is `YYYY-MM-DDThh:mm:ss.mmmZ`.
- (Optional) Configure replication settings in `server.conf` pointing at a new object store location.
  - You can either configure this instance as a standalone (non-replicated) instance, or
  - Configure it as a new primary by setting `replication.role=primary`. In this case, the `replication.object.store` must point to a fresh, empty location.
- If you have created the new primary using a snapshot, touch a `_restore` file to trigger the snapshot recovery process.
  - More details can be found in the backup and restore documentation.
- Start the new primary instance.
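For illustration, the `_recover_point_in_time` file might contain something like the sketch below, assuming it is created in the database installation directory alongside `conf` and `db`; the object-store string and timestamp are placeholders, and the comment lines are only for readability:

```ini
# _recover_point_in_time - example contents (placeholders)
replication.object.store=s3::bucket=questdb-replication;root=questdb-prod-1;region=eu-west-1;access_key_id=...;secret_access_key=...;
replication.recovery.timestamp=2024-06-01T00:00:00.000Z
```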
Multi-primary ingestion
QuestDB Enterprise supports multi-primary ingestion, where multiple primaries can write to the same database.
See the Multi-primary ingestion page for more information.