diff --git a/documentation/operations/backup.md b/documentation/operations/backup.md index d99d9c5da..dfa21aa67 100644 --- a/documentation/operations/backup.md +++ b/documentation/operations/backup.md @@ -6,67 +6,341 @@ description: --- You should back up QuestDB to be prepared for the case where your original -database or data is lost, or if your database or table is corrupted. The backup -& restore process is also necessary to create -[replica instances](/docs/high-availability/setup/) in QuestDB Enterprise. +database or data is lost, or if your database or table is corrupted. Backups are +also required to create [replica instances](/docs/high-availability/setup/) in +QuestDB Enterprise. ## Overview -To perform a backup, follow these steps: +QuestDB supports two backup methods: -1. Issue SQL: `CHECKPOINT CREATE`, which activates the special `CHECKPOINT` mode - of QuestDB -2. Create a copy of the QuestDB root directory -3. Issue SQL: `CHECKPOINT RELEASE`, bringing QuestDB back to regular operation +- **Built-in incremental backup** (Enterprise only): Fully automated—configure + once, set a schedule, and backups run automatically. Provides point-in-time + recovery out of the box with no manual steps required. -When in the `CHECKPOINT` mode, QuestDB remains available for both reads and -writes. However, some housekeeping tasks are paused. While this is safe in -principle, database writes may consume more space than normal. When the database -exits `CHECKPOINT` mode, it will resume the housekeeping tasks and quickly -reclaim the disk space. +- **[Manual checkpoint backup](#questdb-oss-manual-backups-with-checkpoints)** + (OSS and Enterprise): Relies on external tools to copy data. Requires manual + coordination: `CHECKPOINT CREATE` → copy data with external tools → + `CHECKPOINT RELEASE`. Works well with cloud disk snapshots (AWS EBS, Azure + disks, etc.) where you simply trigger a snapshot. 
For on-premises environments + without snapshot capabilities, you'll need external tools or custom scripts + (e.g., rsync), which do not provide point-in-time recovery. -In the second step above, you must create a copy of the database using a tool of -your choice. These are some suggestions: +## QuestDB Enterprise: built-in backup and restore + +QuestDB Enterprise provides an incremental backup system that stores your data +in object storage. Backups are incremental—only changed data is uploaded—making +them fast and bandwidth-efficient. You can monitor progress and check for errors +while backups run. + +:::note + +See [Limitations](#limitations) before running your first backup. + +::: + +### Quick start + +Minimal configuration to enable backups: + +```conf +backup.enabled=true +backup.object.store=s3::bucket=my-bucket;region=eu-west-1;access_key_id=...;secret_access_key=...; +``` + +Then run `BACKUP DATABASE;` in SQL. See [Run a backup](#run-a-backup) for details. + +### Configure + +See [Configure object storage](/docs/high-availability/setup/#1-configure-object-storage) +for connection string format. + +#### Scheduled backups + +You can configure automatic scheduled backups using cron syntax. The example +below runs a backup every day at midnight UTC. + +```conf +backup.schedule.cron=0 0 * * * +backup.schedule.tz=UTC +``` + +The `backup.schedule.tz` property accepts any valid +IANA timezone name +(e.g., `America/New_York`, `Europe/London`) or `UTC`. 
+ +These settings can be modified in `server.conf` and hot-reloaded without +restarting the server: + +```questdb-sql +SELECT reload_config(); +``` + +#### Backup retention + +Control how many backups to keep before automatic cleanup removes older ones: + +```conf +backup.cleanup.keep.latest.n=7 +``` + +#### Filesystem backups + +For local testing or air-gapped environments, you can back up to a local +filesystem path instead of cloud object storage: + +```conf +backup.object.store=fs::root=/mnt/backups;atomic_write_dir=/mnt/backups/atomic; +``` + +The `atomic_write_dir` parameter is required for filesystem backends and +specifies a directory for atomic write operations during backup. + +#### Configuration reference + +| Property | Description | Default | +|----------|-------------|---------| +| `backup.enabled` | Enable backup functionality | `false` | +| `backup.object.store` | Object store connection string | None (required) | +| `backup.schedule.cron` | Cron expression for scheduled backups | None (manual only) | +| `backup.schedule.tz` | IANA timezone for cron schedule | `UTC` | +| `backup.cleanup.keep.latest.n` | Number of backups to retain | `5` | +| `backup.compression.level` | Compression level (1-22) | `5` | +| `backup.compression.threads` | Threads for compression | CPU count | + +### Performance characteristics + +Backup is designed to prioritize database availability over backup speed. Key +characteristics: + +- **Pressure-sensitive**: Backup automatically throttles itself to avoid + overwhelming the database instance during normal operations +- **Batch uploads**: Data uploads in batches rather than continuously - you may + see surges of activity followed by quieter periods in logs +- **Compressed**: Data is compressed before upload to reduce transfer time and + storage costs +- **Multi-threaded**: Backup uses multiple threads but is deliberately + throttled to maintain instance reliability + +Backup duration depends on data size. 
Large databases (1TB+) may take several +hours for a full initial backup. Subsequent incremental backups are faster as +only changed data is uploaded. + +### Limitations + +- **One backup at a time**: Only one backup can run at any given time. Starting + a new backup while one is running will return an error. +- **Primary and replica backups are separate**: Each QuestDB instance has its + own [`backup_instance_name`](#finding-your-instance-name), so backing up both + a primary and its replica creates two separate backup sets in the object + store. Typically, backing up the primary is sufficient since replicas sync + from the same data. + +### Run a backup + +Once configured, you can run a backup at any time using the following command: + +```questdb-sql title="Backup database" +BACKUP DATABASE; +``` + +Example output: + +| backup_timestamp | +| ----------------------------- | +| 2024-08-24T12:34:56.789123Z | + +### Monitor and abort + +You can monitor backup progress and history using the `backups()` table function: + +```questdb-sql title="Backup history" +SELECT * FROM backups(); +``` + +Example output: + +| status | progress_percent | start_ts | end_ts | backup_error | cleanup_error | +|---------------------|------------------|-----------------------------|-----------------------------|------------------|---------------| +| backup complete | 100 | 2025-07-30T12:49:30.554262Z | 2025-07-30T16:19:48.554262Z | | | +| backup complete | 100 | 2025-08-06T14:15:22.882130Z | 2025-08-06T17:09:57.882130Z | | | +| backup failed | 35 | 2025-08-20T11:58:03.675219Z | 2025-08-20T12:14:07.675219Z | connection error | | +| backup in progress | 10 | 2025-08-27T15:42:18.281907Z | | | | +| cleanup in progress | 100 | 2025-08-13T13:37:41.103729Z | 2025-08-13T16:44:25.103729Z | | | + +Status values: + +| Status | Meaning | Action | +|-----------------------|----------------------------------|---------------------------------| +| `backup in progress` | Backup is currently running | 
Wait or run `BACKUP ABORT` | +| `backup complete` | Backup finished successfully | None required | +| `backup failed` | Backup encountered an error | Check `backup_error` column | +| `cleanup in progress` | Old backup data is being removed | Wait for completion | +| `cleanup complete` | Cleanup finished successfully | None required | +| `cleanup failed` | Cleanup encountered an error | Check `cleanup_error` column | + +To abort a running backup: + +```questdb-sql title="Abort backup" +BACKUP ABORT; +``` + +### Restore + +:::caution + +Enterprise backup restore uses a different trigger file (`_backup_restore`) than +OSS checkpoint restore (`_restore`). Do not confuse these two mechanisms. + +::: + +To restore from an object store backup, create a `_backup_restore` file in the +QuestDB install root. This is a properties file with the object store +configuration and optional selector fields. On startup, QuestDB reads this file, +selects the requested backup timestamp (or the latest available), downloads the +backup data, and reconstructs the local database state. + +```conf +backup.object.store=s3::bucket=my-bucket;region=eu-west-1;access_key_id=...;secret_access_key=...; +backup.instance.name=gentle-forest-orchid +backup.restore.timestamp=2024-08-24T12:34:56.789123Z +``` + +Parameters: + +| Parameter | Required | Description | +|-----------|----------|-------------| +| `backup.object.store` | Sometimes | Object store connection string; required unless already specified in `server.conf` | +| `backup.instance.name` | Sometimes | Required when multiple instance names exist in the bucket | +| `backup.restore.timestamp` | No | Specific backup to restore; omit for latest | + +#### Finding your instance name + +Each QuestDB instance has an auto-generated backup instance name (three random +words like `gentle-forest-orchid`). This name organizes backups in the object +store under `backup//`. 
+ +To find your instance name: + +- **File system**: Read `/db/.backup_instance_name` +- **Object store**: List directories under `backup/` in your bucket +- **Error message**: If you omit `backup.instance.name` when multiple instances + exist, the error message lists available options + +The `backup.instance.name` parameter is only required when multiple QuestDB +instances share the same object store. If only one instance exists, it is +detected automatically. + +:::warning + +Restore requires an empty database directory. If the target database already +has data (indicated by the presence of `db/.data_id`), restore fails with: +"The local database is not empty." Use a fresh installation directory for +restore operations. + +::: + +Restart QuestDB. If restore succeeds, `_backup_restore` is removed automatically. + +#### Restore failure recovery + +If restore fails, QuestDB creates artifacts to help diagnose and recover: + +| Artifact | Purpose | +|----------|---------| +| `.restore_failed/` | Directory containing tables that failed to restore | +| `_restore_failed` | File listing the names of failed tables | + +To recover from a failed restore: + +1. Check the `.restore_failed/` directory and `_restore_failed` file for details +2. Investigate and fix the underlying issue (connectivity, permissions, etc.) +3. Remove both `.restore_failed/` directory and `_restore_failed` file +4. Restart QuestDB to retry the restore + +If you see the error "Failed restore directory found", a previous restore +attempt failed. Remove the artifacts listed above before restarting. + +### Create a replica from a backup + +You can use a backup to bootstrap a new replica instance instead of relying +solely on WAL replay from the object store. This is faster when the backup is +more recent than the oldest available WAL data. + +1. 
**Ensure the primary is running and has replication configured** + + The primary must have `replication.role=primary` and a configured + `replication.object.store`. + +2. **Create a `_backup_restore` file on the new replica machine** + + Point it to the same backup location used by the primary: + + ```conf + backup.object.store=s3::bucket=my-bucket;region=eu-west-1;access_key_id=...;secret_access_key=...; + backup.instance.name=gentle-forest-orchid + ``` + +3. **Configure the replica** + + Set `replication.role=replica` and ensure `replication.object.store` points + to the same object store as the primary. + +4. **Start the replica** + + QuestDB restores from the backup first, then switches to WAL replay to catch + up with the primary. + +For more details on replication setup, see the +[replication guide](/docs/high-availability/setup/). + +### Troubleshooting + +If you encounter errors during backup or restore: + +- **ER007 - Data ID mismatch**: The local database and object store have + different Data IDs. See [error code ER007](/docs/troubleshooting/error-codes/#er007) + for resolution steps. +- **Backup stuck at 0%**: Check network connectivity to the object store and + verify credentials are correct. +- **"Failed restore directory found"**: A previous restore attempt failed. + Remove the `.restore_failed/` directory and `_restore_failed` file, then + restart. See [Restore failure recovery](#restore-failure-recovery). +- **"The local database is not empty"**: Restore requires an empty database + directory. Use a fresh installation or remove the existing `db/` directory. + +## QuestDB OSS: manual backups with checkpoints + +The OSS workflow relies on the `CHECKPOINT` mode and external snapshot or file +copy tools. When in `CHECKPOINT` mode, QuestDB remains available for reads and +writes, but some housekeeping tasks are paused. This is safe in principle, but +database writes may consume more space than normal. 
When the database exits +`CHECKPOINT` mode, it resumes the housekeeping tasks and reclaims disk space. + +You must create a copy of the database using a tool of your choice. These are +some suggestions: - Cloud snapshot, e.g. EBS volume snapshot on AWS, Premium SSD Disk snapshot on Azure etc - On-prem backup tools and software you typically use - Basic command line tools, such as `cp` or `rsync` -To recover the database, follow these steps: - -1. Restore the QuestDB root directory from the backup copy -2. Create an empty `_restore` trigger file in the QuestDB root directory -3. Start QuestDB as usual - -If the trigger file is present in the root directory, QuestDB performs the -recovery process on startup. If successful, the process deletes the trigger -file, so it won't perform recovery in future restarts. Should recovery fail, -QuestDB will exit with an error, and the trigger file will remain in place. - -## Data backup checklist +### Data backup checklist Before backing up QuestDB, consider these items: -### Pick a good time +#### Pick a good time We recommend that teams take a database backup when the database write load is at its lowest. If the database is under constant write load, a helpful workaround is to ensure that the disk has at least 50% free space. The more free space, the safer it is to enter the checkpoint mode. -### Determine backup frequency +#### Determine backup frequency -We recommend daily backups. +We recommend daily backups for disaster recovery purposes. -If you are using QuestDB Enterprise, the frequency of backups impacts the time -it takes to create a new [replica instance](/docs/high-availability/setup/). -Creating replicas involves choosing a backup and having the replica replay WAL -files until it has caught up. The older the backup, the more WAL files the -replica will have to replay, and thus there is a longer time-frame. For these -reasons, we recommend a daily backup schedule to keep the process rapid. 
- -### Choose your data copy method +#### Choose your data copy method When choosing the right copy method, consider the following goals: @@ -77,13 +351,13 @@ QuestDB backup lends itself relatively well to all types of differential data copying. Due to time partitioning, older data is often unmodified, at both block and file levels. -#### Cloud snapshots +##### Cloud snapshots If you're using cloud disks, such as EBS on AWS, SSD on Azure, or similar, we strongly recommend using their existing cloud _snapshot_ infrastructure. The advantages of this approach are that: -- Cloud snapshots minimizes the time QuestDB spends in checkpoint mode +- Cloud snapshots minimize the time QuestDB spends in checkpoint mode - Cloud snapshots are differential and can be restored cleanly See the following guides for volume snapshot creation on the following cloud @@ -102,7 +376,7 @@ steps: 1. Take a snapshot 2. Back up the snapshot -**Exit the `CHECKPOINT` mode as soon the snapshoting stage is complete.** +**Exit the `CHECKPOINT` mode as soon as the snapshotting stage is complete.** Specifically, exit checkpoint mode at the following snapshot stage: @@ -112,13 +386,13 @@ Specifically, exit checkpoint mode at the following snapshot stage: | **Amazon Web Services** (AWS) | PENDING | When status is PENDING | | **Microsoft Azure** | PENDING | Before the longer running "CREATING" stage | -#### Volume snapshots +##### Volume snapshots -When the database is on-prem, we recommend using the existing file system backup -tools. Volume snapshots by, for example, can be taken via LVM: -([LVM](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/logical_volume_manager_administration/lvm_overview)). +When the database is on-prem, we recommend using existing file system backup +tools. For example, volume snapshots can be taken via +[LVM](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/logical_volume_manager_administration/lvm_overview). 
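
Whichever snapshot tool you use, the snapshot must be taken between `CHECKPOINT CREATE` and `CHECKPOINT RELEASE`, and the release must run even if the snapshot fails. The following is a minimal sketch of that wrapper, not an official tool. It assumes QuestDB's REST `/exec` endpoint is reachable at `http://localhost:9000` and accepts the `CHECKPOINT` statements. The helper names (`run_sql`, `checkpointed_snapshot`, `lvm_snapshot`), the volume path `/dev/vg0/questdb`, the snapshot name, and the snapshot size are all hypothetical placeholders.

```python
import subprocess
import urllib.parse
import urllib.request


def run_sql(query, base_url="http://localhost:9000"):
    """Send one SQL statement to the REST /exec endpoint (assumed address)."""
    url = f"{base_url}/exec?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def checkpointed_snapshot(take_snapshot, execute=run_sql):
    """Call take_snapshot() while the database is in checkpoint mode."""
    execute("CHECKPOINT CREATE")
    try:
        return take_snapshot()
    finally:
        # Leave checkpoint mode whether the snapshot succeeded or failed.
        execute("CHECKPOINT RELEASE")


def lvm_snapshot():
    """Hypothetical LVM snapshot; adjust the volume, name, and size."""
    subprocess.run(
        ["lvcreate", "--snapshot", "--size", "10G",
         "--name", "questdb_snap", "/dev/vg0/questdb"],
        check=True,
    )
```

Calling `checkpointed_snapshot(lvm_snapshot)` would then take the snapshot inside checkpoint mode. Passing the snapshot step as a callable keeps the checkpoint handling separate from the snapshot tool, so the same wrapper could drive a cloud snapshot call or a file copy instead.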
-#### File copy +##### File copy If filesystem or volume snapshots are not available, consider using a file copy method to back up the QuestDB server root directory. We recommend using a copy @@ -131,12 +405,12 @@ Leaving this step, you should know: - When to enter and exit checkpoint mode - How to perform your snapshot/backup method -## Steps in the backup procedure +### Steps in the backup procedure While explaining the steps, we'll assume the database root directory is `/var/lib/questdb`. -### Enter checkpoint mode +#### Enter checkpoint mode To enter the checkpoint mode: @@ -147,7 +421,7 @@ CHECKPOINT CREATE You can create only one checkpoint. Attempting to create a second checkpoint will fail. -### Check checkpoint status +#### Check checkpoint status You can double-check at any time that the database is in the checkpoint mode: @@ -158,7 +432,7 @@ SELECT * FROM checkpoint_status(); Having confirmed that QuestDB has entered the checkpoint mode, we now create the backup. -### Take a snapshot or begin file copy +#### Take a snapshot or begin file copy After a checkpoint is created and before it is released, you may safely access the file system using tools external to the database instance. In other words, @@ -177,7 +451,7 @@ mode. **It is very important to exit the checkpoint mode regardless of whether the copy operation succeeded or failed!** -### Exit checkpoint mode +#### Exit checkpoint mode With your backup complete, exit checkpoint mode: @@ -189,9 +463,16 @@ This concludes the backup process. Now, with our additional copy, we're ready to restore QuestDB. -## Restore to a saved checkpoint +### Restore to a saved checkpoint + +Restoring from a local checkpoint will restore the entire database. + +:::caution -Restoring to a checkpoint will restore the entire database. +OSS checkpoint restore uses the `_restore` trigger file. This is different from +Enterprise backup restore which uses `_backup_restore`. 
+ +::: Follow these steps: @@ -200,22 +481,40 @@ Follow these steps: - Touch the `_restore` file - Start the database using the restored root directory -### Database versions +#### Database versions Restoring data is only possible if the backup and restore QuestDB versions have the same major version number, for example: `8.1.0` and `8.1.1` are compatible. `8.1.0` and `7.5.1` are not compatible. -### Restore the root directory +#### Restore the root directory When using cloud tools, create a new disk from the snapshot. The entire disk contents of the original database will be available when the compute instance starts. +:::warning + +**AWS EBS lazy loading**: By default, EBS volumes created from snapshots load +data lazily (on first access), which can cause slow reads after restore. To +mitigate this: + +- **Option 1**: Enable [Fast Snapshot Restore (FSR)](https://docs.aws.amazon.com/ebs/latest/userguide/ebs-fast-snapshot-restore.html) + on the snapshot before creating the volume +- **Option 2**: Pre-warm the volume by reading all blocks after restore: + ```bash + sudo fio --filename=/dev/nvme1n1 --rw=read --bs=1M --iodepth=32 \ + --ioengine=libaio --direct=1 --name=volume-initialize + ``` + +This issue may also affect other cloud providers with similar snapshot behavior. + +::: + If you are not using cloud tools, you have to make sure that you restore the root from the backup using your own tools of choice! -### The trigger file +#### The trigger file When you are starting the database from the backup for the first time, the database must perform a restore procedure. This ensures the data is consistent @@ -229,7 +528,7 @@ the trick: touch /var/lib/questdb/_restore ``` -### Start the database +#### Start the database Start the database using the root directory as usual. When the `_restore` file is present, the database will perform the restore procedure. There are two @@ -260,5 +559,5 @@ section for more information. 
## Further reading -To learn more, see the -[`CHECKPOINT` SQL reference documentation](/docs/query/sql/checkpoint/). +- [`BACKUP` SQL reference](/docs/query/sql/backup/) - Enterprise backup command syntax +- [`CHECKPOINT` SQL reference](/docs/query/sql/checkpoint/) - OSS checkpoint command syntax diff --git a/documentation/query/sql/backup.md b/documentation/query/sql/backup.md new file mode 100644 index 000000000..42cb9bbc5 --- /dev/null +++ b/documentation/query/sql/backup.md @@ -0,0 +1,121 @@ +--- +title: BACKUP keyword +sidebar_label: BACKUP +description: "BACKUP SQL keyword reference documentation. Applies to QuestDB Enterprise." +--- + +`BACKUP` - start and abort incremental backups to object storage. + +:::note + +Backup operations are only available in QuestDB Enterprise. + +::: + +_Looking for a detailed guide on backup creation and restoration? Check out our +[Backup and Restore](/docs/operations/backup/) guide!_ + +## Syntax + +```questdb-sql +BACKUP DATABASE; + +BACKUP ABORT; +``` + +### BACKUP DATABASE + +Starts a new incremental backup. Returns immediately with the backup timestamp. +The backup runs asynchronously in the background. + +### BACKUP ABORT + +Aborts a running backup. 
Returns a single row: + +| Column | Type | Description | +|--------|------|-------------| +| `status` | `VARCHAR` | `aborted` or `not running` | +| `backup_id` | `TIMESTAMP` | Timestamp of aborted backup, or `NULL` | + +Example when backup was running: + +| status | backup_id | +|---------|-----------------------------| +| aborted | 2024-01-15T10:30:00.000000Z | + +Example when no backup was running: + +| status | backup_id | +|-------------|-----------| +| not running | NULL | + +## Monitoring backups + +Use the `backups()` table function to monitor backup progress and history: + +```questdb-sql +SELECT * FROM backups(); +``` + +Returns: + +| Column | Type | Description | +|--------|------|-------------| +| `status` | `VARCHAR` | Current status (see below) | +| `progress_percent` | `INT` | Completion percentage (0-100) | +| `start_ts` | `TIMESTAMP` | When the backup started | +| `end_ts` | `TIMESTAMP` | When the backup completed (NULL if running) | +| `backup_error` | `VARCHAR` | Error message if backup failed | +| `cleanup_error` | `VARCHAR` | Error message if cleanup failed | + +### Status values + +`backup in progress`, `backup complete`, `backup failed`, `cleanup in progress`, +`cleanup complete`, `cleanup failed` + +See [status values](/docs/operations/backup/#monitor-and-abort) in the Backup +guide for descriptions and recommended actions. + +## Examples + +Start a backup: + +```questdb-sql +BACKUP DATABASE; +``` + +Result: + +| backup_timestamp | +|-----------------------------| +| 2024-08-24T12:34:56.789123Z | + +Check current backup status: + +```questdb-sql +SELECT status, progress_percent FROM backups() ORDER BY start_ts DESC LIMIT 1; +``` + +## Configuration + +Backups must be configured before use. At minimum: + +```conf +backup.enabled=true +backup.object.store=s3::bucket=my-bucket;region=eu-west-1;... +``` + +See the [Backup and Restore guide](/docs/operations/backup/#configure) for full +configuration options. 
+ +## Limitations + +- Only one backup can run at a time +- Primary and replica backups are separate (each has its own `backup_instance_name`) + +## See also + +- [Backup and Restore guide](/docs/operations/backup/) - Complete backup + configuration and restore procedures +- [CHECKPOINT](/docs/query/sql/checkpoint/) - Manual checkpoint mode for + QuestDB OSS backups diff --git a/documentation/sidebars.js b/documentation/sidebars.js index 9b63e465d..33b1d4003 100644 --- a/documentation/sidebars.js +++ b/documentation/sidebars.js @@ -320,7 +320,9 @@ module.exports = { { id: "query/sql/acl/assume-service-account", type: "doc", - }, + customProps: { tag: "Enterprise" }, + }, + "query/sql/backup", "query/sql/cancel-query", "query/sql/checkpoint", "query/sql/compile-view", diff --git a/documentation/troubleshooting/error-codes.md b/documentation/troubleshooting/error-codes.md index 75e15dcc7..29914072f 100644 --- a/documentation/troubleshooting/error-codes.md +++ b/documentation/troubleshooting/error-codes.md @@ -101,6 +101,51 @@ You have the following options: * Reconfigure it as `replication.role=replica` and restart it * Perform a planned primary migration and resume the primary role on this instance +## ER007 + +This error indicates a Data ID mismatch between the local database and the backup or replication object store. + +Each QuestDB database has a unique Data ID (stored in `/db/.data_id`) that identifies it for backup and replication purposes. This error occurs when: + +* Attempting to back up to an object store that contains backups from a different database instance +* A replication object store contains data from a different primary instance + +To resolve: + +* **For backup creation**: Verify the `backup.object.store` points to the correct location for this database instance. If intentionally backing up to a new location, the location must be empty. 
+* **For replication**: Verify the `replication.object.store` configuration matches the primary instance that owns the data. + +Note: Restore operations check for an empty database separately. If the target +database already has a Data ID, restore fails with: "The local database is not +empty. It already has an associated data ID." + +### Starting fresh with a new Data ID + +A Data ID is automatically generated when QuestDB first starts with an empty +database. To intentionally start fresh: + +**Recommended**: Create a new, empty database directory and configure QuestDB +to use it. + +**Alternative**: Delete the existing `.data_id` file (stop QuestDB first): + +```bash +# Stop QuestDB first! +rm /db/.data_id +# Restart QuestDB - a new Data ID will be generated +``` + +:::warning + +Changing the Data ID on a database with existing data will: +- Break any existing replication configuration +- Make existing backups incompatible for restore +- Cause ER007 errors when connecting to the original object store + +::: + +See the [Backup and Restore guide](/docs/operations/backup/) for more information. + # Operating system error codes Refer to the [OS error codes](/docs/troubleshooting/os-error-codes/) page for any