Skip to content

Conversation

@pcholakov
Copy link
Contributor

This change adds a basic time interval configuration option for triggering automated snapshots. When both time and record-based intervals are set, the logic for the trigger condition is an "and" between the two, so the record interval becomes a minimum requirement for time-based snapshotting.

Time interval-based snapshotting will get a lot more useful with incremental snapshot support, but this is already a big improvement as it allows operators to configure snapshots in slightly more intuitive terms of elapsed time.

The time interval is measured from the created-at timestamp of the published snapshot, which is based on the wall clock time of the producing node. Clock drift error is assumed to be relatively negligible compared to typical snapshotting intervals. A small amount of jitter is added to each partition's snapshot age check to spread snapshot uploads out over time.

Configuration examples

Periodic + minimum number of records

[worker.snapshots]
destination = "s3://bucket/cluster-name"
snapshot-interval-num-records = 10000
snapshot-interval = "30 min"

Translates into: "snapshot every 30 minutes, as long as the applied LSN has advanced by least 10,000 records since the last snapshot".

Periodic (unconditional)

[worker.snapshots]
destination = "s3://bucket/cluster-name"
snapshot-interval = "24 hours"

Translates into: "snapshot once every 24 hours even if zero new changes are committed".

@github-actions
Copy link

github-actions bot commented Oct 21, 2025

Test Results

 5 files   -   2   5 suites   - 2   1m 37s ⏱️ - 1m 3s
34 tests  -  13  34 ✅  -  13  0 💤 ±0  0 ❌ ±0 
52 runs   - 148  52 ✅  - 148  0 💤 ±0  0 ❌ ±0 

Results for commit a15f9db. ± Comparison against base commit 1265be4.

This pull request removes 47 and adds 34 tests. Note that renamed tests count towards both.
dev.restate.sdktesting.tests.CallOrdering ‑ ordering(boolean[], Client)[1]
dev.restate.sdktesting.tests.CallOrdering ‑ ordering(boolean[], Client)[2]
dev.restate.sdktesting.tests.CallOrdering ‑ ordering(boolean[], Client)[3]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromAdminAPI(BlockingOperation, Client, URI)[1]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromAdminAPI(BlockingOperation, Client, URI)[2]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromAdminAPI(BlockingOperation, Client, URI)[3]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromContext(BlockingOperation, Client)[1]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromContext(BlockingOperation, Client)[2]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromContext(BlockingOperation, Client)[3]
dev.restate.sdktesting.tests.Combinators ‑ awakeableOrTimeoutUsingAwakeableTimeoutCommand(Client)
…
dev.restate.sdktesting.tests.AwakeableIngressEndpointTest ‑ completeWithFailure(Client)
dev.restate.sdktesting.tests.AwakeableIngressEndpointTest ‑ completeWithSuccess(Client)
dev.restate.sdktesting.tests.BackwardCompatibilityTest$NewVersion ‑ completeAwakeable(Client)
dev.restate.sdktesting.tests.BackwardCompatibilityTest$NewVersion ‑ completeRetryableOperation(Client)
dev.restate.sdktesting.tests.BackwardCompatibilityTest$NewVersion ‑ proxyCallShouldBeDone(Client)
dev.restate.sdktesting.tests.BackwardCompatibilityTest$NewVersion ‑ proxyOneWayCallShouldBeDone(Client)
dev.restate.sdktesting.tests.BackwardCompatibilityTest$OldVersion ‑ createAwakeable(Client)
dev.restate.sdktesting.tests.BackwardCompatibilityTest$OldVersion ‑ startOneWayProxyCall(Client)
dev.restate.sdktesting.tests.BackwardCompatibilityTest$OldVersion ‑ startProxyCall(Client)
dev.restate.sdktesting.tests.BackwardCompatibilityTest$OldVersion ‑ startRetryableOperation(Client)
…

♻️ This comment has been updated with latest results.

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating this PR @pcholakov. The changes look good to me. +1 for merging. Do we need to update our docs to also mention the snapshot_interval?

}

/// Retrieve the latest known LSN to be archived to the snapshot repository.
/// Response of `Ok(Lsn::INVALID)` indicates no existing snapshot for the partition.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe update the comment.

match self {
ArchivedLsn::None => Ok(Duration::MAX),
ArchivedLsn::Snapshot { created_at, .. } => {
SystemTime::now().duration_since(*created_at)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it important that we distinguish between created_at > now and created_at == now? If not, then we might get rid fo the error case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants