Skip to content

Conversation

@WenyXu
Copy link
Member

@WenyXu WenyXu commented Nov 19, 2025

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

#6558

What's changed and what's your intention?

This PR introduces EnterStagingRequest as a new region operation that enables safe partition rule changes by transitioning regions into staging mode with isolated manifest management. This allows zero-downtime partition expression updates while maintaining data consistency.

The EnterStagingRequest follows a carefully designed flow:

  1. Check Current State: If already in staging mode, validate partition expression matches
  2. Flush Memtable: If memtable contains data, flush it first with FlushReason::EnterStaging
  3. Clear Staging Directory(asynchronous): Remove any existing staging manifest files
  4. Create Staging Manifest(asynchronous): Write new manifest with updated partition expression
  5. Switch State: Transition region to Staging mode
  6. Resume Writes: Allow pending write requests to continue

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR requires documentation updates.
  • API changes are backward compatible.
  • Schema or data changes are backward compatible.

@github-actions github-actions bot added size/M docs-not-required This change does not impact docs. labels Nov 19, 2025
@WenyXu WenyXu force-pushed the feat/enter-staging branch from 7c4ec53 to 4400b66 Compare November 19, 2025 10:54
@WenyXu WenyXu mentioned this pull request Nov 19, 2025
21 tasks
@WenyXu WenyXu force-pushed the feat/enter-staging branch from 4400b66 to 7a3f9f6 Compare November 19, 2025 11:21
@github-actions github-actions bot added size/L and removed size/M labels Nov 19, 2025
@WenyXu WenyXu force-pushed the feat/enter-staging branch 2 times, most recently from fd2f7be to c512891 Compare November 19, 2025 11:30
@WenyXu WenyXu force-pushed the feat/enter-staging branch from c512891 to bb51c1d Compare November 19, 2025 11:41
@WenyXu WenyXu marked this pull request as ready for review November 19, 2025 11:41
@WenyXu WenyXu requested review from a team, evenyag, v0y4g3r and waynexia as code owners November 19, 2025 11:41
@evenyag evenyag requested a review from Copilot November 21, 2025 07:08
Copilot finished reviewing on behalf of evenyag November 21, 2025 07:12
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces the EnterStagingRequest API for the RegionEngine, enabling safe partition rule changes through a carefully orchestrated staging mode transition. The implementation adds a new intermediate EnteringStaging state that stalls writes during the transition, flushes memtables with the old partition expression, creates isolated staging manifests with the new partition expression, and then resumes writes in staging mode.

Key changes include:

  • New EnterStagingRequest API and RegionLeaderState::EnteringStaging state for safe partition expression updates
  • Modified manifest update signature to use is_staging: bool parameter instead of region state
  • Enhanced flush handling with new FlushReason::EnterStaging variant and is_staging parameter propagation

Reviewed Changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/store-api/src/region_request.rs Adds EnterStagingRequest struct and extends RegionRequest enum with EnterStaging variant
src/mito2/src/worker/handle_enter_staging.rs New file implementing core enter staging logic including request handling, manifest creation, and result processing
src/mito2/src/worker/handle_write.rs Adds write stalling support for EnteringStaging state to queue requests during transition
src/mito2/src/region.rs Adds EnteringStaging state, staging_partition_expr field, and state transition methods
src/mito2/src/manifest/manager.rs Changes update() signature to use is_staging boolean, adds staging manifest tracking and clear_staging_manifests()
src/mito2/src/manifest/checkpointer.rs Simplifies maybe_do_checkpoint() by removing state parameter (staging check moved to caller)
src/mito2/src/flush.rs Adds FlushReason::EnterStaging variant and is_staging field to RegionFlushTask
src/mito2/src/worker/handle_flush.rs Propagates is_staging parameter and updates flush handling to skip metadata updates in staging mode
src/mito2/src/worker/handle_manifest.rs Updates all update_manifest() calls to pass is_staging parameter
src/mito2/src/worker/handle_alter.rs Updates flush task creation to pass is_staging parameter
src/mito2/src/worker.rs Adds EnterStaging request handling and background notification processing
src/mito2/src/region/opener.rs Initializes staging_partition_expr field and refactors manifest options creation
src/mito2/src/manifest/storage.rs Adds manifest_dir() helper function
src/mito2/src/error.rs Adds StagingPartitionExprMismatch error for partition expression validation
src/mito2/src/engine/staging_test.rs Comprehensive tests for enter staging functionality including partition expression validation and write stalling
src/mito2/src/engine/listener.rs Adds NotifyEnterStagingResultListener for test coordination
src/mito2/src/request.rs Adds EnterStagingRequest to worker request types and EnterStagingResult for background notifications
src/metric-engine/src/engine.rs Adds EnterStaging request handling for physical regions
src/datanode/src/region_server.rs Maps EnterStaging request to RegionChange::None
Test files (multiple) Updates existing tests to use new manifest update signature

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

e;
"Failed to clear staging manifest files for region {}",
region.region_id
);
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When clearing staging manifests fails, the function returns early but the region is still in the EnteringStaging state (set on line 104). This leaves the region in an inconsistent state where it's marked as entering staging but the operation failed. The region state should be reverted back to Writable before returning, or this error should trigger a state cleanup.

Suggested change
);
);
// Revert region state to Writable before returning.
region.set_state(RegionLeaderState::Writable);

Copilot uses AI. Check for mistakes.

let version = region.version();
if !version.memtables.is_empty() {
// If memtable is empty, we can't enter staging directly and need to flush
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on line 64 has incorrect logic. It says "If memtable is empty, we can't enter staging directly" but the condition checks !version.memtables.is_empty(), which means it's checking if memtable is NOT empty. The comment should say "If memtable is not empty, we can't enter staging directly and need to flush all memtables first."

Suggested change
// If memtable is empty, we can't enter staging directly and need to flush
// If memtable is not empty, we can't enter staging directly and need to flush

Copilot uses AI. Check for mistakes.
};

if let Err(e) = region.switch_state_to_staging(RegionLeaderState::EnteringStaging) {
error!(e; "Failed to switch region state to entering staging");
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message on line 211 says "Failed to switch region state to entering staging", but at this point the function is actually switching from EnteringStaging to Staging state. The error message should say "Failed to switch region state to staging" to accurately reflect what's happening.

Suggested change
error!(e; "Failed to switch region state to entering staging");
error!(e; "Failed to switch region state to staging");

Copilot uses AI. Check for mistakes.
common_runtime::spawn_global(async move {
let mut manager = region.manifest_ctx.manifest_manager.write().await;
if let Err(e) = manager.clear_staging_manifests().await {
error!(e; "Failed to clear staging manifests after failed to switch region state to entering staging");
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message says "Failed to clear staging manifests after failed to switch region state to entering staging". This is misleading because the code is called when switching from EnteringStaging to Staging fails, not when entering the EnteringStaging state. It should say "Failed to clear staging manifests after failed to switch region state to staging".

Suggested change
error!(e; "Failed to clear staging manifests after failed to switch region state to entering staging");
error!(e; "Failed to clear staging manifests after failed to switch region state to staging");

Copilot uses AI. Check for mistakes.
normal_count_before, 1,
"Normal manifest directory should initially contain one file"
// One file for table creation
// One files for flush operation
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammatical error: "One files" should be "One file" (singular form).

Suggested change
// One files for flush operation
// One file for flush operation

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs-not-required This change does not impact docs. size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant