Skip to content

[fix](cloud) Add predecessor instance id#62276

Open
w41ter wants to merge 1 commit intoapache:masterfrom
w41ter:snapshot/fix/add_predecessor_instance_id
Open

[fix](cloud) Add predecessor instance id#62276
w41ter wants to merge 1 commit intoapache:masterfrom
w41ter:snapshot/fix/add_predecessor_instance_id

Conversation

@w41ter
Copy link
Copy Markdown
Contributor

@w41ter w41ter commented Apr 9, 2026

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

When a rollback exists, the predecessor instance is not the same as the source instance. This change separates the two semantics into distinct fields for clarity.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

When a rollback exists, the predecessor instance is not the same as
the source instance. This change separates the two semantics into
distinct fields for clarity.
@w41ter w41ter added the cloud label Apr 9, 2026
@w41ter w41ter requested a review from gavinchou as a code owner April 9, 2026 08:15
@w41ter w41ter added the snapshot label Apr 9, 2026
@w41ter w41ter requested a review from dataroaring as a code owner April 9, 2026 08:15
@Thearas
Copy link
Copy Markdown
Contributor

Thearas commented Apr 9, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@w41ter
Copy link
Copy Markdown
Contributor Author

w41ter commented Apr 9, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 1 issue.

  1. cloud/src/resource-manager/resource_manager.cpp: refresh_instance() now evicts rollback predecessor cache entries only when predecessor_instance_id is populated, but no producer writes that field in this PR or in the existing rollback flow. Current rollback instances still carry source_instance_id and original_instance_id, so refreshing the new rollback target leaves the predecessor's nodes in node_info_ until a separate refresh of the predecessor arrives. During that window get_node() returns both entries for the same cloud_unique_id, and get_instance_id() logs a one-to-many warning before picking the last entry.

Critical checkpoint conclusions:

  • Goal of the task: Partially met. The proto now distinguishes snapshot source from rollback predecessor, but the runtime refresh path is not wired end-to-end because the new field is never populated.
  • Scope/minimality: The change is small and focused, but incomplete because only one consumer was updated.
  • Concurrency: refresh_instance() remains protected by mtx_; I did not find a new lock-order or heavy-work-under-lock regression in this patch.
  • Lifecycle/static initialization: Not applicable for this change.
  • Configuration: No config changes.
  • Compatibility: The added proto field is wire-compatible, but the parallel rollback path still relies on the old field semantics, so mixed producer/consumer behavior is inconsistent even within this tree.
  • Parallel code paths: Not all related rollback paths were updated; the predecessor field has no writer yet.
  • Special conditional checks: The new conditional is reasonable only if predecessor_instance_id is guaranteed to be present, which is not true today.
  • Test coverage: Insufficient. No test was added or updated to cover producer plus consumer behavior for predecessor_instance_id; the existing rollback test still sets only source_instance_id and original_instance_id.
  • Observability: Existing warnings in get_instance_id() make the duplicate-cache symptom visible, but there is no direct coverage for this regression.
  • Transaction/persistence: Proto persistence change itself looks straightforward, but the end-to-end metadata flow is incomplete.
  • Data writes/modifications: This affects cached node ownership/routing after rollback refresh and can leave stale predecessor entries visible temporarily.
  • FE/BE variable passing: Not applicable.
  • Performance: No material performance issue beyond duplicate cache entries and redundant lookups caused by the stale predecessor nodes.
  • Other issues: None beyond the inconsistency above.

Overall opinion: the semantic split makes sense, but this PR should also update the rollback producer path (and add coverage) before the consumer switches from source_instance_id to predecessor_instance_id.

// ... or, if is_successor_instance, erase nodes belong to source_instance_id
(!is_successor_instance || i->second.instance_id != source_instance_id)) {
// erase all nodes belong to this instance_id
if (i->second.instance_id == instance_id ||
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This switches cache eviction from source_instance_id to predecessor_instance_id, but I can't find any writer for the new field in the rollback flow. The current chain/rollback paths and the existing RollbackInstance test still populate only source_instance_id and original_instance_id. In that path, refreshing the rollback target stops removing the predecessor's cached nodes, so get_node() can return both old and new entries for the same cloud_unique_id until the predecessor is refreshed separately. get_instance_id() then logs a one-to-many warning and picks the last entry. Please either populate predecessor_instance_id everywhere a rollback successor is created, or keep the old fallback here until all producers are updated and tested.`

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Apr 9, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

PR approved by anyone and no changes requested.

@w41ter
Copy link
Copy Markdown
Contributor Author

w41ter commented Apr 10, 2026

run buildall

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. cloud dev/4.1.x reviewed snapshot

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants