[fix](cloud) Add predecessor instance id#62276
Conversation
When a rollback exists, the predecessor instance is not the same as the source instance. This change separates the two semantics into distinct fields for clarity.
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
Found 1 issue.
cloud/src/resource-manager/resource_manager.cpp:refresh_instance()now evicts rollback predecessor cache entries only whenpredecessor_instance_idis populated, but no producer writes that field in this PR or in the existing rollback flow. Current rollback instances still carrysource_instance_idandoriginal_instance_id, so refreshing the new rollback target leaves the predecessor's nodes innode_info_until a separate refresh of the predecessor arrives. During that windowget_node()returns both entries for the samecloud_unique_id, andget_instance_id()logs a one-to-many warning before picking the last entry.
Critical checkpoint conclusions:
- Goal of the task: Partially met. The proto now distinguishes snapshot source from rollback predecessor, but the runtime refresh path is not wired end-to-end because the new field is never populated.
- Scope/minimality: The change is small and focused, but incomplete because only one consumer was updated.
- Concurrency:
refresh_instance()remains protected bymtx_; I did not find a new lock-order or heavy-work-under-lock regression in this patch. - Lifecycle/static initialization: Not applicable for this change.
- Configuration: No config changes.
- Compatibility: The added proto field is wire-compatible, but the parallel rollback path still relies on the old field semantics, so mixed producer/consumer behavior is inconsistent even within this tree.
- Parallel code paths: Not all related rollback paths were updated; the predecessor field has no writer yet.
- Special conditional checks: The new conditional is reasonable only if
predecessor_instance_idis guaranteed to be present, which is not true today. - Test coverage: Insufficient. No test was added or updated to cover producer plus consumer behavior for
predecessor_instance_id; the existing rollback test still sets onlysource_instance_idandoriginal_instance_id. - Observability: Existing warnings in
get_instance_id()make the duplicate-cache symptom visible, but there is no direct coverage for this regression. - Transaction/persistence: Proto persistence change itself looks straightforward, but the end-to-end metadata flow is incomplete.
- Data writes/modifications: This affects cached node ownership/routing after rollback refresh and can leave stale predecessor entries visible temporarily.
- FE/BE variable passing: Not applicable.
- Performance: No material performance issue beyond duplicate cache entries and redundant lookups caused by the stale predecessor nodes.
- Other issues: None beyond the inconsistency above.
Overall opinion: the semantic split makes sense, but this PR should also update the rollback producer path (and add coverage) before the consumer switches from source_instance_id to predecessor_instance_id.
| // ... or, if is_successor_instance, erase nodes belong to source_instance_id | ||
| (!is_successor_instance || i->second.instance_id != source_instance_id)) { | ||
| // erase all nodes belong to this instance_id | ||
| if (i->second.instance_id == instance_id || |
There was a problem hiding this comment.
This switches cache eviction from source_instance_id to predecessor_instance_id, but I can't find any writer for the new field in the rollback flow. The current chain/rollback paths and the existing RollbackInstance test still populate only source_instance_id and original_instance_id. In that path, refreshing the rollback target stops removing the predecessor's cached nodes, so get_node() can return both old and new entries for the same cloud_unique_id until the predecessor is refreshed separately. get_instance_id() then logs a one-to-many warning and picks the last entry. Please either populate predecessor_instance_id everywhere a rollback successor is created, or keep the old fallback here until all producers are updated and tested.`
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
run buildall |
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
When a rollback exists, the predecessor instance is not the same as the source instance. This change separates the two semantics into distinct fields for clarity.
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)