HDDS-15327. Proactively clear failed replication commands in SCM#10540
Open
chihsuan wants to merge 10 commits into
Open
HDDS-15327. Proactively clear failed replication commands in SCM#10540chihsuan wants to merge 10 commits into
chihsuan wants to merge 10 commits into
Conversation
Follower SCMs no longer clear pending ops on stray command-status reports, matching the existing leader guard in DeletedBlockLogImpl.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
Problem: When a replication or EC reconstruction command fails on a datanode (a transient network blip, a busy node, etc.), SCM is never told. The pending "ADD" op keeps counting against the inflight replication quota until its deadline expires, which defaults to 12 minutes (
hdds.scm.replication.event.timeout).That stale entry causes two problems:
During decommission this stalls progress badly: thousands of commands are issued, and even a small failure rate leaks enough stale entries to block the cluster for up to 12 minutes at a time.
Fix: Let the datanode report when a replication/reconstruction command finishes, so SCM clears the op immediately instead of waiting for the timeout. This re-introduces the command-status feedback path that was removed in HDDS-1368, with no Protobuf/wire change (the
CommandStatusmessage already hasFAILED,cmdId, andtype).EXECUTED/FAILEDforreplicateContainerCommandandreconstructECContainersCommand, exactly the waydeleteBlocksCommandalready does. Tasks that are skipped, ignored before running (deadline passed, not in service, stale SCM term), or dropped before queueing (queue full ->FAILED, duplicate ->EXECUTED) also report a terminal status, so noPENDINGentry is left behind in the datanode's command-status map. Tasks with no backing SCM command (e.g. reconcile) are unaffected.CommandStatusReportHandlerfires a newREPLICATION_STATUSevent for failed commands; a dedicatedReplicationStatusHandler(leader-only, mirroring the delete-block path) consumes it and clears the matching pending op via the newContainerReplicaPendingOps#onReplicationCommandFailed(cmdId), decrementing the inflight counter and freeing the scheduled size. Both problems above are resolved at once.Compatibility is graceful: an old datanode against a new SCM just never sends the report and falls back to the 12-minute timeout; a new datanode against an old SCM has the status ignored, as before.
Follow-up (separate HDDS Jira): every failure is currently rescheduled like a timeout. A later change can carry a failure reason in the existing
CommandStatus.msgfield (no wire change) so transient failures retry promptly while queue-full failures back off, avoiding a resend/drop loop under heavy load. This touches the ReplicationManager retry policy, so it is kept out of this surgical change.Flow
sequenceDiagram autonumber participant SCMcmd as SCM (ReplicationManager) participant DN as Datanode (StateContext, ReplicationSupervisor) participant Pub as CommandStatusReportPublisher participant H as CommandStatusReportHandler participant Ops as ContainerReplicaPendingOps Note over SCMcmd,Ops: replicateContainerCommand / reconstructECContainersCommand SCMcmd->>DN: send ADD command (cmdId) Note right of SCMcmd: pendingOps records ADD<br/>commandIdToContainer[cmdId] = container<br/>inflight quota consumed DN->>DN: addCmdStatus registers PENDING(cmdId) DN->>DN: TaskRunner.run() executes task alt task FAILED or early-return DN->>DN: updateCommandStatus(cmdId, markAsFailed) Pub->>H: heartbeat report {cmdId: FAILED} H->>H: filter FAILED replicate/reconstruct, fire REPLICATION_STATUS H->>Ops: ReplicationStatusHandler (leader only) calls onReplicationCommandFailed(cmdId) Ops->>Ops: remove ADD op, decrement inflight,<br/>release scheduled size, drop index entry Ops-->>SCMcmd: notifySubscribers(timedOut=true), op freed and rescheduled next RM cycle else task DONE or SKIPPED (success) DN->>DN: updateCommandStatus(cmdId, markAsExecuted) Pub->>H: heartbeat report {cmdId: EXECUTED} H->>H: EXECUTED not routed (debug log only) Note over Ops: op already cleared earlier by<br/>container report completeOp() end Note over Pub: PENDING entries stay in the map and are<br/>re-sent each interval until resolved, so every<br/>exit path must mark EXECUTED or FAILEDWhat is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-15327
How was this patch tested?
New and updated unit tests:
TestContainerReplicaPendingOps: a failed command removes the matching ADD op and decrements the inflight counter; an unknown command id is a no-op.TestCommandStatusReportHandler: a FAILED replication status firesREPLICATION_STATUS.TestStateContext: replicate/reconstruct commands register a PENDING status.TestReplicationSupervisor: a finished task reportsEXECUTEDon success andFAILEDon failure; skipped, deadline-passed, queue-full, and duplicate tasks all drain their status entry (no leftoverPENDING).TestReplicationStatusHandler: the leader clears the pending op on a failed status; a follower does not.Local CI-aligned checks all pass:
checkstyle.sh,rat.sh,author.sh.Generated-by: Claude Code (Claude Opus 4.8)