HDDS-15327. Proactively clear failed replication commands in SCM #10540

Open
chihsuan wants to merge 10 commits into apache:master from chihsuan:HDDS-15327

@chihsuan chihsuan commented Jun 18, 2026

What changes were proposed in this pull request?

Problem: When a replication or EC reconstruction command fails on a datanode (a transient network blip, a busy node, etc.), SCM is never told. The pending "ADD" op keeps counting against the inflight replication quota until its deadline expires, which defaults to 12 minutes (hdds.scm.replication.event.timeout).

That stale entry causes two problems:

  1. The cluster-wide inflight count fills up with dead entries, so SCM stops scheduling new replication even while the datanodes sit idle.
  2. The affected container is not retried, because the health check still thinks an ADD is in flight for that replica.

During decommission this can stall progress for long stretches: thousands of commands are issued, and even a small failure rate leaks enough stale entries to exhaust the inflight quota and block new replication for up to 12 minutes at a time.
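The blocking behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not SCM code: `InflightQuotaModel`, its fields, and the limit of 3 are hypothetical, and the only way an entry leaves the count here is deadline expiry, which is what happens to a failed command when SCM is never told about the failure.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model (hypothetical): pending ops leave the inflight count only when their deadline expires. */
class InflightQuotaModel {
  private final int limit;        // assumed cluster-wide inflight limit
  private final long timeoutMs;   // stands in for hdds.scm.replication.event.timeout
  // Deadlines of pending ops; ascending as long as callers pass non-decreasing timestamps.
  private final Deque<Long> deadlines = new ArrayDeque<>();

  InflightQuotaModel(int limit, long timeoutMs) {
    this.limit = limit;
    this.timeoutMs = timeoutMs;
  }

  /** Tries to schedule one command at nowMs; returns false if the quota is exhausted. */
  boolean trySchedule(long nowMs) {
    expire(nowMs);
    if (deadlines.size() >= limit) {
      return false;
    }
    deadlines.addLast(nowMs + timeoutMs);
    return true;
  }

  /** Number of ops still counted as inflight at nowMs. */
  int inflight(long nowMs) {
    expire(nowMs);
    return deadlines.size();
  }

  /** Drops every op whose deadline has been reached. */
  private void expire(long nowMs) {
    while (!deadlines.isEmpty() && deadlines.peekFirst() <= nowMs) {
      deadlines.removeFirst();
    }
  }
}
```

With a limit of 3 and a 12-minute timeout, three commands that fail silently at time 0 keep `trySchedule` returning `false` until the 12-minute mark, even though no datanode is doing any work in between.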

Fix: Let the datanode report when a replication/reconstruction command finishes, so SCM clears the op immediately instead of waiting for the timeout. This re-introduces the command-status feedback path that was removed in HDDS-1368, with no Protobuf/wire change (the CommandStatus message already has FAILED, cmdId, and type).

  • Datanode: reports EXECUTED/FAILED for replicateContainerCommand and reconstructECContainersCommand, exactly the way deleteBlocksCommand already does. Tasks that are skipped, ignored before running (deadline passed, not in service, stale SCM term), or dropped before queueing (queue full -> FAILED, duplicate -> EXECUTED) also report a terminal status, so no PENDING entry is left behind in the datanode's command-status map. Tasks with no backing SCM command (e.g. reconcile) are unaffected.
  • SCM: CommandStatusReportHandler fires a new REPLICATION_STATUS event for failed commands; a dedicated ReplicationStatusHandler (leader-only, mirroring the delete-block path) consumes it and clears the matching pending op via the new ContainerReplicaPendingOps#onReplicationCommandFailed(cmdId), decrementing the inflight counter and freeing the scheduled size. Both problems above are resolved at once.
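The SCM-side bookkeeping described above can be sketched as a command-id index over pending ADD ops. This is a minimal illustration, not the actual `ContainerReplicaPendingOps` implementation: the class `PendingOpsSketch`, the `AddOp` type, and `scheduleAdd` are hypothetical, and only the method name `onReplicationCommandFailed(cmdId)` is taken from the description.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for the pending-ops bookkeeping in SCM. */
class PendingOpsSketch {
  /** Minimal record of one pending ADD op. */
  static final class AddOp {
    final long containerId;
    final long sizeBytes;

    AddOp(long containerId, long sizeBytes) {
      this.containerId = containerId;
      this.sizeBytes = sizeBytes;
    }
  }

  private final Map<Long, AddOp> opsByCommandId = new HashMap<>();
  private int inflightCount;
  private long scheduledBytes;

  /** Records an ADD op when a command is sent; a duplicate command id is ignored. */
  synchronized void scheduleAdd(long cmdId, long containerId, long sizeBytes) {
    if (opsByCommandId.putIfAbsent(cmdId, new AddOp(containerId, sizeBytes)) == null) {
      inflightCount++;
      scheduledBytes += sizeBytes;
    }
  }

  /**
   * Clears the op for a failed command. Returns true if an op was removed;
   * an unknown command id is a no-op and returns false.
   */
  synchronized boolean onReplicationCommandFailed(long cmdId) {
    AddOp op = opsByCommandId.remove(cmdId);
    if (op == null) {
      return false;
    }
    inflightCount--;
    scheduledBytes -= op.sizeBytes;
    return true;
  }

  synchronized int getInflightCount() {
    return inflightCount;
  }

  synchronized long getScheduledBytes() {
    return scheduledBytes;
  }
}
```

Removing the index entry in the same step as the counter update keeps a repeated FAILED report for the same command id from decrementing the count twice.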

Mixed versions degrade gracefully: an old datanode talking to a new SCM never sends the report, so SCM falls back to the 12-minute timeout; a new datanode talking to an old SCM sends a status that the old SCM ignores, as before.

Follow-up (separate HDDS Jira): every failure is currently rescheduled the same way as a timeout. A later change can carry a failure reason in the existing CommandStatus.msg field (no wire change) so transient failures retry promptly while queue-full failures back off, avoiding a resend/drop loop under heavy load. That touches the ReplicationManager retry policy, so it is kept out of this narrowly scoped change.
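To make the follow-up idea concrete, here is one possible shape for a reason-aware retry delay. Everything in it is an assumption for illustration: the class `RetryPolicySketch`, the reason strings `TRANSIENT` and `QUEUE_FULL`, the 30-second base, and the 10-minute cap are all hypothetical, and nothing here reflects how ReplicationManager schedules retries today.

```java
import java.time.Duration;
import java.util.Locale;

/** Hypothetical sketch of a reason-aware retry delay; not part of this PR. */
class RetryPolicySketch {
  enum Reason { TRANSIENT, QUEUE_FULL, UNKNOWN }

  private static final long BASE_SECONDS = 30L;   // assumed base backoff
  private static final long CAP_SECONDS = 600L;   // assumed backoff cap

  /** Maps a free-form status message to a reason; the strings are assumptions. */
  static Reason parse(String msg) {
    if (msg == null) {
      return Reason.UNKNOWN;
    }
    switch (msg.trim().toUpperCase(Locale.ROOT)) {
      case "TRANSIENT":
        return Reason.TRANSIENT;
      case "QUEUE_FULL":
        return Reason.QUEUE_FULL;
      default:
        return Reason.UNKNOWN;
    }
  }

  /** Delay before the next attempt; attempt numbers start at 1. */
  static Duration retryDelay(Reason reason, int attempt) {
    if (reason != Reason.QUEUE_FULL) {
      // Transient and unknown failures retry at the next opportunity.
      return Duration.ZERO;
    }
    // Capped exponential backoff: 30s, 60s, 120s, ... up to 600s.
    int shift = Math.min(Math.max(attempt, 1) - 1, 10);
    return Duration.ofSeconds(Math.min(BASE_SECONDS << shift, CAP_SECONDS));
  }
}
```

Treating an unrecognised or missing message as `UNKNOWN` with no extra delay would keep the behaviour for old datanodes identical to what this PR introduces.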

Flow

sequenceDiagram
    autonumber
    participant SCMcmd as SCM (ReplicationManager)
    participant DN as Datanode (StateContext, ReplicationSupervisor)
    participant Pub as CommandStatusReportPublisher
    participant H as CommandStatusReportHandler
    participant Ops as ContainerReplicaPendingOps

    Note over SCMcmd,Ops: replicateContainerCommand / reconstructECContainersCommand

    SCMcmd->>DN: send ADD command (cmdId)
    Note right of SCMcmd: pendingOps records ADD<br/>commandIdToContainer[cmdId] = container<br/>inflight quota consumed
    DN->>DN: addCmdStatus registers PENDING(cmdId)

    DN->>DN: TaskRunner.run() executes task

    alt task FAILED or early-return
        DN->>DN: updateCommandStatus(cmdId, markAsFailed)
        Pub->>H: heartbeat report {cmdId: FAILED}
        H->>H: filter FAILED replicate/reconstruct, fire REPLICATION_STATUS
        H->>Ops: ReplicationStatusHandler (leader only) calls onReplicationCommandFailed(cmdId)
        Ops->>Ops: remove ADD op, decrement inflight,<br/>release scheduled size, drop index entry
        Ops-->>SCMcmd: notifySubscribers(timedOut=true), op freed and rescheduled next RM cycle
    else task DONE or SKIPPED (success)
        DN->>DN: updateCommandStatus(cmdId, markAsExecuted)
        Pub->>H: heartbeat report {cmdId: EXECUTED}
        H->>H: EXECUTED not routed (debug log only)
        Note over Ops: op already cleared earlier by<br/>container report completeOp()
    end

    Note over Pub: PENDING entries stay in the map and are<br/>re-sent each interval until resolved, so every<br/>exit path must mark EXECUTED or FAILED
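The datanode-side invariant in the diagram's last note (every exit path must end in EXECUTED or FAILED) can be sketched as follows. This is a simplified illustration, not the StateContext or ReplicationSupervisor code: `CommandStatusSketch` and its methods are hypothetical, and mapping a passed deadline to FAILED is this sketch's reading of the "early-return" branch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a datanode command-status map with terminal states on every path. */
class CommandStatusSketch {
  enum Status { PENDING, EXECUTED, FAILED }

  private final Map<Long, Status> statusByCmdId = new ConcurrentHashMap<>();

  /** Registers a PENDING entry when a command arrives from SCM. */
  void register(long cmdId) {
    statusByCmdId.put(cmdId, Status.PENDING);
  }

  /** Runs the task so that every exit path leaves a terminal status. */
  void run(long cmdId, boolean deadlinePassed, Runnable task) {
    if (deadlinePassed) {
      mark(cmdId, Status.FAILED);   // early return still resolves the entry
      return;
    }
    try {
      task.run();
      mark(cmdId, Status.EXECUTED);
    } catch (RuntimeException e) {
      mark(cmdId, Status.FAILED);
    }
  }

  /** Updates only existing entries, so tasks with no backing SCM command are unaffected. */
  private void mark(long cmdId, Status terminal) {
    statusByCmdId.computeIfPresent(cmdId, (id, old) -> terminal);
  }

  /** Returns the status for a command id, or null if none was registered. */
  Status statusOf(long cmdId) {
    return statusByCmdId.get(cmdId);
  }

  /** Counts entries that are still PENDING and would be re-sent on the next report. */
  long pendingCount() {
    return statusByCmdId.values().stream().filter(s -> s == Status.PENDING).count();
  }
}
```

Because the update goes through `computeIfPresent`, running a task whose command id was never registered does not create a status entry.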

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15327

How was this patch tested?

New and updated unit tests:

  • TestContainerReplicaPendingOps: a failed command removes the matching ADD op and decrements the inflight counter; an unknown command id is a no-op.
  • TestCommandStatusReportHandler: a FAILED replication status fires REPLICATION_STATUS.
  • TestStateContext: replicate/reconstruct commands register a PENDING status.
  • TestReplicationSupervisor: a finished task reports EXECUTED on success and FAILED on failure; skipped, deadline-passed, queue-full, and duplicate tasks all drain their status entry (no leftover PENDING).
  • TestReplicationStatusHandler: the leader clears the pending op on a failed status; a follower does not.

Local CI-aligned checks all pass: checkstyle.sh, rat.sh, author.sh.

Generated-by: Claude Code (Claude Opus 4.8)

@chihsuan chihsuan marked this pull request as ready for review June 18, 2026 12:34