Fix ConfigNode NPE when extend/reconstruct region targets a non-DataNode id by CRZbulabula · Pull Request #18075 · apache/iotdb

CRZbulabula · 2026-06-30T17:35:38Z

Problem

Running extend region <regionId> to <nodeId> (or reconstruct region <regionId> on <nodeId>) where <nodeId> does not belong to any registered DataNode — e.g. it is a ConfigNode id or simply a non-existent id like 9999 — crashes the ConfigNode RPC handler with a NullPointerException, and the client only sees a misleading error:

Msg: org.apache.iotdb.jdbc.IoTDBSQLException: 305: Error in calling method extendRegion,
because: Fail to connect to any config node. Please check status of ConfigNodes or logs of connected DataNode

ConfigNode log:

ERROR o.a.t.ProcessFunction:47 - Internal error processing extendRegion
java.lang.NullPointerException: Cannot invoke
  "org.apache.iotdb.common.rpc.thrift.TDataNodeLocation.getDataNodeId()" because "targetDataNode" is null
        at org.apache.iotdb.confignode.manager.ProcedureManager.checkExtendRegion(ProcedureManager.java:...)
        at org.apache.iotdb.confignode.manager.ProcedureManager.extendOneRegion(ProcedureManager.java:...)
        ...

Root cause

NodeManager.getRegisteredDataNode(id) is backed by registeredDataNodes.getOrDefault(id, new TDataNodeConfiguration()), so for an unregistered id it returns an empty TDataNodeConfiguration whose location is null. In extendOneRegion / reconstructRegion:

final TDataNodeLocation targetDataNode =
    configManager.getNodeManager().getRegisteredDataNode(req.getDataNodeId()).getLocation(); // -> null

targetDataNode becomes null, and checkExtendRegion / checkReconstructRegion then dereference targetDataNode.getDataNodeId(), throwing the NPE out of the Thrift handler — which the client reports as the generic "Fail to connect to any config node".

Fix

Detect the invalid target at the resolution layer, not inside the check methods. extendOneRegion and reconstructRegion now resolve the target through a small getRegisteredDataNodeLocationOrNull helper and, when the id is not a registered DataNode, return immediately with a clear message:

Target DataNode <id> does not exist in the cluster

This mirrors the existing migrateRegion pattern (which already resolves and validates fromId/toId up front), so checkExtendRegion / checkReconstructRegion only ever receive a non-null target and are left in their original form.

Tests

Added two regression integration tests asserting the operation is rejected cleanly and never surfaces a NullPointerException:

IoTDBRegionGroupExpandAndShrinkForIoTV1IT#extendRegionToInvalidDataNodeTest
IoTDBRegionReconstructForIoTV1IT#reconstructRegionToInvalidDataNodeTest

(extend/remove region wrap every region's status via processExtendOrRemoveRegions, so the JDBC top-level message shows the aggregate Total regions: 1, ..., failed to submit: 1 and the concrete reason is carried in the per-region sub-status; the extend test asserts on the aggregate + absence of NPE, while the reconstruct test — which returns its TSStatus directly — asserts the full "does not exist in the cluster" message.)

Drive-by: fix a pre-existing flaky test

While validating CI I found IoTDBRemoveDataNodeNormalIT.success1C5DRemoveTwoDataNodesUseSQL (added in #18046) is flaky, unrelated to this change. It removes two randomly chosen DataNodes and asserts success, but if both happen to host a replica of the same consensus group the ConfigNode correctly rejects the request:

Only one replica of the same consensus group is allowed to be migrated at the same time.

Added IoTDBRemoveDataNodeUtils#selectRemoveDataNodesWithoutRegionConflict, which selects DataNodes whose region sets are pairwise disjoint, and use it whenever more than one DataNode is removed.

All three tests pass locally.

…-DataNode id When the target node id of `extend region` (or `reconstruct region`) does not belong to any registered DataNode -- for example it is a ConfigNode id or simply does not exist -- NodeManager.getRegisteredDataNode returns an empty TDataNodeConfiguration whose location is null. checkExtendRegion and checkReconstructRegion then dereferenced targetDataNode.getDataNodeId() unconditionally, throwing a NullPointerException in the ConfigNode RPC handler. The client only saw a misleading "Fail to connect to any config node" message. regionOperationCommonCheck already reports "Cannot find Target DataNode" for a null target, so reorder both checks to follow the safe checkMigrateRegion pattern: run regionOperationCommonCheck first and short-circuit before any targetDataNode dereference. The client now receives a clear, correct error. Add regression ITs for both the extend and reconstruct paths.

codecov · 2026-06-30T18:24:36Z

Codecov Report

❌ Patch coverage is 0% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.46%. Comparing base (522c31c) to head (3f26900).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
...che/iotdb/confignode/manager/ProcedureManager.java	0.00%	13 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18075      +/-   ##
============================================
- Coverage     41.46%   41.46%   -0.01%     
  Complexity      318      318              
============================================
  Files          5294     5294              
  Lines        371228   371238      +10     
  Branches      48036    48038       +2     
============================================
- Hits         153941   153928      -13     
- Misses       217287   217310      +23

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…-DN IT Address review feedback: instead of tolerating a null targetDataNode inside checkExtendRegion / checkReconstructRegion, detect the invalid target at the resolution layer. extendOneRegion and reconstructRegion now resolve the target via getRegisteredDataNodeLocationOrNull and, when the id is not a registered DataNode (a ConfigNode id or a non-existent id), immediately return "Target DataNode <id> does not exist in the cluster" -- mirroring the existing migrateRegion pattern -- so the check methods only ever receive a non-null target. The two check methods are restored to their original form. Also fix the flaky IoTDBRemoveDataNodeNormalIT.success1C5DRemoveTwoDataNodesUseSQL (added in #18046): it removed two randomly chosen DataNodes and asserted success, but if both hosted a replica of the same consensus group the ConfigNode correctly rejects the request ("Only one replica of the same consensus group is allowed to be migrated at the same time."). Add selectRemoveDataNodesWithoutRegionConflict, which picks DataNodes whose region sets are pairwise disjoint, and use it whenever more than one DataNode is removed. Verified locally: extendRegionToInvalidDataNodeTest, reconstructRegionToInvalidDataNodeTest and success1C5DRemoveTwoDataNodesUseSQL all pass.

sonarqubecloud · 2026-07-01T02:28:17Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix ConfigNode NPE when extend/reconstruct region targets a non-DataNode id#18075

Fix ConfigNode NPE when extend/reconstruct region targets a non-DataNode id#18075
CRZbulabula wants to merge 2 commits into
masterfrom
fix-extend-region-npe-invalid-target

CRZbulabula commented Jun 30, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 30, 2026 •

edited

Loading

Uh oh!

sonarqubecloud Bot commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

CRZbulabula commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root cause

Fix

Tests

Drive-by: fix a pre-existing flaky test

Uh oh!

codecov Bot commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

sonarqubecloud Bot commented Jul 1, 2026

Quality Gate passed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

CRZbulabula commented Jun 30, 2026 •

edited

Loading

codecov Bot commented Jun 30, 2026 •

edited

Loading