Skip to content

Fix ConfigNode NPE when extend/reconstruct region targets a non-DataNode id#18075

Open
CRZbulabula wants to merge 2 commits into
masterfrom
fix-extend-region-npe-invalid-target
Open

Fix ConfigNode NPE when extend/reconstruct region targets a non-DataNode id#18075
CRZbulabula wants to merge 2 commits into
masterfrom
fix-extend-region-npe-invalid-target

Conversation

@CRZbulabula

@CRZbulabula CRZbulabula commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Problem

Running extend region <regionId> to <nodeId> (or reconstruct region <regionId> on <nodeId>) where <nodeId> does not belong to any registered DataNode — e.g. it is a ConfigNode id or simply a non-existent id like 9999 — crashes the ConfigNode RPC handler with a NullPointerException, and the client only sees a misleading error:

Msg: org.apache.iotdb.jdbc.IoTDBSQLException: 305: Error in calling method extendRegion,
because: Fail to connect to any config node. Please check status of ConfigNodes or logs of connected DataNode

ConfigNode log:

ERROR o.a.t.ProcessFunction:47 - Internal error processing extendRegion
java.lang.NullPointerException: Cannot invoke
  "org.apache.iotdb.common.rpc.thrift.TDataNodeLocation.getDataNodeId()" because "targetDataNode" is null
        at org.apache.iotdb.confignode.manager.ProcedureManager.checkExtendRegion(ProcedureManager.java:...)
        at org.apache.iotdb.confignode.manager.ProcedureManager.extendOneRegion(ProcedureManager.java:...)
        ...

Root cause

NodeManager.getRegisteredDataNode(id) is backed by registeredDataNodes.getOrDefault(id, new TDataNodeConfiguration()), so for an unregistered id it returns an empty TDataNodeConfiguration whose location is null. In extendOneRegion / reconstructRegion:

final TDataNodeLocation targetDataNode =
    configManager.getNodeManager().getRegisteredDataNode(req.getDataNodeId()).getLocation(); // -> null

targetDataNode becomes null, and checkExtendRegion / checkReconstructRegion then dereference targetDataNode.getDataNodeId(), throwing the NPE out of the Thrift handler — which the client reports as the generic "Fail to connect to any config node".

Fix

Detect the invalid target at the resolution layer, not inside the check methods. extendOneRegion and reconstructRegion now resolve the target through a small getRegisteredDataNodeLocationOrNull helper and, when the id is not a registered DataNode, return immediately with a clear message:

Target DataNode <id> does not exist in the cluster

This mirrors the existing migrateRegion pattern (which already resolves and validates fromId/toId up front), so checkExtendRegion / checkReconstructRegion only ever receive a non-null target and are left in their original form.

Tests

Added two regression integration tests asserting the operation is rejected cleanly and never surfaces a NullPointerException:

  • IoTDBRegionGroupExpandAndShrinkForIoTV1IT#extendRegionToInvalidDataNodeTest
  • IoTDBRegionReconstructForIoTV1IT#reconstructRegionToInvalidDataNodeTest

(extend/remove region wrap every region's status via processExtendOrRemoveRegions, so the JDBC top-level message shows the aggregate Total regions: 1, ..., failed to submit: 1 and the concrete reason is carried in the per-region sub-status; the extend test asserts on the aggregate + absence of NPE, while the reconstruct test — which returns its TSStatus directly — asserts the full "does not exist in the cluster" message.)

Drive-by: fix a pre-existing flaky test

While validating CI I found IoTDBRemoveDataNodeNormalIT.success1C5DRemoveTwoDataNodesUseSQL (added in #18046) is flaky, unrelated to this change. It removes two randomly chosen DataNodes and asserts success, but if both happen to host a replica of the same consensus group the ConfigNode correctly rejects the request:

Only one replica of the same consensus group is allowed to be migrated at the same time.

Added IoTDBRemoveDataNodeUtils#selectRemoveDataNodesWithoutRegionConflict, which selects DataNodes whose region sets are pairwise disjoint, and use it whenever more than one DataNode is removed.

All three tests pass locally.

…-DataNode id

When the target node id of `extend region` (or `reconstruct region`) does not
belong to any registered DataNode -- for example it is a ConfigNode id or simply
does not exist -- NodeManager.getRegisteredDataNode returns an empty
TDataNodeConfiguration whose location is null. checkExtendRegion and
checkReconstructRegion then dereferenced targetDataNode.getDataNodeId()
unconditionally, throwing a NullPointerException in the ConfigNode RPC handler.
The client only saw a misleading "Fail to connect to any config node" message.

regionOperationCommonCheck already reports "Cannot find Target DataNode" for a
null target, so reorder both checks to follow the safe checkMigrateRegion
pattern: run regionOperationCommonCheck first and short-circuit before any
targetDataNode dereference. The client now receives a clear, correct error.

Add regression ITs for both the extend and reconstruct paths.
@codecov

codecov Bot commented Jun 30, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.46%. Comparing base (522c31c) to head (3f26900).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
...che/iotdb/confignode/manager/ProcedureManager.java 0.00% 13 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18075      +/-   ##
============================================
- Coverage     41.46%   41.46%   -0.01%     
  Complexity      318      318              
============================================
  Files          5294     5294              
  Lines        371228   371238      +10     
  Branches      48036    48038       +2     
============================================
- Hits         153941   153928      -13     
- Misses       217287   217310      +23     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…-DN IT

Address review feedback: instead of tolerating a null targetDataNode inside
checkExtendRegion / checkReconstructRegion, detect the invalid target at the
resolution layer. extendOneRegion and reconstructRegion now resolve the target
via getRegisteredDataNodeLocationOrNull and, when the id is not a registered
DataNode (a ConfigNode id or a non-existent id), immediately return
"Target DataNode <id> does not exist in the cluster" -- mirroring the existing
migrateRegion pattern -- so the check methods only ever receive a non-null
target. The two check methods are restored to their original form.

Also fix the flaky IoTDBRemoveDataNodeNormalIT.success1C5DRemoveTwoDataNodesUseSQL
(added in #18046): it removed two randomly chosen DataNodes and asserted success,
but if both hosted a replica of the same consensus group the ConfigNode correctly
rejects the request ("Only one replica of the same consensus group is allowed to
be migrated at the same time."). Add selectRemoveDataNodesWithoutRegionConflict,
which picks DataNodes whose region sets are pairwise disjoint, and use it whenever
more than one DataNode is removed.

Verified locally: extendRegionToInvalidDataNodeTest, reconstructRegionToInvalidDataNodeTest
and success1C5DRemoveTwoDataNodesUseSQL all pass.
@sonarqubecloud

sonarqubecloud Bot commented Jul 1, 2026

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant