Skip to content

Fix unconnected Curator client handling when registry check is false#16329

Open
Zhengcy05 wants to merge 2 commits into
apache:3.3from
Zhengcy05:fix/failed-to-safely-close-the-failed-connection
Open

Fix unconnected Curator client handling when registry check is false#16329
Zhengcy05 wants to merge 2 commits into
apache:3.3from
Zhengcy05:fix/failed-to-safely-close-the-failed-connection

Conversation

@Zhengcy05

Copy link
Copy Markdown

Fixes #16186

What is the purpose of the change?

This PR fixes an issue in the multi-ZooKeeper registry scenario when one registry is unavailable and configured with check=false.

Based on the issue report, in the dubbo 3.3 branch, if multiple ZooKeeper registries are configured and one of them is not started with check=false, Curator5ZookeeperClient may keep an unconnected Curator client alive after blockUntilConnected(...) returns false. As a result, the following registry workflow still treats that client as usable and continues trying to create nodes on it. This can cause repeated failures in node creation, significantly slow down application startup, and eventually lead to startup failure.

From the call path, the problem has two parts:

  1. Curator5ZookeeperClient does not fail fast enough on connection failure
    If blockUntilConnected(...) returns false, the ZooKeeper client has not been successfully established. This should always be treated as a client creation failure, and the underlying Curator client should be closed immediately so it cannot be reused later.
    The check=false semantic should only control whether the upper layer continues to throw when registry creation fails. It should not change whether the low-level ZooKeeper client itself is considered successfully created.

  2. ListenerRegistryWrapper is not null-safe when the upper layer returns a null registry
    In AbstractRegistryFactory#getRegistry(), when registry creation fails and check=false, Dubbo logs a warning and returns null.
    However, RegistryFactoryWrapper still wraps that result, and methods such as ListenerRegistryWrapper#getUrl() previously dereferenced the delegate directly. This could later trigger a NullPointerException in paths like RegistryDirectory#subscribe().

To address this, this PR includes two parts:

  • Fix the connection failure handling in Curator5ZookeeperClient

    • When blockUntilConnected(...) == false, always close the client first and then throw an exception.
    • Remove the dependency on check in this low-level failure path.
    • Keep the client creation semantic consistent: if the connection is not established, client creation fails.
  • Add null-safe handling to ListenerRegistryWrapper

    • Add a constructor that preserves the original URL.
    • Return the original URL from getUrl() when registry == null.
    • Return false from isAvailable() when registry == null.
    • Make destroy() a safe no-op when registry == null.
    • Add null protection to unsubscribe(), isServiceDiscovery(), and lookup().
    • Keep the existing listener callback behavior unchanged.

With these changes:

  • an unconnected ZooKeeper client will not be reused by the following registry workflow;
  • the existing check=false behavior at the registry factory layer remains unchanged;
  • a null registry can still be safely wrapped without introducing extra NPEs.

Tests

This PR adds focused regression coverage for both the low-level client behavior and the upper-layer check=false flow:

  • Curator5ZookeeperClientTest

    • verifies that when blockUntilConnected(...) fails, constructing Curator5ZookeeperClient throws IllegalStateException for both check=true and check=false;
    • verifies that client.close() is invoked on connection failure.
  • AbstractRegistryFactoryTest

    • verifies that when createRegistry(url) fails and check=false, getRegistry(url) returns null, preserving the existing upper-layer behavior.
  • ListenerRegistryWrapperTest

    • verifies that getUrl(), isAvailable(), destroy(), lookup(), subscribe(), and unsubscribe() are safe when registry == null;
    • verifies that listener callbacks are still triggered as expected in the null registry case.
  • RegistryFactoryWrapperTest

    • verifies that when the underlying RegistryFactory#getRegistry(url) returns null, the wrapper still returns the original URL safely and allows subscription-related calls without throwing exceptions.

Verification command used:

mvn -pl dubbo-remoting/dubbo-remoting-zookeeper-curator5,dubbo-registry/dubbo-registry-api \
  -Dtest=Curator5ZookeeperClientTest,ListenerRegistryWrapperTest,RegistryFactoryWrapperTest,AbstractRegistryFactoryTest \
  -Dsurefire.failIfNoSpecifiedTests=false test -q

Checklist

  • Make sure there is a GitHub_issue field for the change.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Write necessary unit-test to verify your logic correction. If the new feature or significant change is committed, please remember to add sample in dubbo samples project.
  • Make sure gitHub actions can pass. Why the workflow is failing and how to fix it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] issues 15271 Refactoring the bug in the registry center

1 participant