
Conversation

@vqianxiao

During the service provider's release period, concurrent route reads from consumers were rejected #15881

What is the purpose of the change?

Changing invokerRefreshLock from a ReentrantLock to a ReentrantReadWriteLock avoids the concurrency issue, and taking the read lock (invokerRefreshReadLock) on the routing path avoids lock blocking during high-concurrency reads.
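
For reference, a minimal sketch of the locking pattern described above (class shape, method names, and signatures here are simplified placeholders, not the actual Dubbo AbstractDirectory code):

```java
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: the real AbstractDirectory carries much more state; this shows just the lock usage.
public abstract class DirectoryLockSketch<T> {

    // Before: a single ReentrantLock serialized readers and writers alike.
    // After: a read-write lock lets routing threads share the read lock.
    private final ReentrantReadWriteLock invokerRefreshLock = new ReentrantReadWriteLock();
    private final Lock invokerRefreshReadLock = invokerRefreshLock.readLock();
    private final Lock invokerRefreshWriteLock = invokerRefreshLock.writeLock();

    // Consumer-side routing: many threads may hold the read lock at the same time.
    public List<T> list(Object invocation) {
        invokerRefreshReadLock.lock();
        try {
            return doList(invocation);
        } finally {
            invokerRefreshReadLock.unlock();
        }
    }

    // Invoker refresh during a provider release: requires exclusive access.
    protected void refreshInvokers(List<T> newInvokers) {
        invokerRefreshWriteLock.lock();
        try {
            doRefresh(newInvokers);
        } finally {
            invokerRefreshWriteLock.unlock();
        }
    }

    protected abstract List<T> doList(Object invocation);

    protected abstract void doRefresh(List<T> newInvokers);
}
```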

Checklist

  • Make sure there is a GitHub_issue field for the change.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Write the necessary unit tests to verify your logic correction. If a new feature or significant change is committed, please remember to add a sample to the dubbo-samples project.
  • Make sure GitHub Actions can pass. If the workflow fails, check why it is failing and how to fix it.

wangwei added 2 commits December 19, 2025 16:08
…rantReadWriteLock avoids concurrency issues, and using invokerRefreshReadLock avoids lock blocking during high concurrency reads apache#15881
@codecov-commenter

codecov-commenter commented Dec 19, 2025

Codecov Report

❌ Patch coverage is 77.77778% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.73%. Comparing base (405bd9f) to head (84c1f1b).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...dubbo/rpc/cluster/directory/AbstractDirectory.java | 77.77% | 7 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##                3.3   #15883      +/-   ##
============================================
+ Coverage     60.71%   60.73%   +0.02%     
+ Complexity    11769    11767       -2     
============================================
  Files          1948     1948              
  Lines         88732    88748      +16     
  Branches      13379    13381       +2     
============================================
+ Hits          53877    53905      +28     
+ Misses        29325    29313      -12     
  Partials       5530     5530              
| Flag | Coverage Δ | |
|---|---|---|
| integration-tests-java21 | 32.27% <58.33%> (+0.06%) | ⬆️ |
| integration-tests-java8 | 32.35% <58.33%> (+0.01%) | ⬆️ |
| samples-tests-java21 | 34.86% <44.44%> (-0.02%) | ⬇️ |
| samples-tests-java8 | 32.54% <44.44%> (-0.01%) | ⬇️ |
| unit-tests-java11 | 58.97% <63.88%> (-0.02%) | ⬇️ |
| unit-tests-java17 | 58.45% <63.88%> (-0.01%) | ⬇️ |
| unit-tests-java21 | 58.46% <63.88%> (-0.01%) | ⬇️ |
| unit-tests-java25 | 58.43% <63.88%> (+<0.01%) | ⬆️ |
| unit-tests-java8 | 58.95% <63.88%> (-0.01%) | ⬇️ |


Contributor

Copilot AI left a comment


Pull request overview

This pull request addresses a concurrency issue during service provider release periods by upgrading the locking mechanism from ReentrantLock to ReentrantReadWriteLock. This change allows multiple consumer threads to concurrently read routes without blocking each other, while still maintaining exclusive access for write operations.

Key Changes:

  • Replaced invokerRefreshLock (ReentrantLock) with a ReentrantReadWriteLock and extracted separate read and write lock references
  • Modified the list() method to use the read lock for concurrent access to invoker lists
  • Updated all write operations (add/remove invokers, refresh, etc.) to use the write lock
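
To make the concurrent-read / exclusive-write behavior described above concrete, here is a small self-contained demonstration (illustrative only, not Dubbo code): two readers can hold the read lock at once, while a writer cannot acquire the write lock as long as any read lock is held.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadWriteLockDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        // The first "routing" thread (here: the main thread) holds the read lock.
        lock.readLock().lock();
        try {
            // A second reader can still acquire the read lock concurrently.
            Thread reader = new Thread(() -> {
                boolean acquired = lock.readLock().tryLock();
                System.out.println("second reader acquired read lock: " + acquired); // true
                if (acquired) {
                    lock.readLock().unlock();
                }
            });
            reader.start();
            reader.join();

            // A writer (the invoker refresh) cannot proceed while any read lock is held.
            Thread writer = new Thread(() -> {
                try {
                    boolean acquired = lock.writeLock().tryLock(100, TimeUnit.MILLISECONDS);
                    System.out.println("writer acquired write lock while a reader is active: " + acquired); // false
                    if (acquired) {
                        lock.writeLock().unlock();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            writer.start();
            writer.join();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```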


Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.



@EarthChen previously approved these changes on Dec 23, 2025
Member

@EarthChen left a comment


LGTM

@RainYuY added the type/discussion (Everything related with code discussion or question) label on Dec 23, 2025
Member

@RainYuY left a comment


I think if we accept this PR, it will lead to the invokers not being refreshed while routing is in progress. If the QPS is high, this may cause issues such as dead nodes remaining valid for an extended period. Additionally, I don’t understand why #10925 added this restriction. We need to discuss this further. @AlbumenJ

@vqianxiao
Author

vqianxiao commented Dec 24, 2025

Hi @RainYuY
Thank you for your comment, but this is not an improvement, it is a fix. In our production environment we found that after the service provider was released, Dubbo consumers could not call the provider's services normally; you can see in #15881 that the call count dropped to 0. I overrode the AbstractDirectory class in the jar with my modified version and redeployed the consumers, and after the provider finished its release the consumers could consume normally. Our consumers call the provider at roughly 100,000 QPS (10w) in total, about 1,000 QPS per machine, which I think already counts as a high-QPS workload.

@RainYuY
Member

RainYuY commented Dec 24, 2025

> Thank you for your comment, but this is not an improvement, it is a fix. In our production environment we found that after the service provider was released, Dubbo consumers could not call the provider's services normally […] Our consumers call the provider at roughly 100,000 QPS (10w) in total, about 1,000 QPS per machine, which I think already counts as a high-QPS workload.

So you haven’t encountered the situation where invokers are refreshed late? From my understanding of your code, if a request is being routed, the invoker list cannot be refreshed. As a result, the refresh process will be blocked until routing is completed. However, if routing is ongoing continuously (e.g., a read lock is held persistently), the write lock will take much longer to be acquired.

@EarthChen
Member


What @RainYuY is concerned about is that, under sustained high-concurrency reads on the read-write lock, the refresh path may fail to acquire the write lock and so the invoker list cannot be updated successfully; in that case you will keep retrieving an outdated invoker list.

@vqianxiao
Author

> So you haven’t encountered the situation where invokers are refreshed late? From my understanding of your code, if a request is being routed, the invoker list cannot be refreshed. As a result, the refresh process will be blocked until routing is completed. However, if routing is ongoing continuously (e.g., a read lock is held persistently), the write lock will take much longer to be acquired.

You are right. I found that during a provider's release period, overall Dubbo call latency increased because of lock blocking, and this lasted until the provider completed its release. I wonder whether there is a better way to solve this, but for now all I can think of is taking the read lock first so that calls can still proceed normally instead of failing outright.

@EarthChen
Member

> So you haven’t encountered the situation where invokers are refreshed late? From my understanding of your code, if a request is being routed, the invoker list cannot be refreshed. As a result, the refresh process will be blocked until routing is completed. However, if routing is ongoing continuously (e.g., a read lock is held persistently), the write lock will take much longer to be acquired.

> You are right. I found that during a provider's release period, overall Dubbo call latency increased because of lock blocking, and this lasted until the provider completed its release. I wonder whether there is a better way to solve this, but for now all I can think of is taking the read lock first so that calls can still proceed normally instead of failing outright.

I think a solution that is more oriented to AP would be to remove the validation between the new and old invoker lists to ensure availability. However, this validation was added via a separate PR submitted by another PMC member, so we need to confirm the intention behind this modification.

@RainYuY
Member

RainYuY commented Dec 24, 2025

> I think a solution that is more oriented to AP would be to remove the validation between the new and old invoker lists to ensure availability. However, this validation was added via a separate PR submitted by another PMC member, so we need to confirm the intention behind this modification.

I don’t have a better solution yet and I’m still thinking about it. But I’m wondering why this restriction exists, so I’m waiting for Kevin to give me an answer LOL. If I don’t get a reply, I’ll call him this Friday ^v^. @AlbumenJ

@RainYuY
Member

RainYuY commented Dec 31, 2025

> You are right. I found that during a provider's release period, overall Dubbo call latency increased because of lock blocking, and this lasted until the provider completed its release. I wonder whether there is a better way to solve this, but for now all I can think of is taking the read lock first so that calls can still proceed normally instead of failing outright.

@vqianxiao I have checked this. The check is there for the multiple-chain design, so it cannot be removed. However, your issue also needs to be addressed. I still have concerns about this scenario, but after careful consideration I think using a fair lock would be a viable solution: with a fair ReentrantReadWriteLock, a waiting writer acquires the lock before readers that arrive after it, so the refresh cannot be starved indefinitely. Although this may cost some performance, it is acceptable for the sake of data consistency.
Do you have time to verify whether the lock behaves as designed in high-QPS scenarios? If it does not work correctly, we must consider refactoring this lock. And if you have a better solution, we will give it high priority. This is currently an important issue that we need to address soon.
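
For reference, switching to fair mode is a one-argument change, and a rough harness along the following lines could serve for the verification requested above (thread counts and durations are made-up parameters, not a benchmark from this PR):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FairLockSketch {
    public static void main(String[] args) throws InterruptedException {
        // true = fair mode: readers arriving after a waiting writer queue behind it,
        // so the refresh (write lock) is not starved by a continuous stream of reads.
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

        int readerThreads = 32; // made-up load; tune to approximate the real QPS
        CountDownLatch start = new CountDownLatch(1);

        // Readers simulate routing: repeatedly take and release the read lock.
        for (int i = 0; i < readerThreads; i++) {
            Thread reader = new Thread(() -> {
                try {
                    start.await();
                    long end = System.nanoTime() + 2_000_000_000L; // run for ~2 seconds
                    while (System.nanoTime() < end) {
                        lock.readLock().lock();
                        try {
                            // simulate routing work
                        } finally {
                            lock.readLock().unlock();
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            reader.setDaemon(true);
            reader.start();
        }

        start.countDown();
        Thread.sleep(100); // let the readers ramp up

        // The writer simulates an invoker refresh; measure how long it waits for the write lock.
        long t0 = System.nanoTime();
        lock.writeLock().lock();
        try {
            long waitedMs = (System.nanoTime() - t0) / 1_000_000;
            System.out.println("write lock acquired after " + waitedMs + " ms under reader load");
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Comparing the same harness against `new ReentrantReadWriteLock()` (non-fair) under identical load would show whether fairness makes a practical difference to the writer's wait time.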

@RainYuY added the status:need-discussion label and removed the type/discussion (Everything related with code discussion or question) label on Dec 31, 2025
@RainYuY added the type/bug (Bugs to being fixed) label on Dec 31, 2025
@vqianxiao
Author

> @vqianxiao I have checked this. The check is there for the multiple-chain design, so it cannot be removed. […] I think using a fair lock would be a viable solution: with a fair ReentrantReadWriteLock, a waiting writer acquires the lock before readers that arrive after it, so the refresh cannot be starved indefinitely. […] Do you have time to verify whether the lock behaves as designed in high-QPS scenarios?

I tested the fair lock and found some fluctuation in call latency during the release period, but the situation where the consumer could not make any calls at all and had to be restarted never occurred.

@RainYuY
Member

RainYuY commented Jan 7, 2026

> I tested the fair lock and found some fluctuation in call latency during the release period, but the situation where the consumer could not make any calls at all and had to be restarted never occurred.

So how about modifying your Pull Request to use a fair lock?

@vqianxiao closed this on Jan 7, 2026
@vqianxiao reopened this on Jan 7, 2026
Member

@RainYuY left a comment


LGTM, I think this PR needs more tests.

@RainYuY
Member

RainYuY commented Jan 7, 2026

@AlbumenJ PTAL
