Improve ReplicationValve fault tolerance in clustered environments #968
Closed
YCharanGowda wants to merge 10 commits into apache:main from
Conversation
- Wrap each replication send operation in individual try-catch blocks to prevent one failure from halting all subsequent sends
- Add specific error messages for replication and cross-context send failures
- Remove FIXME comments as the issue is resolved
- Use a single error message key for all replication send failures
- Simplifies error logging and reduces message duplication
- Aligns with reviewer feedback to keep only one message
Author
Updated based on feedback:
This change directly addresses the feedback to "keep only one message."
Contributor
I merged a derivative. I'm not convinced this adds much, but why not.
Problem
In Apache Tomcat's clustering feature, the ReplicationValve handles sending session replication messages to cluster nodes. Previously, all send operations (invalid sessions, session replication, and cross-context sessions) were wrapped in a single try-catch block. If any send failed (e.g., due to network issues or node unavailability), the exception would propagate and skip all remaining sends, reducing cluster reliability and potentially causing data inconsistencies across nodes.

This was noted in a FIXME comment: "we have a lot of sends, but the trouble with one node stops the correct replication to other nodes!"
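The failure mode described above can be reproduced in a minimal sketch. This is not Tomcat's code: the class `SingleTryCatchSketch`, the `Send` interface, and the `sendAll` helper are hypothetical stand-ins, and only the three operation names come from the description. It shows how a single try-catch around several sends lets the first failure skip every send after it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the pre-change structure; not Tomcat's actual code.
class SingleTryCatchSketch {

    // Hypothetical functional interface representing one send operation.
    interface Send {
        void run() throws Exception;
    }

    // Runs every send inside ONE try-catch and returns the names that completed.
    static List<String> sendAll(List<String> names, List<Send> sends) {
        List<String> completed = new ArrayList<>();
        try {
            for (int i = 0; i < sends.size(); i++) {
                sends.get(i).run();
                completed.add(names.get(i));
            }
        } catch (Exception e) {
            // Single handler: once any send throws, the remaining sends never run.
        }
        return completed;
    }

    public static void main(String[] args) {
        List<String> names = List.of(
                "sendInvalidSessions",
                "sendSessionReplicationMessage",
                "sendCrossContextSession");
        List<Send> sends = List.of(
                () -> { },
                () -> { throw new Exception("node unavailable"); },
                () -> { });
        // Prints [sendInvalidSessions]: the third send is skipped.
        System.out.println(sendAll(names, sends));
    }
}
```

In this sketch the cross-context send never runs, even though nothing is wrong with it, which is the behavior the FIXME comment complains about.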
Solution
- Split the single try-catch block in sendReplicationMessage() into individual try-catch blocks for each send operation (sendInvalidSessions, sendSessionReplicationMessage, and sendCrossContextSession).
- Added new error message keys to LocalStrings.properties for better logging and diagnostics:
  - ReplicationValve.send.replication.failure
  - ReplicationValve.send.crosscontext.failure
Files Changed
- java/org/apache/catalina/ha/tcp/ReplicationValve.java: Modified sendReplicationMessage() method
- java/org/apache/catalina/ha/tcp/LocalStrings.properties: Added new error message keys
Impact
Testing
Related
ReplicationValve.java (line 373)
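For reference, the approach described under Solution (one try-catch per send operation) might look roughly like the sketch below. It is an illustration under assumptions rather than the merged change: `IsolatedSendsSketch`, `Send`, and `sendAll` are hypothetical names, and failures are collected into a list instead of being logged through Tomcat's string manager.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of per-operation isolation; not Tomcat's actual code.
class IsolatedSendsSketch {

    // Hypothetical functional interface representing one send operation.
    interface Send {
        void run() throws Exception;
    }

    // Runs each send in its OWN try-catch, recording failures instead of aborting.
    static List<String> sendAll(Map<String, Send> sends, List<String> failures) {
        List<String> completed = new ArrayList<>();
        for (Map.Entry<String, Send> entry : sends.entrySet()) {
            try {
                entry.getValue().run();
                completed.add(entry.getKey());
            } catch (Exception e) {
                // A real valve would log here; the sketch just records the failure.
                failures.add(entry.getKey() + ": " + e.getMessage());
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the sends in insertion order.
        Map<String, Send> sends = new LinkedHashMap<>();
        sends.put("sendInvalidSessions", () -> { });
        sends.put("sendSessionReplicationMessage",
                () -> { throw new Exception("node unavailable"); });
        sends.put("sendCrossContextSession", () -> { });

        List<String> failures = new ArrayList<>();
        // Prints [sendInvalidSessions, sendCrossContextSession]
        System.out.println(sendAll(sends, failures));
        // Prints [sendSessionReplicationMessage: node unavailable]
        System.out.println(failures);
    }
}
```

The trade-off is that a failure in one operation no longer stops the others, at the cost of one handler (and potentially one log entry) per operation.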