Improve ReplicationValve fault tolerance in clustered environments #968

Closed
YCharanGowda wants to merge 10 commits into apache:main from YCharanGowda:fix/replication-valve-fault-tolerance

@YCharanGowda

Problem

In Apache Tomcat's clustering feature, the ReplicationValve handles sending session replication messages to cluster nodes. Previously, all send operations (invalid sessions, session replication, and cross-context sessions) were wrapped in a single try-catch block. If any send failed (e.g., due to network issues or node unavailability), the exception would propagate and skip all remaining sends, reducing cluster reliability and potentially causing data inconsistencies across nodes.

This was noted in a FIXME comment: "we have a lot of sends, but the trouble with one node stops the correct replication to other nodes!"

Solution

  • Split the single try-catch block in sendReplicationMessage() into individual try-catch blocks for each send operation (sendInvalidSessions, sendSessionReplicationMessage, and sendCrossContextSession).
  • If one send fails, the remaining sends still run, which improves fault tolerance without changing behavior when every send succeeds.
  • Added specific error messages in LocalStrings.properties for better logging and diagnostics:
    • ReplicationValve.send.replication.failure
    • ReplicationValve.send.crosscontext.failure
  • Removed the FIXME comments since the issue is now addressed.
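
The difference between the two shapes described above can be sketched in isolation. This is a minimal illustration, not the code from ReplicationValve.java: the class `SendIsolationSketch`, the `SendStep` interface, and the simulated failure are hypothetical stand-ins, and only the three step names are taken from the description.

```java
import java.util.ArrayList;
import java.util.List;

public class SendIsolationSketch {

    /** Hypothetical stand-in for one send operation; not a Tomcat type. */
    public interface SendStep {
        void run() throws Exception;
    }

    /** Names of the three sends listed in the PR description, in order. */
    public static List<String> stepNames() {
        return List.of("sendInvalidSessions",
                "sendSessionReplicationMessage",
                "sendCrossContextSession");
    }

    /** Simulated steps (assumption): the first fails, the other two succeed. */
    public static List<SendStep> stepsWithFirstFailing() {
        List<SendStep> steps = new ArrayList<>();
        steps.add(() -> { throw new Exception("simulated: node unreachable"); });
        steps.add(() -> { });
        steps.add(() -> { });
        return steps;
    }

    /** Single try-catch around all sends: the first failure skips the rest. */
    public static List<String> sendAllInOneBlock(List<String> names, List<SendStep> steps) {
        List<String> completed = new ArrayList<>();
        try {
            for (int i = 0; i < steps.size(); i++) {
                steps.get(i).run();
                completed.add(names.get(i));
            }
        } catch (Exception e) {
            System.err.println("send failed, remaining sends skipped: " + e.getMessage());
        }
        return completed;
    }

    /** One try-catch per send: a failure is logged and the next send still runs. */
    public static List<String> sendEachIsolated(List<String> names, List<SendStep> steps) {
        List<String> completed = new ArrayList<>();
        for (int i = 0; i < steps.size(); i++) {
            try {
                steps.get(i).run();
                completed.add(names.get(i));
            } catch (Exception e) {
                System.err.println(names.get(i) + " failed, continuing: " + e.getMessage());
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        System.out.println("single block: "
                + sendAllInOneBlock(stepNames(), stepsWithFirstFailing()));
        System.out.println("isolated:     "
                + sendEachIsolated(stepNames(), stepsWithFirstFailing()));
    }
}
```

With the first step failing, the single-block variant completes no sends, while the isolated variant still completes `sendSessionReplicationMessage` and `sendCrossContextSession`. Whether that matters in practice depends on whether a real cluster failure tends to affect one send or all of them.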

Files Changed

  • java/org/apache/catalina/ha/tcp/ReplicationValve.java: Modified sendReplicationMessage() method
  • java/org/apache/catalina/ha/tcp/LocalStrings.properties: Added new error message keys

Impact

  • Positive: Enhances robustness in high-availability setups where network failures are common.
  • Risk: Low. Successful sends behave as before; only the error-handling path changes.
  • Backward Compatible: Yes, no breaking changes.

Testing

  • Verified compilation without errors
  • Changes are minimal and isolated to error handling paths
  • Recommend testing in a clustered environment to confirm sends continue on failure

Related

  • Resolves FIXME in ReplicationValve.java (line 373)

YCHARAN and others added 9 commits March 10, 2026 18:52
@rmaucher
Contributor

  • Having 3 different messages is not really needed, let's keep only one.
  • About the core idea, why not I guess. I would think if the cluster is not working, none of the three operations is going to work. Also the cluster state is bad if any of these fail.
    So: maybe but the actual improvement is very questionable :/

- Use a single error message key for all replication send failures
- Simplifies error logging and reduces message duplication
- Aligns with reviewer feedback to keep only one message
@YCharanGowda
Author

Updated based on feedback:

  • Consolidated 3 separate error messages into 1 generic message
  • Removed specific message keys from LocalStrings.properties
  • Simplifies logging while maintaining error handling

This change directly addresses the feedback to "keep only one message."

@rmaucher
Contributor

I merged a derivative. I'm not convinced this adds much, but why not.

@rmaucher rmaucher closed this Mar 27, 2026
