Skip to content

Conversation

andresgalindo-stripe
Copy link
Collaborator

@andresgalindo-stripe andresgalindo-stripe commented Aug 11, 2025

Context

When experiencing a high rate of hash collisions we're observing bad performance in the agent and we suspect the deluge of logs contributes to the problem.

We still need to investigate why we're getting hash collisions (more on that below) but want to try and remediate the bad performance we're seeing first.

Regarding the hash collisions we don't suspect theres anything wrong with xx3 but think the method is receiving a large number of duplicate connections but we're still investigating.

Checklist

  • ./gradlew build compiles code correctly
  • Added new tests where applicable
  • ./gradlew test passes all tests
  • Extended README or added javadocs where applicable

When experiencing a high rate of hash collissions we're observing bad performance in the agent and we suspect the deluge of logs contributes to the problem.
Copy link

github-actions bot commented Aug 11, 2025

Test Results

661 tests  ±0   650 ✅ ±0   9m 14s ⏱️ +12s
152 suites ±0    11 💤 ±0 
152 files   ±0     0 ❌ ±0 

Results for commit 2241869. ± Comparison against base commit ddabb51.

♻️ This comment has been updated with latest results.

Comment on lines 52 to 55
this.metrics = new Metrics.Builder()
.id(metricGroup)
.addCounter("numHashCollisions")
.build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don' t think this is ever going to get registered. The underlying Router implementation has a getMetrics method that is used to register the metrics. You could override that here but then you would need to re-instantiate numEventsRouted and numEventsProcessed to make sure those metrics are still sent (and probably use a metricsGroup like the parent class to avoid a breaking change).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you need to register with something like "MetricsRegistry.getInstance().registerAndGet"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gracias y'all, used the singleton to register the metrics!

long hash = hashFunction.computeHash(connectionBytes);
if (ring.containsKey(hash)) {
logger.error("Hash collision when computing ring. {} hashed to a value already in the ring.", connectionId + "-" + i);
this.collisionsCounter.increment();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might still be nice to have logging for which connections are duplicates. Could we maybe track at the connection level (instead of the partition level) and emit one log line with all of the duplicate connections (not virtual nodes in the hash ring)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a debug log that should come out in a structure like:

{
  <hash>: [
    0: <first connection with hash>
    1..N: <colliding connections>
  ]
}

I chose a debug log that we'll have to explicitly opt into because I expect the structure to be huge and potentially cause this pausing like behavior we've been seeing today.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh and toString() contains all the interesting bits of info we'd need, taken from AsyncConnection.java:

    @Override
    public String toString() {
        return "AsyncConnection [host=" + host + ", port=" + port
            + ", groupId=" + groupId + ", slotId=" + slotId + ", id=" + id
            + "]";
    }

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I think toString will be called by the logger lib so I don't need to call it myself

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants