
Memory store leader election - REAP-2 #1533

Merged: 8 commits merged into master on Dec 18, 2024

Conversation

@adejanovski (Contributor) commented Dec 12, 2024

Fixes #1519

The memory storage backend didn't implement any of the IDistributedStorage leader election methods.
This was fine until we started using leader election for segment scheduling, which doesn't require running in distributed mode.
As a consequence, Reaper using the memory store would schedule one new segment at each poll, even if the replicas were already busy processing another segment.

This PR introduces a class which manages locks on replicas for segments, and moves the required methods from IDistributedStorage to IStorage so they can be implemented by the memory storage backend.
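
For illustration, here is a minimal sketch of what such an in-memory replica lock manager could look like, based only on the lockRunningRepairsForNodes signature and the fields discussed later in this conversation. The class name MemoryReplicaLockManager and the lock key format are hypothetical, not the PR's actual code.

import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class MemoryReplicaLockManager {

  private static class LockInfo {
    final UUID runId;
    final long expirationTime;

    LockInfo(UUID runId, long expirationTime) {
      this.runId = runId;
      this.expirationTime = expirationTime;
    }
  }

  // Replica locks are keyed per (replica, runId) so different repair runs don't block each other.
  private final Map<String, LockInfo> replicaLocks = new ConcurrentHashMap<>();
  // Bookkeeping of which segments hold locks within each repair run.
  private final Map<UUID, Set<UUID>> repairRunToSegmentLocks = new ConcurrentHashMap<>();
  private final long ttlSeconds;

  public MemoryReplicaLockManager(long ttlSeconds) {
    this.ttlSeconds = ttlSeconds;
  }

  private String getReplicaLockKey(String replica, UUID runId) {
    return replica + "/" + runId;
  }

  public synchronized boolean lockRunningRepairsForNodes(UUID runId, UUID segmentId, Set<String> replicas) {
    long now = System.currentTimeMillis();
    // Refuse if any replica already holds an unexpired lock within the same repair run.
    boolean alreadyLocked = replicas.stream()
        .map(replica -> replicaLocks.get(getReplicaLockKey(replica, runId)))
        .anyMatch(lock -> lock != null && lock.expirationTime > now);
    if (alreadyLocked) {
      return false;
    }
    long expiration = now + ttlSeconds * 1000;
    replicas.forEach(replica -> replicaLocks.put(getReplicaLockKey(replica, runId), new LockInfo(runId, expiration)));
    repairRunToSegmentLocks.computeIfAbsent(runId, id -> ConcurrentHashMap.newKeySet()).add(segmentId);
    return true;
  }
}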

@adejanovski changed the title from "Memory store leader election" to "Memory store leader election - REAP-2" on Dec 12, 2024
@Miles-Garnsey (Contributor):

@adejanovski this is very tasty, but your CI is failing at the moment. Also, can you remind me how to manually test this? :)

@Miles-Garnsey (Contributor) left a comment

I've put a bunch of feedback in as a first review. I'll likely want to consider this more - this isn't an easy PR, so nice work on getting something together. It feels like it's very close.

I have a bunch of comments around naming since I had a bit of a tricky time understanding some of what was going on at first. These are always nits, but given the functionality is complex I'd love to see some of them straightened out for future readers.

I also think we could handle the locking we're doing much better, and suggest we revisit that now given some of the use cases we want to put this into. For a first cut this is great, but I think a bit of improvement there might do us a world of good. Improvements might include:

  1. Better segmentation of what needs to be locked, since the global lock you're using isn't scalable.
  2. Being more discerning about where you're locking. You probably only need to lock on writes, but you're often acquiring the lock before reads at entry into the method.

Again, this is kind of a minor tweak but I think it'll avoid headaches down the road.
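
To make the suggestion concrete, here is a rough, illustrative sketch of finer-grained locking, assuming a map of per-run locks like the runIdLocks field that appears later in this review. It is not the PR's code, just one possible shape: reads go straight to the concurrent maps, and only the write path takes a lock scoped to the repair run.

import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

class PerRunLocking {

  private final ConcurrentHashMap<UUID, ReentrantLock> runIdLocks = new ConcurrentHashMap<>();

  // Lazily create one lock per repair run instead of using a single global lock.
  void withRunLock(UUID runId, Runnable writeSection) {
    ReentrantLock lock = runIdLocks.computeIfAbsent(runId, id -> new ReentrantLock());
    lock.lock();
    try {
      writeSection.run();
    } finally {
      lock.unlock();
    }
  }
}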

try {
  long currentTime = System.currentTimeMillis();
  // Check if any replica is already locked by another runId
  for (String replica : replicas) {
Contributor:

Nit: I think the pattern in this codebase is to use the streams API instead of for loops where possible?

Contributor Author:

fair point, I'm refactoring this to use streams.
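
For reference, a stream-based version of that check could look roughly like the sketch below. It assumes the fields used elsewhere in this PR (replicaLocks, getReplicaLockKey, and a LockInfo carrying runId and expirationTime); the method name is illustrative.

private boolean anyReplicaLockedByRun(Set<String> replicas, UUID runId) {
  long currentTime = System.currentTimeMillis();
  // True if any replica already holds an unexpired lock for this repair run.
  return replicas.stream()
      .map(replica -> replicaLocks.get(getReplicaLockKey(replica, runId)))
      .anyMatch(lockInfo -> lockInfo != null
          && lockInfo.expirationTime > currentTime
          && lockInfo.runId.equals(runId));
}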

Contributor:

I think you might have missed this one in your latest commit?

Contributor Author:

We're good now, no more loop.

Contributor:

Looks like there is still a loop here: for (String replica : replicas) {. Was this something you wanted to change?


private final ConcurrentHashMap<String, LockInfo> replicaLocks = new ConcurrentHashMap<>();
private final ConcurrentHashMap<UUID, Set<UUID>> repairRunToSegmentLocks = new ConcurrentHashMap<>();
private final ConcurrentHashMap<UUID, ReentrantLock> runIdLocks = new ConcurrentHashMap<>();
Contributor:

Issue: I thought you said you weren't going to divide the locks up by runID? 😅

Contributor Author:

yeah, I tried but then rolled back. Not far enough it seems :)

Contributor:

Did you want to roll this back? I thought we were reverting to using a global lock.

@Miles-Garnsey (Contributor) commented Dec 16, 2024

I'm starting to struggle with the length of this review, so here's a sketch of just one method done without external locks:

  public boolean lockRunningRepairsForNodes(UUID runId, UUID segmentId, Set<String> replicas) {
      // For each replica, check if a lock is held by another RepairRun.
      for (String replica : replicas) {
        replicaLocks.compute(getReplicaLockKey(replica, runId), (k,lockInfo) -> {
          long currentTime = System.currentTimeMillis();
          if (lockInfo != null && lockInfo.expirationTime > currentTime && lockInfo.runId.equals(runId)) { // lockInfo.runId.equals(runId) looks wrong to me?
            return lockInfo; // Replica is locked by another runId and not expired, return existing value.
          }
          long expirationTime = currentTime + (ttlSeconds * 1000);
          // Lock the replicas for the given runId and segmentId
          LockInfo newLockInfo = new LockInfo(runId, expirationTime);
          newLockInfo.segmentLocks.add(segmentId);
          return newLockInfo;
        });
      }
      return true;
  }

This also requires an update:

private static class LockInfo {
    UUID runId;
    long expirationTime;
    Set<UUID> segmentLocks;
    LockInfo(UUID runId, long expirationTime) {
      this.runId = runId;
      this.expirationTime = expirationTime;
      this.segmentLocks = new HashSet<>();
    }
  }

What this does:

  1. Uses ConcurrentHashMap's internal locking (which I believe is per key) instead of a global lock.
  2. Moves the segmentLocks into LockInfo, which makes it clear that they exist within a given RunID (I hope I've understood this correctly; otherwise I'll have broken things).
  3. Eliminates all external locking requirements. This will just iterate through the replicaLocks independently and update them if possible. It might be good to have a logic branch for the case lockInfo.expirationTime < currentTime so that an expired entry can be replaced without waiting for the TTL cleanup to happen, but maybe that can be handled with a simple retry after the other threadpool cleans the entry up (see the sketch below).
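
As an illustration of point 3, an expiry branch inside the compute() callback might look like the following sketch. It assumes the replicaLocks map, getReplicaLockKey helper, LockInfo class, and ttlSeconds field from the proposal above; the wrapping method is hypothetical.

private void lockReplica(String replica, UUID runId, UUID segmentId) {
  replicaLocks.compute(getReplicaLockKey(replica, runId), (key, lockInfo) -> {
    long currentTime = System.currentTimeMillis();
    if (lockInfo != null && lockInfo.expirationTime <= currentTime) {
      lockInfo = null; // Expired: treat the entry as absent and take the lock right away.
    }
    if (lockInfo != null) {
      return lockInfo; // Still held and not expired: keep the existing value untouched.
    }
    LockInfo newLockInfo = new LockInfo(runId, currentTime + (ttlSeconds * 1000));
    newLockInfo.segmentLocks.add(segmentId);
    return newLockInfo;
  });
}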

@Miles-Garnsey (Contributor) commented Dec 16, 2024

From discussions with Alex earlier:

  1. The proposal I have above where we move Set<UUID> segmentLocks inside LockInfo won't work because segments are not unique to replicas.
  2. This means that the same segment can be running on multiple replicas (in fact, in RF 3 it will be running on all three), and all replicas involved in a particular segment need to be locked.
  3. We discussed the possibility of keying a map by segment with replica ID as the value instead, but Alex pointed out that even then, you may be able to lock one replica, but if you fail to lock all three then you need to roll back any locked ones before proceeding.
  4. Because the Hashmap's mutex can't be used to lock all three replicas at the same time, the unlocking process on one replica then might also fail, which could lead to deadlocks.
  5. Effectively, what we'd need is a way to do a lookup for the three replicas the segment is supposed to run on, then lock all three atomically. I don't think any Map implementation gives us the ability to lock multiple entries at the same time, which is the critical issue here.

I feel like there's some concurrent bimap that might resolve this whole situation, but since I've been reminded that we will currently only manage one cluster per instance in the proposed use case, we can probably defer the more complicated implementation for later.
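
For completeness, a rough sketch of the deferred, coarse-grained option: a single lock around the whole check-then-lock step makes locking all replicas of a segment atomic, so partial failures never need rolling back. The class and method names below are illustrative only, not the PR's code.

import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

class GlobalLockFallback {

  // Keys combine replica and run so different repair runs can still share a replica.
  private final Set<String> lockedReplicas = new HashSet<>();

  // Synchronizing the whole method makes "check every replica, then lock every replica"
  // one atomic step, at the cost of scalability - acceptable while one instance manages one cluster.
  synchronized boolean lockReplicasForSegment(UUID runId, Set<String> replicas) {
    for (String replica : replicas) {
      if (lockedReplicas.contains(replica + "/" + runId)) {
        return false; // Another segment of this run already holds one of the replicas.
      }
    }
    for (String replica : replicas) {
      lockedReplicas.add(replica + "/" + runId);
    }
    return true;
  }

  synchronized void releaseReplicasForSegment(UUID runId, Set<String> replicas) {
    for (String replica : replicas) {
      lockedReplicas.remove(replica + "/" + runId);
    }
  }
}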

@Miles-Garnsey (Contributor) left a comment

I've put more comments, as it seems you've said in comments that you've made certain updates, but they aren't all reflected in the code. You may have rolled some back accidentally while reverting other changes (especially the use of a map of locks).

I'm approving anyway since the tests pass and I don't want you to be blocked from merging pending my review. Whether you make the rest of the changes we've discussed is up to you, since I think they are all non-blocking and this is functionally correct AFAICT.

.map(replica -> replicaLocks.get(getReplicaLockKey(replica, runId)))
.anyMatch(lockInfo -> lockInfo != null
    && lockInfo.expirationTime > currentTime && lockInfo.runId.equals(runId));

Contributor:

Issue: This looks wrong: you're saying to return false only if the value of this entry has a runId equal to the runId you're trying to lock. Don't you want to return false regardless of who holds the lock?

Contributor Author:

No, we allow concurrency on replicas across repair runs. It's within a repair run that we disallow it.
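
A small, hypothetical example of these semantics, using the MemoryReplicaLockManager sketched in the description above (the TTL and node names are arbitrary):

import java.util.Set;
import java.util.UUID;

class LockSemanticsExample {
  public static void main(String[] args) {
    MemoryReplicaLockManager manager = new MemoryReplicaLockManager(30);
    UUID run1 = UUID.randomUUID();
    UUID run2 = UUID.randomUUID();
    Set<String> replicas = Set.of("node1", "node2", "node3");

    // Within run1, the second segment is refused because the replicas are already busy.
    System.out.println(manager.lockRunningRepairsForNodes(run1, UUID.randomUUID(), replicas)); // true
    System.out.println(manager.lockRunningRepairsForNodes(run1, UUID.randomUUID(), replicas)); // false
    // A different repair run may lock the same replicas concurrently.
    System.out.println(manager.lockRunningRepairsForNodes(run2, UUID.randomUUID(), replicas)); // true
  }
}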


private final ConcurrentHashMap<String, LockInfo> replicaLocks = new ConcurrentHashMap<>();
private final ConcurrentHashMap<UUID, Set<UUID>> repairRunToSegmentLocks = new ConcurrentHashMap<>();
private final ConcurrentHashMap<UUID, ReentrantLock> runIdLocks = new ConcurrentHashMap<>();
Contributor:

Did you want to roll this back? I thought we were reverting to using a global lock.

try {
  long currentTime = System.currentTimeMillis();
  // Check if any replica is already locked by another runId
  for (String replica : replicas) {
Contributor:

Looks like there is still a loop here: for (String replica : replicas) {. Was this something you wanted to change?

@adejanovski merged commit 664236b into master on Dec 18, 2024
24 checks passed
Development

Successfully merging this pull request may close these issues.

The memory store allows multiple segments to run per node for the same repair