Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZOOKEEPER-4858: Remove the lock contention between snapshotting and the sync operation #2185

Merged
merged 1 commit into from
Dec 21, 2024

Conversation

li4wang
Copy link
Contributor

@li4wang li4wang commented Sep 7, 2024

Author: Li Wang [email protected]

@li4wang
Copy link
Contributor Author

li4wang commented Sep 12, 2024

@eolivelli can I get a review for this? Thanks

Copy link
Contributor

@anmolnar anmolnar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are u sure it's safe just removing it?
takeSnapshot and restoreSnapshot should be two mutually exclusive operation, you probably need to introduce some more specific lock mechanism to avoid object-level locking.

@li4wang
Copy link
Contributor Author

li4wang commented Oct 16, 2024

Thanks for reviewing it, @anmolnar.

Yes, we would still want exclusive locking between snapshot and restore to make sure data tree is not replaced by restoring operation while snapshotting is in progress.

@li4wang
Copy link
Contributor Author

li4wang commented Oct 16, 2024

added a specific lock just for snapshot and restore operations. @anmolnar would you mind taking a look at it. Thanks.

@li4wang
Copy link
Contributor Author

li4wang commented Oct 21, 2024

@anmolnar I've addressed your comment. Can you take a quick look at it? Thanks.

@anmolnar
Copy link
Contributor

anmolnar commented Dec 9, 2024

Are you sure that we don't need to synchronize between snapshotting and sync()?

Comment on lines +625 to +628
synchronized (snapshotAndRestoreLock) {
// set to the new zkDatabase
setZKDatabase(newZKDatabase);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you better extend the scope of sync block to createSessionTracker() call.

Copy link
Contributor Author

@li4wang li4wang Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    protected void createSessionTracker() {
        sessionTracker = new SessionTrackerImpl(this, zkDb.getSessionWithTimeOuts(), tickTime, createSessionTrackerServerId, getZooKeeperServerListener());
    }
 public File takeSnapshot(boolean syncSnap, boolean isSevere) throws IOException {
        try {
            synchronized (snapshotAndRestoreLock) {
                snapFile = txnLogFactory.save(zkDb.getDataTree(), zkDb.getSessionWithTimeOuts(), syncSnap);
            }
        } catch (IOException e) {
  public ConcurrentHashMap<Long, Integer> getSessionWithTimeOuts() {
        return sessionsWithTimeouts;
    }

takeSapshot() and createSessionTracker() of resore() both read data from the sessionsWithTimeouts ConcurrentHashMap, any reason the sync block need to include it?

createSessionTracker() is a post process after swapping the ZkDatabase object of a ZookeeperServer instance. sessionsWithTimeouts is part of ZkDatabase object and read/write sync is already covered by the sync block.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anmolnar
Copy link
Contributor

anmolnar commented Dec 9, 2024

@kezhuw Thoughts?

@anmolnar
Copy link
Contributor

We ran into perf issue when had sync on the ZookeeperServer object. Whenever snapshot was taken, the sync operations were queued up in the outstanding request queue and the sync() operation latency was largely increased. (edited)
There was no synchronize between snapshotting and sync() before.
The intention of adding the synchronized was to takeSnapshot() and restore() was to sync between the two operations. I didn’t realize the sync() call also requires ZookeeperServer object lock.

When has synchronized been added to take/restore snapshot?
Are you saying that sync() was already synchronized which should prevent multiple sync's running at the same time, but accidentally you introduced snapshotting into this lock?

@li4wang
Copy link
Contributor Author

li4wang commented Dec 12, 2024

When has synchronized been added to take/restore snapshot?

synchronized on the ZookeeperServer object was added when we introduced remote backup and restore feature, specifically #1943

Are you saying that sync() was already synchronized which should prevent multiple sync's running at the same time, but accidentally you introduced snapshotting into this lock?

Yes. I didn't know that sync() is synchronized on FollowerZookeeperServer, which means it's also synchronized onZookeeperServer object. That's why we had lock contention between takingSnapshot() and sync().

Copy link
Contributor

@anmolnar anmolnar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for pulling this too long. Lgtm.

Copy link
Member

@kezhuw kezhuw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

it causes lock contention on the ZookeeperServer object with the sync operation.

Does anyone have any cue about why sync get synchronized ? I saw that sync is called only in one thread and pendingSyncs is concurrent safe.

@anmolnar
Copy link
Contributor

LGTM

it causes lock contention on the ZookeeperServer object with the sync operation.

Does anyone have any cue about why sync get synchronized ? I saw that sync is called only in one thread and pendingSyncs is concurrent safe.

I have no idea either. I think we could remove synchronize from sync() too. Or remove it only from sync().

@anmolnar
Copy link
Contributor

ZOOKEEPER-136 added synchronized to sync()

@li4wang
Copy link
Contributor Author

li4wang commented Dec 20, 2024

Thanks @anmolnar and @kezhuw

I saw that sync is called only in one thread and pendingSyncs is concurrent safe.
I think we could remove synchronize from sync() too. Or remove it only from sync().

sync() is synchronized on the ZookeeperServer object, which is accessed by multiple-threads. Through inheritance and composition, the sync() operation is currently synchronized with following API calls.

ZookeeperServer.enqueueRequest(), 
ZookeeperServer.submitRequestNow()
ZookeeperServer.startup()
ZookeeperServer.shutdown()
Learner.syncWithLeader()

sync() is used for strong consistency of not getting stale data in the read after write scenario. Not sure if it's safe to remove synchronization from it.

@li4wang
Copy link
Contributor Author

li4wang commented Dec 20, 2024

@anmolnar @kezhuw The PR is good for merge now, right? Thanks.

@kezhuw kezhuw merged commit c0e9241 into apache:master Dec 21, 2024
14 checks passed
kezhuw pushed a commit that referenced this pull request Dec 21, 2024
…he sync operation

Reviewers: anmolnar, kezhuw
Author: li4wang
Closes #2185 from li4wang/ZOOKEEPER-4858

(cherry picked from commit c0e9241)
Signed-off-by: Kezhu Wang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants