Fix - Under load and during topology changes, thread saturation can occur, causing a lockup (#2139)
* Add endpoint manager test to repro thread lockup
* fix merge
* Add explanation and Dockerfile
* Block the endpoint while it disposes instead of holding requests behind a lock. This also allows messages to flow to other endpoints while one is disposing, and lets multiple endpoints dispose at the same time (see the sketch after this list).
* Change EndpointManager Stop to StopAsync
* Ensure the EndpointReader always completes the request after sending the DisconnectRequest, so it doesn't time out during Kestrel shutdown
* Increase the timeout on a couple of tests so they fail less often
* Increase another timeout on a flaky test
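The per-endpoint blocking described above can be sketched roughly as follows. This is a hedged illustration under assumptions, not the actual Proto.Remote implementation; the names `EndpointRegistry`, `IEndpoint`, and `BlockedEndpoint` are hypothetical stand-ins.

```csharp
// Minimal sketch (not the actual Proto.Actor code): instead of one global lock
// that serializes every endpoint operation, track state per remote address so a
// disposing endpoint only blocks traffic destined for that address.
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public interface IEndpoint : IAsyncDisposable
{
    void SendMessage(object message);
}

// Placeholder endpoint that swallows messages for a blocked address.
public sealed class BlockedEndpoint : IEndpoint
{
    public void SendMessage(object message) { /* drop or dead-letter */ }
    public ValueTask DisposeAsync() => ValueTask.CompletedTask;
}

public sealed class EndpointRegistry
{
    private readonly ConcurrentDictionary<string, IEndpoint> _endpoints = new();
    private readonly Func<string, IEndpoint> _factory;

    public EndpointRegistry(Func<string, IEndpoint> factory) => _factory = factory;

    // Look up or create the endpoint for an address. No global lock, so senders
    // to other addresses are never held up by a dispose in progress elsewhere.
    public IEndpoint GetEndpoint(string address) =>
        _endpoints.GetOrAdd(address, _factory);

    // Swap in a BlockedEndpoint first, then dispose the old endpoint
    // asynchronously. Callers immediately see the blocked endpoint, and
    // several addresses can be disposed concurrently.
    public async Task BlockAndDisposeAsync(string address)
    {
        var blocked = new BlockedEndpoint();
        if (_endpoints.TryGetValue(address, out var old) &&
            _endpoints.TryUpdate(address, blocked, old))
        {
            await old.DisposeAsync(); // await instead of a blocking Wait()
        }
    }
}
```

The key point of the sketch is that the blocked placeholder is installed before the old endpoint's shutdown is awaited, so no sender waits behind a global lock and multiple addresses can be torn down at the same time.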
// This program tests lockup issues with the EndpointManager.
// TL;DR: This demonstrates the issue with the locking and blocking waits in EndpointManager, and confirms the fix.
//
// This recreates a scenario we were seeing in our production environments.
// What we saw was 30 cluster clients sending many messages to 2 of the cluster members, which were sending messages to each other depending
// on actor placement. If something happened and the 2 members had to reboot, they would end up locking up, unable to do anything.
// This scenario has been recreated more simply here: 2 members send many messages back and forth, a disconnect comes through
// from a member that recently restarted, and new connections are being opened to other members. Putting all of these together, we end up in a situation
// where many threads get stuck at a lock in EndpointManager, while the one thread inside the lock is waiting for a ServerConnector to stop.
// NOTE: this can be a bit flaky, since we are trying to reproduce a complete thread lockup, so there is a Dockerfile to run it in a more consistent
// environment. Using `--cpus="1"` with Docker makes it even more consistent, but sometimes it takes a few tries to repro.
// You will know you reproduced it when you stop seeing "This should log every second." every second. You may also see the built-in
// "ThreadPool is running hot" log, but the absence of that log is ambiguous, since if it's locked up, the check never finishes, so it never logs how long it took!
// The other indicator is that all the new connections made at the end should be logging terminations and reconnects and quickly giving up (since those members don't exist),
// but of course that won't be happening when you're locked up. Also, seeing any "terminating" message without a corresponding "terminated" message is another sign of the lockup.
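To make the lockup mechanism described in that comment concrete, here is a minimal, hypothetical sketch of the anti-pattern (a synchronous wait on async work while holding a lock) next to the shape of the fix. It is not the actual EndpointManager code; `StopConnectorAsync` is a stand-in for the ServerConnector shutdown, not a real API.

```csharp
// One thread takes a lock and then blocks synchronously on async work; every
// other thread-pool thread piles up on the same lock, so no thread is left to
// complete the async work and the process locks up.
using System;
using System.Threading.Tasks;

public static class LockupDemo
{
    private static readonly object Sync = new();

    // Anti-pattern: holding a lock across a sync-over-async wait.
    public static void RemoveEndpointBlocking()
    {
        lock (Sync)
        {
            // StopConnectorAsync needs a thread-pool thread to complete, but if
            // every pool thread is parked on the lock above, this Wait() never
            // returns and the whole system stalls.
            StopConnectorAsync().Wait();
        }
    }

    // The shape of the fix: nothing is held across the await, so waiting is
    // cooperative and other endpoints keep flowing while one shuts down.
    public static async Task RemoveEndpointAsync()
    {
        await StopConnectorAsync();
    }

    // Hypothetical stand-in for the ServerConnector shutdown.
    private static Task StopConnectorAsync() => Task.Delay(TimeSpan.FromSeconds(1));
}
```

Under load, every thread-pool thread that ends up behind that lock is parked while the one inside waits forever, which is exactly the saturation the repro program tries to provoke.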