hanging server #2041
(For completeness' sake, the homepage and eventlog were still loadable, but were totally frozen data-wise, as if cached. The info on the homepage didn't change for the 3 hours it was hung. All individual test pages, as well as the test creation page, perpetually 504'd per vondele's comments.)
Is it theoretically possible for one of the threads to deadlock at that point? Curious... EDIT: I guess yes if …
Ok. Not sure if this is a smoking gun, but it would probably be safer to first release …
Not sure if this is related:
It's not clear to me or vondele, but perhaps the various overloads, and the resulting stale tasks requiring cleanup, increased the odds of a hang? (But if so, I would have expected the earlier, worse overload to cause a hang rather than the second, lesser one.) (Before the first overload, we had roughly 70k cores. Between the first and second overload, we never really had more than 62k cores, and bounced between 62k and down to about 48k at the low points. The hang occurred at 51812 cores, around 10-15 minutes after the lesser overload, as stated before.) (Both overloads and the resulting stale tasks could be dug out of the event log, should one desire to do so.)
Ok, I think I understand. When … The main issue is that both … The solution is to protect …
This bug was my fault (introduced in #2020)... Sorry about that. I guess it would be useful to do an audit of all locks being used in Fishtest and the order in which they are acquired to avoid such cycles in the future.
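As a generic illustration of such a cycle (not Fishtest's actual code; the lock and function names below are made up): two threads that take the same pair of locks in opposite order can each end up waiting for the lock the other holds.

```python
import threading

# Hypothetical locks standing in for two of the locks discussed above.
run_lock = threading.Lock()
cache_lock = threading.Lock()

def update_run():
    # Path 1 acquires run_lock, then cache_lock.
    with run_lock:
        with cache_lock:
            pass  # touch the run and its cached data

def flush_cache_deadlock_prone():
    # Path 2 acquires the same locks in the opposite order.
    # If it holds cache_lock while update_run() holds run_lock,
    # both threads block forever: a lock-order cycle.
    with cache_lock:
        with run_lock:
            pass

def flush_cache_fixed():
    # Fix: every code path acquires the locks in the same global order
    # (or the contended state gets its own dedicated lock).
    with run_lock:
        with cache_lock:
            pass
```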
Tonight I will make a PR with a fix.
While I'm not excluding that there is a bug in that PR, similar hangs have been observed prior to it. There might thus be similar issues lurking, so you might want to keep an eye open for that. However, this is the first time we have this kind of trace info, and I'm happy to see it turns out to be useful. We know what we can do if it happens again.
Current RunDb lock acquire/release graph
Which tool did you use for this?
It was done by hand, no tools.
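For what it's worth, once such an acquire/release graph is written down as a list of edges, checking it for cycles can be automated. A small sketch (the edge list below is hypothetical, not Fishtest's real graph; an edge (a, b) means "lock b is acquired while lock a is held"):

```python
# Detect cycles in a lock acquisition-order graph via depth-first search.
edges = [
    ("run_lock", "cache_lock"),
    ("cache_lock", "run_cache_lock"),
    # ("run_cache_lock", "run_lock"),  # uncommenting this edge closes a cycle
]

def find_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)

    visiting, visited = set(), set()

    def dfs(node, path):
        visiting.add(node)
        path.append(node)
        for nxt in graph.get(node, []):
            if nxt in visiting:                      # back edge -> cycle
                return path[path.index(nxt):] + [nxt]
            if nxt not in visited:
                cycle = dfs(nxt, path)
                if cycle:
                    return cycle
        visiting.discard(node)
        visited.add(node)
        path.pop()
        return None

    for node in list(graph):
        if node not in visited:
            cycle = dfs(node, [])
            if cycle:
                return cycle
    return None

print(find_cycle(edges) or "no lock-order cycle found")
```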
We use a separate lock to update aggregates. To this end we extend self.active_run_lock() with an optional argument "name" to be able to have different locks associated with the same run_id.
I made a PR which presumably will fix this issue. See #2042. However, the semantics of the various locks used in Fishtest should be clarified.
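One way the optional "name" argument could work, as a rough sketch only (the real RunDb code in #2042 may differ): keep one lock per (run_id, name) pair, so that updating a run's aggregates no longer contends on the same lock as updating the run itself.

```python
import threading

class RunDb:
    def __init__(self):
        # Guards the lock registry itself.
        self._lock_registry_lock = threading.Lock()
        self._run_locks = {}

    def active_run_lock(self, run_id, name="run"):
        # One lock per (run_id, name) pair; different "name" values give
        # independent locks for the same run_id. Illustrative only.
        key = (str(run_id), name)
        with self._lock_registry_lock:
            return self._run_locks.setdefault(key, threading.Lock())

# Hypothetical usage: independent locks for the same run_id.
# db = RunDb()
# with db.active_run_lock(run_id):                      # protects the run itself
#     ...
# with db.active_run_lock(run_id, name="aggregates"):   # protects aggregate updates
#     ...
```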
- flush_buffers: syncs the oldest cache entry (period: 1s)
- clean_cache: evicts old cache entries (period: 60s)
- scavenge_dead_tasks: (period: 60s)

This PR should also fix the deadlock #2041.
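A rough sketch of what a set of periodic maintenance tasks like this could look like (task names and periods taken from the summary above; the actual implementation in the PR may be structured differently):

```python
import threading
import time

# One background thread driving several periodic tasks, each with its own
# interval. The callables below are placeholders, not the real Fishtest code.
TASKS = [
    # (period in seconds, callable)
    (1.0,  lambda: print("flush_buffers: sync oldest cache entry")),
    (60.0, lambda: print("clean_cache: evict old cache entries")),
    (60.0, lambda: print("scavenge_dead_tasks")),
]

def worker(stop_event):
    next_due = [time.monotonic() + period for period, _ in TASKS]
    while not stop_event.is_set():
        now = time.monotonic()
        for i, (period, task) in enumerate(TASKS):
            if now >= next_due[i]:
                task()
                next_due[i] = now + period
        stop_event.wait(0.5)  # coarse tick; adequate for second-scale periods

# stop = threading.Event()
# threading.Thread(target=worker, args=(stop,), daemon=True).start()
```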
Overnight the server went into a hanging state, with the same signature as we have observed previously: pserve 6545 at 100% load.
Using the new facility to get traces, I sampled both 6543 and 6545. Interestingly, 6545 doesn't show any thread active in the fishtest python code. 6543, however, seems to be deadlocked, or at least stuck in a fixed pattern in which all threads in fishtest code are waiting on locks.
I have the feeling 6543 is in a deadlock situation.
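For reference, thread stacks of a running Python process can be sampled with standard-library facilities; a sketch below (this may not be the exact mechanism Fishtest's tracing facility uses):

```python
import faulthandler
import signal
import sys
import threading
import traceback

# Option 1: dump all thread tracebacks to stderr when the process
# receives SIGUSR1 (POSIX only).
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Option 2: dump stacks programmatically and flag threads that appear
# to be sitting inside a lock acquire.
def dump_stacks():
    names = {t.ident: t.name for t in threading.enumerate()}
    for ident, frame in sys._current_frames().items():
        stack = traceback.format_stack(frame)
        blocked = any("acquire" in line for line in stack)
        print(f"Thread {names.get(ident, ident)}"
              f"{' (possibly waiting on a lock)' if blocked else ''}:")
        print("".join(stack))
```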