Timeout Warnings #165

Open
faustus123 opened this issue Nov 2, 2022 · 2 comments

@faustus123
Collaborator

Question from Wouter in ePIC compSW MatterMost channel:

Wouter Deconinck (he/him)
8:49 PM

@david Lawrence On the EICrecon timeout and warmup timeout, is it technically possible to print a warning for instances when we hit, say, over 80% of the timeout but still succeed, including some info from the 'recovered' thread about what it was doing that took so long? That would be very helpful in debugging timeouts.
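
To illustrate the first part of the request, here is a minimal sketch (not the actual EICrecon/JANA2 supervisor code; the names and the 80% threshold are hypothetical) of a supervisor-side check that warns once a worker has used a configurable fraction of its timeout budget without yet exceeding it:

```cpp
// Hypothetical "soft warning" check, run periodically by a supervisor thread.
#include <iostream>

struct WorkerStatus {
    double seconds_since_last_heartbeat;  // time since the worker last reported progress
    bool warned = false;                  // warn only once per stall
};

void check_worker(WorkerStatus& worker, double timeout_s, double warn_fraction = 0.8) {
    double elapsed = worker.seconds_since_last_heartbeat;
    if (elapsed >= timeout_s) {
        std::cerr << "ERROR: worker exceeded timeout (" << elapsed << " s)\n";
    }
    else if (!worker.warned && elapsed >= warn_fraction * timeout_s) {
        std::cerr << "WARNING: worker has used " << (100.0 * elapsed / timeout_s)
                  << "% of its timeout budget\n";
        worker.warned = true;
    }
}
```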

@nathanwbrei
Collaborator

The first part is very reasonable.

The second part is harder: by the time we recognize that we are approaching the timeout, it's too late to turn on call stack recording to find out which source/factory/processor is taking too long. I'm still thinking through the design implications, but here are the options, ordered from "Nathan is optimistic" to "Nathan is pessimistic":

  1. We could just include instructions in the warning message about how to enable call stack recording so that the timing information for each factory/processor is available on the next run. The only downside I see is that users don't read error messages or follow the directions contained therein.

  2. We could tweak the call stack recorder so that we always know what our current stack is, and expose this information to the arrow timeout mechanism (a rough sketch appears below). This would add a bit of overhead. (Note: we would not add this information to the overall call graph unless the parameter record_call_stack=1 is set, which would save us a lot of string compares and memory use.)

  3. We could enable call stack recording for the next n events and hope that the other threads encounter the same slowdown. I don't consider this particularly likely, since the timeout is often caused by loading a big geometry file from a network disk or connecting to a calibrations database over a bad network, which happens once overall (as opposed to once per thread or per 'fresh' event).

  4. We could capture the 'real' stack trace for the slow thread, the same way we do with the USR1 signal. The problem is that these stack traces are usually useless because we compile with optimization and without debug symbols. Plus, they know nothing about the source/factory/processor abstractions, which means that the top 10 or so frames will be deep in the internals of ROOT or whatever JSON library we are using.

I'm going to start by implementing (1) and see what happens when I try for (2).
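
As a rough sketch of option (2), assuming hypothetical names (this is not the existing JANA2 call-graph recorder API): each worker maintains a lightweight, always-on record of which source/factory/processor it is currently inside, and the supervisor snapshots it when a timeout or near-timeout fires.

```cpp
// Illustration of an always-on "current stack" per worker that a supervisor
// thread can read safely, without enabling full call-graph recording.
#include <iostream>
#include <memory>
#include <mutex>
#include <string>
#include <vector>

class CurrentCallStack {
public:
    void Push(std::string frame) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_frames.push_back(std::move(frame));
    }
    void Pop() {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (!m_frames.empty()) m_frames.pop_back();
    }
    // Called from the supervisor thread; copies the frames under the lock.
    std::vector<std::string> Snapshot() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_frames;
    }
private:
    mutable std::mutex m_mutex;
    std::vector<std::string> m_frames;
};

int main() {
    // The supervisor would hold one such stack per worker.
    auto worker_stack = std::make_shared<CurrentCallStack>();

    // Worker side: push/pop as each component is entered and left.
    worker_stack->Push("JEventSource: podio_input");       // hypothetical names
    worker_stack->Push("JFactory: TrackerHitReconstruction");

    // Supervisor side, on (near-)timeout: report what the worker was doing.
    for (const auto& frame : worker_stack->Snapshot()) {
        std::cerr << "  stalled inside: " << frame << "\n";
    }
}
```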

faustus123 self-assigned this on Nov 7, 2022
@nathanwbrei
Collaborator

As of #385, the supervisor thread uses the USR2 signal to obtain a full backtrace from any timed-out workers. It now also logs the arrow name and the event number.
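
For reference, a minimal sketch of the signal-based mechanism described above (an illustration only; the actual implementation in #385 differs in detail): the supervisor sends SIGUSR2 to the timed-out worker thread, and the worker's signal handler dumps a raw backtrace.

```cpp
// Supervisor interrupts a stalled worker with SIGUSR2; the handler runs on
// the worker thread and prints its stack frames.
#include <chrono>
#include <csignal>
#include <execinfo.h>   // backtrace, backtrace_symbols_fd (glibc/POSIX)
#include <pthread.h>
#include <thread>
#include <unistd.h>

static void usr2_handler(int /*sig*/) {
    void* frames[64];
    int n = backtrace(frames, 64);
    // backtrace_symbols_fd writes directly to a file descriptor and is
    // async-signal-safe, unlike backtrace_symbols.
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
}

int main() {
    std::signal(SIGUSR2, usr2_handler);

    std::thread worker([] {
        // Simulate a worker that is stuck in a long operation.
        std::this_thread::sleep_for(std::chrono::seconds(5));
    });
    pthread_t worker_handle = worker.native_handle();

    // Supervisor side: once the worker is deemed timed out, interrupt it and
    // let its handler report where it is stuck.
    std::this_thread::sleep_for(std::chrono::seconds(1));
    pthread_kill(worker_handle, SIGUSR2);

    worker.join();
}
```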
