Timeout Warnings #165

faustus123 · 2022-11-02T04:04:43Z

Question from Wouter in ePIC compSW MatterMost channel:

Wouter Deconinck (he/him)
8:49 PM

@david Lawrence On the EICrecon timeout and warmup timeout, is it technically possible to print a warning for instances when we hit, say, over 80% of the time out but still succeed, including some info from the 'recovered' thread about what it was doing that took so long? That would be very helpful in debugging timeouts.

nathanwbrei · 2022-11-03T16:27:34Z

The first part is very reasonable.

The second part is harder since by the time we recognize that we are approaching the timeout, it's too late to turn on the call stack recording to find out which source/factory/processor is taking too long. I'm still thinking through the design implications, but here goes, ordered from "Nathan is optimistic" to "Nathan is pessimistic":

We could just include instructions in the warning message about how to enable call stack recording to get the timing information for each factory/processor the next time they run it. The only downside I see is that users don't read error messages or follow directions contained therein.
We could tweak the call stack recorder so that we always know what our stack is, and expose this information to the arrow timeout mechanism. This will add a bit of overhead. (Note: We would not add this information to the overall call graph unless the parameter record_call_stack=1, which would save us a lot of string compares and memory use.)
We could enable the call stack recording for the next n events and hope that the other threads will also encounter the same slowdown. I don't consider this particularly likely, since the timeout is often caused by loading a big geometry file from a network disk or connecting to a calibrations database over a bad network, which would happen once overall (as opposed to once per thread or 'fresh' event)
We could capture the 'real' stack trace for the slow thread the same we do with the USR1 signal. The problem with this is that these stack traces are usually useless because we've compiled with optimization and without debug symbols. Plus they don't know about source/factory/processor abstractions. Which means that the top 10 or so frames will be deep in the internals of ROOT or whatever JSON library we are using.

I'm going to start by implementing (1) and see what happens when I try for (2).

nathanwbrei · 2024-11-20T02:06:19Z

As of #385, the supervisor thread uses the USR2 signal to obtain a full backtrace from any timed-out workers. It now also logs the arrow name and the event number.

faustus123 self-assigned this Nov 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Timeout Warnings #165

Timeout Warnings #165

faustus123 commented Nov 2, 2022

nathanwbrei commented Nov 3, 2022

nathanwbrei commented Nov 20, 2024

Timeout Warnings #165

Timeout Warnings #165

Comments

faustus123 commented Nov 2, 2022

nathanwbrei commented Nov 3, 2022

nathanwbrei commented Nov 20, 2024