reactor: more info, robustness on segfault #2691

travisdowns · 2025-03-20T16:21:36Z

On segfault we execute a handler that provides information including a backtrace. This currently emits all information in a single write call after collecting it in a buffer. If anything goes wrong, e.g., the backtrace() call itself crashes, then no information will be emitted. The backtrace() call is not signal safe in theory, and in practice the situation seems mixed as to its safety. So it is not unlikely that situations may arise where no output can be emitted on SIGSEGV.

Because we catch the signal and then re-raise it using pthread_kill, the specific information about the IP is lost in re-raise: this prevents the line in syslog which usually captures information about segfaults from appearing at all. So we may be left without useful information after a crash.

In this change, we emit additional information before the backtrace() call, using only operations that are unlikely to fail, and we emit each piece with a separate write(2) call, so that if there is a failure at any point we at least have the information emitted up to that point.

After this, the start of the output on segfault looks like so:

Segmentation fault, si_pid: 0, si_addr: 0000000000000000, ip: 0000579ba959751f
Segmentation fault resolved ip: 0000000005e2751f in [0000579ba3770000+000000000e3f98d8] 
Segmentation fault on shard 0, in scheduling group admin.
Backtrace:
 ....
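
In the sample above, the resolved ip appears to be the faulting ip expressed as an offset from the start of the object that contains it, with the bracketed pair read as [begin+size]. The sketch below shows that arithmetic under this assumption; the mapped_object struct and resolve_ip function are hypothetical names for illustration, not Seastar's types.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical description of a loaded object's address range: [begin, end).
struct mapped_object {
    std::uint64_t begin;
    std::uint64_t end;
};

// Return the offset of `ip` within `so`, or nullopt if `ip` lies outside it.
// Pure integer arithmetic, so it is safe to do inside a signal handler.
constexpr std::optional<std::uint64_t> resolve_ip(std::uint64_t ip,
                                                  const mapped_object& so) noexcept {
    if (ip < so.begin || ip >= so.end) {
        return std::nullopt;
    }
    return ip - so.begin;
}
```

With the values from the sample output (begin 0x579ba3770000, size 0xe3f98d8, ip 0x579ba959751f), this yields 0x5e2751f, matching the "resolved ip" line.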

xemul · 2025-03-24T08:58:06Z

src/core/reactor.cc

+    print_zero_padded_hex_safe(f.so->end - f.so->begin);
+    print_safe("]");
+
+    print_with_backtrace("\nSegmentation fault");


Why not append this \n to the previous line? Like print_safe("]\n");

Good eye, done.

travisdowns · 2025-03-26T14:11:53Z

Thanks for the merge @xemul. This was related to #2697, and I have a follow on PR to add more robustness to the fault handler.

travisdowns marked this pull request as draft March 20, 2025 17:46

travisdowns force-pushed the td-segfault-better branch from 0084158 to a4f9ebd Compare March 20, 2025 17:57

travisdowns marked this pull request as ready for review March 20, 2025 18:13

xemul reviewed Mar 24, 2025

travisdowns force-pushed the td-segfault-better branch from a4f9ebd to 3de5dd1 Compare March 24, 2025 16:27

xemul closed this in 320f13a Mar 26, 2025

travisdowns mentioned this pull request Apr 1, 2025

segault logging could be more robust #2710

Open
