Skip to content

[FIX] Suppression of stacktrace on a shutdown #187

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jun 3, 2025

Conversation

wallashss
Copy link
Collaborator

This PR adds a workaround to suppress logs that are dumped at the shutdown of the engine (only on V1) when vllm runs with multiprocess. The undesired behavior happens because g3log from Spyre stack/runtime overrides the signal handler from vLLM when it starts a process for the engine code. Therefore, the engine does not have a chance to gracefully shutdown.

Copy link

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

@wallashss
Copy link
Collaborator Author

bot:test
TEST_FILE=tests/e2e/test_spyre_basic.py MARKERS="spyre"

Signed-off-by: Wallas Santos <[email protected]>
@joerunde
Copy link
Collaborator

joerunde commented Jun 2, 2025

I've had a bit of experience messing with signal handlers, and the annoying problem is that setting a handler with signal.signal(sig, handler) simply overrides any handler that was previously registered for that signal. One thing I worked on back on caikit was a signal handler that would iteratively call all handlers registered for that signal, instead of just blanket overriding whatever the current handler is: https://github.com/caikit/caikit/blob/ce3fa2c129ce15a5e2095d466a8f01ec2e0c577d/caikit/runtime/server_base.py#L179
But of course, all handlers for a signal had to be added with that utility :(

Maybe we can use a similar strategy here though, to setup a handler that calls the vllm one, plus whatever is already registered:

def make_handler(signum):
    original_hander = signal.getsignal(signum)
    def handler(signal_, stack_):
        vllm.run_engine_core.signal_handler(signal_, stack_)
        original_hander(signal_, stack_)

    return handler

...

signal.signal(SIGTERM, make_handler(SIGTERM))

However, if we think that the one set by the g3log is not required for proper shutdown, then we should probably just override it to the vllm one and call it a day

@tjohnson31415
Copy link
Collaborator

I did some investigation. As I understand it, Python's getsignal would only work to see what handlers have been registered with Python. But g3log is a C++ library, so Python knows nothing about its signal registration. Other than re-compiling g3log or writing a small C++ wrapper, I don't think we can call out to its signal handler or changes its behavior.

So I think the change here to override the signal handler is a good fix. Also, the change to the handler happens after the warmup, so the g3log handler is still be active during the compliation phase when we would want to have the low-level stack trace if there is a crash.

@wallashss
Copy link
Collaborator Author

Thanks @joerunde for the feedback and suggestions. Pretty cool ideas.

However, if we think that the one set by the g3log is not required for proper shutdown, then we should probably just override it to the vllm one and call it a day

I guess it is more like that. What I observed is that g3log is overriding the signals, the original signal handler of vllm is completely overwritten and it is never called, so I pretty sure that would do to any signals as well... I wrote this PR more thinking as workaround than an ideal solution. I think an ideal solution would be setup g3log to not override other signals. Unfortunately, I did not saw in g3log docs or code ways to setup the signal handler indirectly like using environment variables. It looks like this should be done by the application itself.

While I appreciate your solution, I don't think we'll have extra benefits, because g3log have already clean up everything.

@wallashss
Copy link
Collaborator Author

@tjohnson31415 just share some thoughts at the same time as me 😄 . Thanks!

# vLLM V1 + torch_sendnn backend to be able to gracefully
# shutdown the engine.
"VLLM_SPYRE_OVERRIDE_SIGNALS_HANDLER":
lambda: os.getenv("VLLM_SPYRE_OVERRIDE_SIGNALS_HANDLER", "0"),
Copy link
Collaborator

@tjohnson31415 tjohnson31415 Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently setting the value of VLLM_SPYRE_OVERRIDE_SIGNALS_HANDLER to the string "0" which is truthy, and it would require an empty env var for a falsey value.

Also, the default should be to have a graceful shutdown, i.e. enable the override.

Suggested change
lambda: os.getenv("VLLM_SPYRE_OVERRIDE_SIGNALS_HANDLER", "0"),
lambda: bool(int(os.getenv("VLLM_SPYRE_OVERRIDE_SIGNALS_HANDLER", "1"))),

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

I set first disabled as default, because the libraries that installed g3log might need it to dump their log. However, there are other signals that we do not touch that might happen in runtime errors. So for the SIGTERM and SIGINT, which are for graceful shutdown, the override would be fine.

Co-authored-by: Travis Johnson <[email protected]>
Signed-off-by: Wallas Santos <[email protected]>
@wallashss wallashss force-pushed the wallas-graceful-shutdown branch from 18ff336 to 65664f9 Compare June 3, 2025 19:46
Copy link
Collaborator

@tjohnson31415 tjohnson31415 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR took me down a bit of a rabbit hole on signal handling 😅

Great fix though. LGTM

@wallashss wallashss enabled auto-merge (squash) June 3, 2025 20:06
@joerunde
Copy link
Collaborator

joerunde commented Jun 3, 2025

Ah, yeah TIL about the C++ vs Python signal handling 🤦

LGTM!

@github-actions github-actions bot added the ready label Jun 3, 2025
@wallashss wallashss merged commit 2d2bae2 into main Jun 3, 2025
22 checks passed
@wallashss wallashss deleted the wallas-graceful-shutdown branch June 3, 2025 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants