-
Notifications
You must be signed in to change notification settings - Fork 207
cmd: harden main lifecycle and panic handling #1954
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
cmd: harden main lifecycle and panic handling #1954
Conversation
Refactors `cmd/main.go` to use the main delegate pattern to ensure proper application lifecycle management. Changes: - Delegates logic to a `run()` function to ensure `defer` statements execute before `os.Exit` is called. - Adds top-level panic recovery to capture and log stack traces during initialization, aiding in debugging crash loops. - Initializes a bootstrap logger to capture startup errors before the controller runtime manager is fully configured.
✅ Deploy Preview for gateway-api-inference-extension ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: LukeAVanDrie The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Hi @LukeAVanDrie. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/ok-to-test |
|
main.go was kept very "skimpy" on purpose, pushing almost everything into runner.go. This makes it more consumable by downstream projects, such as llm-d-inference-scheduler. If this functionality is important, it should be moved into runner/runner.go. Make runner.Run() invoke a helper that does most of the work here. Also note that the first thing runner.run() does is create a logger to be used during the setup of the system. |
| ctrl.SetLogger(bootstrapLog) | ||
|
|
||
| // Panic Recovery: This catches panics on the main goroutine during initialization. | ||
| // Note: It will NOT catch panics in child goroutines spawned by the Manager. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Catching only some of the Panics is not very good. I think it should be all or nothing.
GIE code should not call Panic(). Errors should be returned as errors and not via calls to Panic().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Catching only some of the Panics is not very good. I think it should be all or nothing.
Capturing all runtime panics (from child goroutines) is the ideal end state. I was trying to keep this PR small in scope and focused solely on the boostrap failure modes.
GIE code should not call Panic(). Errors should be returned as errors and not via calls to Panic().
I agree that we should never use panic() for control flow or error handling. However, this recover block isn't intended to handle intentional panics. It is a safety net for unintentional bugs (e.g., nil pointer dereferences) during the bootstrap/initialization phase (and runtime if we address the point above).
Currently, if a panic occurs, the pod exits silently, making CrashLoopBackOff very difficult to debug. This change ensures we catch those failures and print a structured stack trace.
I personally have found this useful for debugging custom local builds when I am testing new features before sending out PRs. Ideally we never hit this case in production.
Good point! I will update the PR to move this hardening logic into a wrapper function inside runner so that consumers get the safety benefits without duplicating code. If you agree with the justification for this change, I am also happy to expand the scope of this PR to cover child goroutine coverage. Else, if you think this isn't warranted, I can drop this and just keep it in my own fork for debugging purposes. |
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This PR hardens the
cmd/main.goentrypoint by refactoring the application lifecycle to use a dedicatedrun()function.Reasoning:
os.Exitimmediately terminates the program, preventing pendingdeferstatements from executing. By delegating logic torun()and checking the error inmain, we ensure the stack unwinds and cleanups happen before the exit code is set.recover()handler. If the application panics during initialization (e.g., bad config, nil pointer), this ensures the stack trace is logged as a structured error rather than printing to stderr and exiting silently. This is critical for debugging K8s CrashLoopBackOff states.Which issue(s) this PR fixes:
N/A
Does this PR introduce a user-facing change?: