cmd: harden main lifecycle and panic handling #1954

LukeAVanDrie · 2025-12-05T05:33:30Z

What type of PR is this?
/kind cleanup

What this PR does / why we need it:
This PR hardens the cmd/main.go entrypoint by refactoring the application lifecycle to use a dedicated run() function.

Reasoning:

Defer Execution: In Go, calling os.Exit immediately terminates the program, preventing pending defer statements from executing. By delegating logic to run() and checking the error in main, we ensure the stack unwinds and cleanups happen before the exit code is set.
Panic Recovery: Adds a top-level recover() handler. If the application panics during initialization (e.g., bad config, nil pointer), this ensures the stack trace is logged as a structured error rather than printing to stderr and exiting silently. This is critical for debugging K8s CrashLoopBackOff states.
Bootstrap Logging: Initializes a temporary logger to ensure that errors occurring before the Controller Manager starts are captured and formatted correctly.

Which issue(s) this PR fixes:
N/A

Does this PR introduce a user-facing change?:

NONE

Refactors `cmd/main.go` to use the main delegate pattern to ensure proper application lifecycle management. Changes: - Delegates logic to a `run()` function to ensure `defer` statements execute before `os.Exit` is called. - Adds top-level panic recovery to capture and log stack traces during initialization, aiding in debugging crash loops. - Initializes a bootstrap logger to capture startup errors before the controller runtime manager is fully configured.

netlify · 2025-12-05T05:33:36Z

✅ Deploy Preview for gateway-api-inference-extension ready!

Name	Link
🔨 Latest commit	`4472afc`
🔍 Latest deploy log	https://app.netlify.com/projects/gateway-api-inference-extension/deploys/69326ead2768ee00085ff879
😎 Deploy Preview	https://deploy-preview-1954--gateway-api-inference-extension.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

k8s-ci-robot · 2025-12-05T05:33:37Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LukeAVanDrie
Once this PR has been reviewed and has the lgtm label, please assign nirrozenbaum for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2025-12-05T05:33:39Z

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

nirrozenbaum · 2025-12-05T07:07:57Z

/ok-to-test

shmuelk · 2025-12-11T16:15:08Z

main.go was kept very "skimpy" on purpose, pushing almost everything into runner.go. This makes it more consumable by downstream projects, such as llm-d-inference-scheduler.

If this functionality is important, it should be moved into runner/runner.go. Make runner.Run() invoke a helper that does most of the work here. Also note that the first thing runner.run() does is create a logger to be used during the setup of the system.

shmuelk · 2025-12-11T16:16:47Z

cmd/epp/main.go

+	ctrl.SetLogger(bootstrapLog)
+
+	// Panic Recovery: This catches panics on the main goroutine during initialization.
+	// Note: It will NOT catch panics in child goroutines spawned by the Manager.


Catching only some of the Panics is not very good. I think it should be all or nothing.

GIE code should not call Panic(). Errors should be returned as errors and not via calls to Panic().

Catching only some of the Panics is not very good. I think it should be all or nothing.

Capturing all runtime panics (from child goroutines) is the ideal end state. I was trying to keep this PR small in scope and focused solely on the boostrap failure modes.

GIE code should not call Panic(). Errors should be returned as errors and not via calls to Panic().

I agree that we should never use panic() for control flow or error handling. However, this recover block isn't intended to handle intentional panics. It is a safety net for unintentional bugs (e.g., nil pointer dereferences) during the bootstrap/initialization phase (and runtime if we address the point above).

Currently, if a panic occurs, the pod exits silently, making CrashLoopBackOff very difficult to debug. This change ensures we catch those failures and print a structured stack trace.

I personally have found this useful for debugging custom local builds when I am testing new features before sending out PRs. Ideally we never hit this case in production.

LukeAVanDrie · 2025-12-11T19:35:53Z

If this functionality is important, it should be moved into runner/runner.go. Make runner.Run() invoke a helper that does most of the work here. Also note that the first thing runner.run() does is create a logger to be used during the setup of the system.

Good point! I will update the PR to move this hardening logic into a wrapper function inside runner so that consumers get the safety benefits without duplicating code. If you agree with the justification for this change, I am also happy to expand the scope of this PR to cover child goroutine coverage.

Else, if you think this isn't warranted, I can drop this and just keep it in my own fork for debugging purposes.

k8s-ci-robot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Dec 5, 2025

k8s-ci-robot requested review from liu-cong and shmuelk December 5, 2025 05:33

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 5, 2025

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 5, 2025

shmuelk reviewed Dec 11, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

cmd: harden main lifecycle and panic handling #1954

cmd: harden main lifecycle and panic handling #1954

LukeAVanDrie commented Dec 5, 2025

Uh oh!

netlify bot commented Dec 5, 2025 •

edited

Loading

Uh oh!

k8s-ci-robot commented Dec 5, 2025

Uh oh!

k8s-ci-robot commented Dec 5, 2025

Uh oh!

nirrozenbaum commented Dec 5, 2025

Uh oh!

shmuelk commented Dec 11, 2025

Uh oh!

shmuelk Dec 11, 2025

Uh oh!

LukeAVanDrie Dec 11, 2025

Uh oh!

LukeAVanDrie commented Dec 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

cmd: harden main lifecycle and panic handling #1954

Are you sure you want to change the base?

cmd: harden main lifecycle and panic handling #1954

Conversation

LukeAVanDrie commented Dec 5, 2025

Uh oh!

netlify bot commented Dec 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for gateway-api-inference-extension ready!

Uh oh!

k8s-ci-robot commented Dec 5, 2025

Uh oh!

k8s-ci-robot commented Dec 5, 2025

Uh oh!

nirrozenbaum commented Dec 5, 2025

Uh oh!

shmuelk commented Dec 11, 2025

Uh oh!

shmuelk Dec 11, 2025

Choose a reason for hiding this comment

Uh oh!

LukeAVanDrie Dec 11, 2025

Choose a reason for hiding this comment

Uh oh!

LukeAVanDrie commented Dec 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

netlify bot commented Dec 5, 2025 •

edited

Loading