Skip to content

Conversation

@LukeAVanDrie
Copy link
Contributor

What type of PR is this?
/kind cleanup

What this PR does / why we need it:
This PR hardens the cmd/main.go entrypoint by refactoring the application lifecycle to use a dedicated run() function.

Reasoning:

  1. Defer Execution: In Go, calling os.Exit immediately terminates the program, preventing pending defer statements from executing. By delegating logic to run() and checking the error in main, we ensure the stack unwinds and cleanups happen before the exit code is set.
  2. Panic Recovery: Adds a top-level recover() handler. If the application panics during initialization (e.g., bad config, nil pointer), this ensures the stack trace is logged as a structured error rather than printing to stderr and exiting silently. This is critical for debugging K8s CrashLoopBackOff states.
  3. Bootstrap Logging: Initializes a temporary logger to ensure that errors occurring before the Controller Manager starts are captured and formatted correctly.

Which issue(s) this PR fixes:
N/A

Does this PR introduce a user-facing change?:

NONE

Refactors `cmd/main.go` to use the main delegate pattern to ensure
proper application lifecycle management.

Changes:
- Delegates logic to a `run()` function to ensure `defer` statements
  execute before `os.Exit` is called.
- Adds top-level panic recovery to capture and log stack traces during
  initialization, aiding in debugging crash loops.
- Initializes a bootstrap logger to capture startup errors before the
  controller runtime manager is fully configured.
@k8s-ci-robot k8s-ci-robot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Dec 5, 2025
@netlify
Copy link

netlify bot commented Dec 5, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 4472afc
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/69326ead2768ee00085ff879
😎 Deploy Preview https://deploy-preview-1954--gateway-api-inference-extension.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LukeAVanDrie
Once this PR has been reviewed and has the lgtm label, please assign nirrozenbaum for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Contributor

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 5, 2025
@nirrozenbaum
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 5, 2025
@shmuelk
Copy link
Contributor

shmuelk commented Dec 11, 2025

main.go was kept very "skimpy" on purpose, pushing almost everything into runner.go. This makes it more consumable by downstream projects, such as llm-d-inference-scheduler.

If this functionality is important, it should be moved into runner/runner.go. Make runner.Run() invoke a helper that does most of the work here. Also note that the first thing runner.run() does is create a logger to be used during the setup of the system.

ctrl.SetLogger(bootstrapLog)

// Panic Recovery: This catches panics on the main goroutine during initialization.
// Note: It will NOT catch panics in child goroutines spawned by the Manager.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catching only some of the Panics is not very good. I think it should be all or nothing.

GIE code should not call Panic(). Errors should be returned as errors and not via calls to Panic().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catching only some of the Panics is not very good. I think it should be all or nothing.

Capturing all runtime panics (from child goroutines) is the ideal end state. I was trying to keep this PR small in scope and focused solely on the boostrap failure modes.

GIE code should not call Panic(). Errors should be returned as errors and not via calls to Panic().

I agree that we should never use panic() for control flow or error handling. However, this recover block isn't intended to handle intentional panics. It is a safety net for unintentional bugs (e.g., nil pointer dereferences) during the bootstrap/initialization phase (and runtime if we address the point above).

Currently, if a panic occurs, the pod exits silently, making CrashLoopBackOff very difficult to debug. This change ensures we catch those failures and print a structured stack trace.

I personally have found this useful for debugging custom local builds when I am testing new features before sending out PRs. Ideally we never hit this case in production.

@LukeAVanDrie
Copy link
Contributor Author

If this functionality is important, it should be moved into runner/runner.go. Make runner.Run() invoke a helper that does most of the work here. Also note that the first thing runner.run() does is create a logger to be used during the setup of the system.

Good point! I will update the PR to move this hardening logic into a wrapper function inside runner so that consumers get the safety benefits without duplicating code. If you agree with the justification for this change, I am also happy to expand the scope of this PR to cover child goroutine coverage.

Else, if you think this isn't warranted, I can drop this and just keep it in my own fork for debugging purposes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants