Skip to content

Conversation

@laurazard
Copy link
Member

@laurazard laurazard commented Oct 20, 2025

Pull Request

closes #11782

What? (description)

Add a forceful mode to the reboot sequence (talosctl reboot --mode force) that bypasses graceful userspace teardown.

Why? (reasoning)

In certain situations, Talos's shutdown/reboot sequence hangs while waiting for services/mounts to be gracefully stopped (see: #11775).


I left a few questions as TODOs, PTAL at them if you review! Mainly I'm wondering about *machine.RebootRequest vs. runtime.XxxOptions, and whether to error out if we don't get a *machine.RebootRequest or just silently continue (which is fine for this, since any hypothetical older client wouldn't support force-reboot anyway, and the default is "graceful reboot".

Acceptance

Please use the following checklist:

  • you linked an issue (if applicable)
  • you included tests (if applicable)
  • you ran conformance (make conformance)
  • you formatted your code (make fmt)
  • you linted your code (make lint)
  • you generated documentation (make docs)
  • you ran unit-tests (make unit-tests)

@laurazard laurazard requested a review from smira October 20, 2025 12:39
@laurazard laurazard self-assigned this Oct 20, 2025
@github-project-automation github-project-automation bot moved this to To Do in Planning Oct 20, 2025
@talos-bot talos-bot moved this from To Do to In Review in Planning Oct 20, 2025
@laurazard laurazard force-pushed the force-reboot branch 5 times, most recently from 8e9d12f to 7e280cf Compare October 20, 2025 12:59
if msg == nil {
log.Printf("event not supported: %s", typeURL)
// We haven't implemented the handling of this event yet.
return nil, ErrEventNotSupported
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to log here, or can we just wrap the returned error with the typeURL?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was going to ask about this – I'm not sure about in general what we tend to do in talosctl, and (from what I saw) logging seems kind of inconsistent – I had to add this to even figure out what was going wrong because I was just getting very generic error output.

I'm happy to change the error returned, (after checking whether we're using ErrEventNotSupported isn't being used as a sentinel anywhere/compared to directly).

(As a sidenote, I wish we had a talosctl --debug ... and debug level logging.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels weird to wrap ErrEventNotSupported with the typeURL since the typeURL is a part of why that error happened, so I pushed a commit changing ErrEventNotSupported into a

type ErrEventNotSupported struct {
  TypeURL string
}
...

However, in practice we can't/shouldn't do this outside of a major version, since it changes a public API (hence me leaving it separate). In the meantime, we can either keep it as-is, or wrap it like suggested.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didn't realize having an extra commit meant CI would complain/tests won't run 😭 I'll remove it then and just skip it for now.

@github-project-automation github-project-automation bot moved this from In Review to Approved in Planning Oct 21, 2025
@laurazard laurazard force-pushed the force-reboot branch 2 times, most recently from 30094cd to 42ef9d9 Compare October 21, 2025 11:44
@smira
Copy link
Member

smira commented Oct 21, 2025

Looks like "regular" reboot is broken now (I was looking at integration-qemu logs), @laurazard please let me know if you need help to troubleshoot/read logs.

@laurazard laurazard force-pushed the force-reboot branch 3 times, most recently from f13d584 to 7a423e6 Compare October 28, 2025 10:34
smira and others added 2 commits October 28, 2025 14:43
It might restart kube-apiserver (due to cert change) at a random moment.

Signed-off-by: Andrey Smirnov <[email protected]>
In certain situations, Talos's shutdown/reboot sequence hangs while
waiting for services/mounts to be gracefully stopped (see:
siderolabs#11775).

This patch adds a forceful mode to the reboot sequence (`talosctl reboot
--mode force`) that bypasses graceful userspace teardown and hard
reboots the machine.

Signed-off-by: Laura Brehm <[email protected]>
`client.ErrEventNotSupported` was a simple sentinel with no information.

Replaced it with `client.EventNotSupportedError`, a struct implementing
error with the offending TypeURL included.

Signed-off-by: Laura Brehm <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Approved

Development

Successfully merging this pull request may close these issues.

support "panic" reboot mode

5 participants