SP reset reporting needs work #1978

Open
cbiffle opened this issue Jan 16, 2025 · 0 comments

In attempting to use the SP's "reset reason" feature in production for the first time, we've noticed that it has... issues.

  1. Because of the RoT's control of the reset line, the reset reason is almost always "Pin" in practice.
  2. This is because the reset-reason report collapses several potential sources into a single value, when it should really be a set of reasons.
  3. We also clear the reset reason immediately on boot, which means successive resets lose data. The hardware appears to accumulate reset reasons across reboots, so if we get a series of fast reboots we can learn at least some information about all of them. We probably want to make the clearing conditional or delayed.
  4. There is no way to actually ask a production image for the reset reason. It's only exposed in an unused Idol call, so people have been getting it through humility ipc ... which of course is disabled in prod.
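
To make items 2 and 3 concrete, here is a minimal host-side sketch. It is not the actual SP code: the flag names, bit positions, and function names are all hypothetical placeholders. It shows how a "first match wins" decode hides every flag after the first, and how clearing on every boot discards what a sticky flag register would otherwise accumulate.

```rust
/// Hypothetical sticky reset-flag bits. The positions are placeholders for
/// illustration, not values from any datasheet.
pub const FLAG_PIN: u32 = 1 << 0;
pub const FLAG_POWER_ON: u32 = 1 << 1;
pub const FLAG_SOFTWARE: u32 = 1 << 2;
pub const FLAG_WATCHDOG: u32 = 1 << 3;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ResetReason {
    Pin,
    PowerOn,
    Software,
    Watchdog,
    Unknown,
}

/// Single-valued report: the first matching flag wins, the rest are dropped.
pub fn collapse(raw: u32) -> ResetReason {
    if raw & FLAG_PIN != 0 {
        ResetReason::Pin
    } else if raw & FLAG_POWER_ON != 0 {
        ResetReason::PowerOn
    } else if raw & FLAG_SOFTWARE != 0 {
        ResetReason::Software
    } else if raw & FLAG_WATCHDOG != 0 {
        ResetReason::Watchdog
    } else {
        ResetReason::Unknown
    }
}

/// Set-valued report: every flag that is set is returned.
pub fn decode_all(raw: u32) -> Vec<ResetReason> {
    let table = [
        (FLAG_PIN, ResetReason::Pin),
        (FLAG_POWER_ON, ResetReason::PowerOn),
        (FLAG_SOFTWARE, ResetReason::Software),
        (FLAG_WATCHDOG, ResetReason::Watchdog),
    ];
    table
        .iter()
        .filter(|(mask, _)| raw & mask != 0)
        .map(|&(_, reason)| reason)
        .collect()
}

/// Models a sticky flag register across a series of resets.
/// `clear_on_boot` mimics clearing the flags immediately at startup.
pub fn flags_after(resets: &[u32], clear_on_boot: bool) -> u32 {
    let mut reg = 0u32;
    for &flags in resets {
        if clear_on_boot {
            reg = 0; // the previous boot already wiped its record
        }
        reg |= flags; // hardware ORs in the flags for this reset
    }
    reg
}
```

With `FLAG_PIN | FLAG_WATCHDOG` set, `collapse` reports only `Pin` while `decode_all` reports both. Likewise, `flags_after(&[FLAG_PIN | FLAG_WATCHDOG, FLAG_PIN], true)` leaves only the pin flag, whereas passing `false` keeps the watchdog flag from the earlier reset.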

My current perspective is that we should do the following:

  • Always report the raw hardware reset bits, alongside a bitset of platform-independent interpreted reasons. This way a curious engineer with a datasheet can map them back to hardware behaviors.
  • Only clear the reset reason when we have reason to believe we're stable. The hackiest way of doing this would be to clear it on a timer, say 60 seconds after boot. A better way would be having the control plane collect and clear the data.
  • Speaking of which, we need a way of getting that data out over the network. Could be hacked into something like gimlet-inspector for now but ought to be available in a more standardized form. Perhaps an ereport?
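
One possible shape for the first two bullets, as a rough sketch rather than a proposed API: every type, constant, and function name below is hypothetical, and the raw bit positions are placeholders for an imaginary platform. The report carries the untouched raw bits next to an interpreted bitset, and the decision to clear is a separate policy (a timer, or an explicit "collected" signal).

```rust
/// Platform-independent interpreted reasons, stored as a bitset.
/// All names and bit assignments here are hypothetical.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Default)]
pub struct ReasonSet(pub u8);

impl ReasonSet {
    pub const PIN: ReasonSet = ReasonSet(1 << 0);
    pub const POWER_ON: ReasonSet = ReasonSet(1 << 1);
    pub const SOFTWARE: ReasonSet = ReasonSet(1 << 2);
    pub const WATCHDOG: ReasonSet = ReasonSet(1 << 3);
    /// Some raw bit was set that this decoder does not recognize.
    pub const OTHER: ReasonSet = ReasonSet(1 << 7);

    pub fn contains(self, other: ReasonSet) -> bool {
        self.0 & other.0 == other.0
    }

    pub fn insert(&mut self, other: ReasonSet) {
        self.0 |= other.0;
    }
}

/// Placeholder raw flag masks for one imaginary platform.
pub const RAW_PIN: u32 = 1 << 4;
pub const RAW_POWER_ON: u32 = 1 << 5;
pub const RAW_SOFTWARE: u32 = 1 << 6;
pub const RAW_WATCHDOG: u32 = 1 << 7;

/// Raw hardware bits reported alongside the interpreted set.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ResetReport {
    pub raw: u32,
    pub reasons: ReasonSet,
}

/// Decodes every known flag; unknown bits are flagged as OTHER, and the
/// raw value is passed through unchanged for datasheet lookup.
pub fn interpret(raw: u32) -> ResetReport {
    let table = [
        (RAW_PIN, ReasonSet::PIN),
        (RAW_POWER_ON, ReasonSet::POWER_ON),
        (RAW_SOFTWARE, ReasonSet::SOFTWARE),
        (RAW_WATCHDOG, ReasonSet::WATCHDOG),
    ];
    let mut reasons = ReasonSet::default();
    let mut known = 0;
    for (mask, reason) in table {
        known |= mask;
        if raw & mask != 0 {
            reasons.insert(reason);
        }
    }
    if raw & !known != 0 {
        reasons.insert(ReasonSet::OTHER);
    }
    ResetReport { raw, reasons }
}

/// When the stored reset flags may be cleared.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ClearPolicy {
    /// Clear once the system has been up for at least this many seconds.
    AfterUptime { secs: u64 },
    /// Clear only after an external collector has read the report.
    WhenCollected,
}

pub fn should_clear(policy: ClearPolicy, uptime_secs: u64, collected: bool) -> bool {
    match policy {
        ClearPolicy::AfterUptime { secs } => uptime_secs >= secs,
        ClearPolicy::WhenCollected => collected,
    }
}
```

For example, `interpret(RAW_PIN | RAW_WATCHDOG)` yields a report whose `reasons` contains both `PIN` and `WATCHDOG` and whose `raw` field is the original value. Keeping the clear decision in `should_clear` means the 60-second hack and the collect-then-clear approach differ only in which `ClearPolicy` is selected.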