switch-to-configuration-ng: failed to restart sysinit-reactivation.target #378535

Closed
3 tasks done
diogotcorreia opened this issue Feb 1, 2025 · 6 comments · Fixed by #385997
Labels
0.kind: bug Something is broken

Comments

@diogotcorreia
Member

Nixpkgs version

  • Stable (24.11)

Describe the bug

I'm using system.autoUpgrade.enable with a flake to upgrade my systems every night. (I very recently switched to a custom module, but the problem happened before that too.) The failure only occurs on one specific system out of the 5 that have this module enabled. On that system, the upgrade unit sometimes (2-3 times a week) fails during activation because of a timeout while restarting sysinit-reactivation.target.

Here are the full logs for nixos-rebuild-switch-to-configuration.service and sysinit-reactivation.target:

Feb 01 04:08:42 hera systemd[1]: Starting /nix/store/4f8fpfwab9x8xcw1y1di7mmc6aszhxga-nixos-system-hera-24.11.20250126.4e96537/bin/switch-to-configuration switch...
Feb 01 04:08:42 hera systemd[1]: Started /nix/store/4f8fpfwab9x8xcw1y1di7mmc6aszhxga-nixos-system-hera-24.11.20250126.4e96537/bin/switch-to-configuration switch.
Feb 01 04:08:42 hera nixos-upgrade-start[628103]: Installing Lanzaboote to "/boot"...
Feb 01 04:08:42 hera nixos-upgrade-start[628103]: Collecting garbage...
Feb 01 04:08:42 hera nixos-upgrade-start[628103]: Successfully installed Lanzaboote.
Feb 01 04:08:43 hera nixos[628100]: switching to system configuration /nix/store/4f8fpfwab9x8xcw1y1di7mmc6aszhxga-nixos-system-hera-24.11.20250126.4e96537
Feb 01 04:08:43 hera nixos-upgrade-start[628100]: activating the configuration...
Feb 01 04:08:43 hera nixos-upgrade-start[628136]: [agenix] creating new generation in /run/agenix.d/6
Feb 01 04:08:43 hera nixos-upgrade-start[628136]: [agenix] decrypting secrets...
/// ...
Feb 01 04:08:43 hera nixos-upgrade-start[628136]: [agenix] symlinking new secrets to /run/agenix (generation 6)...
Feb 01 04:08:43 hera nixos-upgrade-start[628136]: [agenix] removing old secrets (generation 5)...
Feb 01 04:08:44 hera nixos-upgrade-start[628136]: [agenix] chowning...
Feb 01 04:08:44 hera nixos-upgrade-start[628136]: setting up /etc...
Feb 01 04:08:49 hera nixos-upgrade-start[628100]: restarting sysinit-reactivation.target
Feb 01 04:08:54 hera nixos-upgrade-start[628100]: Failed to restart sysinit-reactivation.target: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Feb 01 04:09:02 hera nixos-upgrade-start[628100]: the following new units were started: run-credentials-acme\x2dfirefly3\x2dcsv.hera.diogotc.com.service.mount
Feb 01 04:09:02 hera nixos[628100]: switching to system configuration /nix/store/4f8fpfwab9x8xcw1y1di7mmc6aszhxga-nixos-system-hera-24.11.20250126.4e96537 failed (status 4)
Feb 01 04:09:02 hera systemd[1]: nixos-rebuild-switch-to-configuration.service: Main process exited, code=exited, status=4/NOPERMISSION
Feb 01 04:09:02 hera systemd[1]: nixos-rebuild-switch-to-configuration.service: Failed with result 'exit-code'.
Feb 01 04:09:02 hera systemd[1]: nixos-rebuild-switch-to-configuration.service: Consumed 1.860s CPU time, 16.4M memory peak, 5.5M read from disk.
Feb 01 04:08:55 hera systemd[1]: Stopped target Reactivate sysinit units.
Feb 01 04:08:55 hera systemd[1]: Stopping Reactivate sysinit units...
Feb 01 04:08:55 hera systemd[1]: Reached target Reactivate sysinit units.

Looking at the timestamps, the activation script tries to restart sysinit-reactivation.target at 04:08:49, but systemd only restarts it at 04:08:55. It appears that systemd doesn't reply over dbus in time, so the call hits its 5-second timeout (the error is logged at 04:08:54).
My theory is that the system is under load and something stalls, so the restart takes too long, but I have no way to verify this at the moment.
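
To make the timing concrete, here is a minimal Rust sketch that does the arithmetic on the three timestamps taken from the logs above. The helper names (`parse_hms`, `elapsed_secs`) are made up for illustration and are not part of switch-to-configuration-ng.

```rust
/// Parse an "HH:MM:SS" journal timestamp into seconds since midnight.
/// Hypothetical helper, for illustration only.
fn parse_hms(ts: &str) -> Option<u32> {
    let mut parts = ts.split(':').map(|p| p.parse::<u32>().ok());
    let h = parts.next()??;
    let m = parts.next()??;
    let s = parts.next()??;
    // Reject extra fields and out-of-range values.
    if parts.next().is_some() || h > 23 || m > 59 || s > 59 {
        return None;
    }
    Some(h * 3600 + m * 60 + s)
}

/// Seconds elapsed between two timestamps on the same day.
/// Returns None if either timestamp is malformed or `to` is before `from`.
fn elapsed_secs(from: &str, to: &str) -> Option<u32> {
    parse_hms(to)?.checked_sub(parse_hms(from)?)
}

fn main() {
    // Timestamps copied from the journal excerpt above.
    let requested = "04:08:49"; // "restarting sysinit-reactivation.target"
    let call_failed = "04:08:54"; // "Failed to restart ...: Did not receive a reply"
    let restarted = "04:08:55"; // "Stopped target Reactivate sysinit units."

    println!(
        "request -> error:   {:?} s",
        elapsed_secs(requested, call_failed)
    );
    println!(
        "request -> restart: {:?} s",
        elapsed_secs(requested, restarted)
    );
}
```

The error shows up 5 seconds after the request, which matches the timeout, while the restart itself lands 6 seconds after the request, just past it.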

Steps to reproduce

Unfortunately I'm unsure how to reliably reproduce this. I'd appreciate some pointers on how to get some logs that could help figure out the root of the problem.

Expected behaviour

Activation should complete successfully

Additional context

Config: https://github.com/diogotcorreia/dotfiles/tree/c115fd5eb54875107079ffd343201d2945a8898c (host hera)

Maybe related to #313696

I'm going to try increasing the timeout from 5 seconds to 10 seconds and report back, but that might be a dirty workaround rather than a fix for the underlying issue.
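
As a rough illustration of why a longer timeout would hide the symptom, here is a small Rust simulation. It is only a sketch: it uses a plain channel from the standard library instead of dbus, `call_with_timeout` is a hypothetical stand-in rather than code from switch-to-configuration-ng, and the durations are scaled down by a factor of 10 (600 ms stands for the roughly 6 s that systemd took in the logs above).

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a blocking method call with a reply timeout.
/// The "remote side" answers after `reply_delay`; the caller gives up after `timeout`.
fn call_with_timeout(reply_delay: Duration, timeout: Duration) -> Result<&'static str, String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        thread::sleep(reply_delay);
        // The work finishes whether or not the caller is still waiting.
        // If the caller already gave up, the send fails and is ignored.
        let _ = tx.send("done");
    });
    rx.recv_timeout(timeout)
        .map_err(|e| format!("no reply: {}", e))
}

fn main() {
    // Scaled by 1/10: 600 ms ~ 6 s reply, 500 ms ~ 5 s timeout, 1000 ms ~ 10 s timeout.
    let reply_delay = Duration::from_millis(600);

    let short = call_with_timeout(reply_delay, Duration::from_millis(500));
    println!("short timeout: {:?}", short);

    let long = call_with_timeout(reply_delay, Duration::from_millis(1000));
    println!("long timeout:  {:?}", long);
}
```

With the short timeout the caller reports an error even though the simulated job still completes afterwards, which is the same pattern as in the logs. The longer timeout only helps as long as the reply arrives within it, so it does not explain why the reply is slow in the first place.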

System metadata

  • system: "x86_64-linux"
  • host os: Linux 6.6.72, NixOS, 24.11 (Vicuna), 24.11.20250126.4e96537
  • multi-user?: yes
  • sandbox: yes
  • version: nix-env (Nix) 2.24.11
  • nixpkgs: /nix/store/50yickar04m51aqnc43gxf45g2i0n3k9-source

Notify maintainers

@jmbaur


@diogotcorreia diogotcorreia added the 0.kind: bug Something is broken label Feb 1, 2025
diogotcorreia added a commit to diogotcorreia/dotfiles that referenced this issue Feb 1, 2025
@arianvp
Member

arianvp commented Feb 28, 2025

I'm running into the same issue every once in a while on our servers:

Failed to start remote-fs.target: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

Sometimes it's multi-user.target, sometimes remote-fs.target, sometimes something else.

@arianvp
Member

arianvp commented Feb 28, 2025

Ping @jmbaur. It seems OP worked around it by increasing the timeouts in the Rust implementation. Does that look sensible to you?

@diogotcorreia
Member Author

I forgot to report back, but I can confirm I haven't had this issue since I changed the timeout to 10s.

@arianvp
Member

arianvp commented Feb 28, 2025

Mind making a PR?

@jmbaur
Contributor

jmbaur commented Feb 28, 2025

@arianvp @diogotcorreia Seems logical to me. The original values just reflected what worked for most cases, starting with the switch-test NixOS VM test, so it's fine if we deviate from them to work in more scenarios. @diogotcorreia can you PR your change?

@diogotcorreia
Member Author

@arianvp @jmbaur Yeah, I can make one tomorrow!

@arianvp arianvp closed this as completed in 5cc9347 Mar 3, 2025
nixpkgs-ci bot pushed a commit that referenced this issue Mar 3, 2025
In certain cases, systemd might take more than 5 seconds to reply
through dbus, causing the switch to appear to fail even though it
succeeded. This commit increases the timeout to 10 seconds, which should
make it more reliable. Additionally, the timeout for the login dbus was
also increased for consistency.

Fix #378535

(cherry picked from commit 5cc9347)