-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
cgroups: reapp child processes before destroying cgroup #13135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
killing is enough to destroy a cgroup, we need to remove the wait state of zombie processes before we can destroy the cgroup.
Tested this on a busy CI machine a bit and didn't any breakage yet that we saw before in nix-community and NixOS infra. |
@@ -101,6 +102,11 @@ static CgroupStats destroyCgroup(const std::filesystem::path & cgroup, bool retu | |||
// FIXME: pid wraparound | |||
if (kill(pid, SIGKILL) == -1 && errno != ESRCH) | |||
throw SysError("killing member %d of cgroup '%s'", pid, cgroup); | |||
|
|||
while (waitpid(pid, nullptr, 0) == -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it matters, but I think this only works for processes that are direct children of the current process? For others it will return ECHILD
according to the manpage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also seems like it doesn't work correctly because there some other code that uses waitpid and crashes if there is no wait result.
My current approach causes this crash:
|
Not sure I understand the problem. Zombie processes should be reaped by |
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/2025-05-07-nix-team-meeting-minutes-224/63973/1 |
Are they not parent of nix-daemon still? At least I am able to call waitpid() on some of them. |
Woudn't be the right approach to wait for processes with parents in the cgroup first? Of course before doing this The question is whether it's guaranteed that other processes will wait for the killed processes with parents not in the cgroup immediately. Maybe waiting for these processed should be tried after some timeout. |
It probably improve but it also looks like some of the parents are actually other nix-daemon processes, at least looking at the stacktrace. I will try to add more logging to understand better what is going on here. |
Do you know whether writeFile(killFile, "1"); is executed? |
Could you explain why it is problematic to kill processes out of order? It is not obvious to me when we are going to clear the whole process tree anyhow, why need to follow a certain order. |
The order of killing and waiting matters. Let's assume a child is killed while its parent is still running, which can happen when |
killing is enough to destroy a cgroup, we need to remove the wait state of zombie processes before we can destroy the cgroup.
Motivation
Context
Add 👍 to pull requests you find important.
The Nix maintainer team uses a GitHub project board to schedule and track reviews.