metrics: add tini to prevent zombie processes in metrics-exporter #3378
base: main
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: weirdwiz. The full list of commands accepted by this bot can be found here.
docker's |
The metrics-exporter spawns ceph and rbd commands to collect metrics. Without a proper init system, these child processes can become zombies in certain container environments. Adding tini as PID 1 ensures all zombie processes are properly reaped.
Signed-off-by: Divyansh Kamboj <[email protected]>
/cherry-pick release-4.18
@weirdwiz: once the present PR merges, I will cherry-pick it on top of In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/cherry-pick release-4.19
@weirdwiz: once the present PR merges, I will cherry-pick it on top of In response to this:
/cherry-pick release-4.16
@weirdwiz: once the present PR merges, I will cherry-pick it on top of In response to this:
idryomov
left a comment
While we use CombinedOutput() which should wait for process completion, zombies are still accumulating in some customer environments.
Do you have any details as to what those customer environments look like? I noticed that metrics-exporter sets a 30 second timeout for all execCommand() invocations -- is there evidence of those timeouts expiring?
func execCommand(command string, args []string, timeout int) ([]byte, error) {
    var cancel context.CancelFunc
    ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeout)*time.Second)
    defer cancel()
    cmd := exec.CommandContext(ctx, command, args...)
    output, err := cmd.CombinedOutput()
    ...
I'm asking because when WithTimeout is applied to CommandContext, SIGKILL is sent only to that particular process. If command happens to spawn child processes, they normally remain unaffected.
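To illustrate the point about SIGKILL reaching only the direct child, here is a minimal sketch of how the timeout path could signal the whole process group instead. This is not the project's code: execCommandGroup is a hypothetical variant of execCommand, the 2 second WaitDelay is an arbitrary choice, and it assumes Linux and Go 1.20 or newer (for cmd.Cancel and cmd.WaitDelay).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// execCommandGroup is a hypothetical variant of execCommand that kills the
// child's entire process group when the timeout (in seconds) expires.
func execCommandGroup(command string, args []string, timeout int) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeout)*time.Second)
	defer cancel()

	cmd := exec.CommandContext(ctx, command, args...)
	// Start the child in a new process group whose ID equals the child's PID.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	// On timeout, signal the whole group (negative PID), not only the child.
	// Cancel is only invoked after Start succeeded, so cmd.Process is set.
	cmd.Cancel = func() error {
		return syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
	}
	// Stop waiting on inherited output pipes if a descendant left the group.
	cmd.WaitDelay = 2 * time.Second

	output, err := cmd.CombinedOutput()
	if err != nil && errors.Is(ctx.Err(), context.DeadlineExceeded) {
		return output, fmt.Errorf("%s timed out after %ds: %w", command, timeout, ctx.Err())
	}
	return output, err
}

func main() {
	// The shell forks a background sleep and waits on it, so killing only
	// the shell would leave the sleep running as an orphan.
	_, err := execCommandGroup("sh", []string{"-c", "sleep 30 & wait"}, 1)
	fmt.Println("error:", err)
}
```

A descendant that calls setsid or setpgid itself leaves the group and is not reached by this signal, so this narrows the problem rather than ruling it out.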
RUN dnf install -y tini && \
    dnf clean all
The ceph containers built by RHCS don't have dnf available. They use microdnf instead, and tini isn't an installable package. This might break when using RHCS images unless the build team has a different dockerfile for this
As an alternative, you could consider installing tini-static (tini built as a static binary) in a new builder stage and then copying /usr/bin/tini-static to the final container from the build stage. That still would have some challenges coordinating with the build team tho
Before going through the motions with tini, I'd suggest taking a harder look at whether there is a need to have a process reaper in place at all. If the issue occurs only when the timeout expires, it would likely be easier to fix the respective code in execCommand().
@weirdwiz how did you reproduce this issue? If the subprocess is becoming a zombie and you are using tini only to reap them when they get re-parented, it may be possible to just do
As Ilya mentioned, if the subprocess is re-forking you may end up w/ orphans. Not sure how this can be done in Go, usual way would be during exec do a
So, if you can provide the steps on how you reproduced the issue & the leaks observed, it'll be very helpful. thanks!
I was not able to reproduce it completely in a dev env, and tested the reaping by manually creating zombies. Like @idryomov, I also suspect that the subprocess is re-forking and we're exiting out, but there was not enough hard evidence in the logs (the process exited with -1), so I thought that the safest way to make sure we don't run into it is to setup
i think we can set
I tried reproducing the issue by adding delays in the network and by forcing issues with credentials (forcing a -1 exit code).
@leelavg @idryomov how do you think we should proceed with this issue?
@weirdwiz Have you reproduced the timeout expiry scenario specifically?
yes, i've reproduced that in a cluster, but that didn't lead to zombies getting created |
The ocs-metrics-exporter creates zombie processes when executing ceph and rbd commands to collect metrics. While we use CombinedOutput() which should wait for process completion, zombies are still accumulating in some customer environments. Adding tini as a minimal init system (via the ENTRYPOINT) ensures all zombie processes are properly reaped regardless of how they're created.
https://issues.redhat.com/browse/DFBUGS-2875
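For readers unfamiliar with what "reaping" involves, the following is a rough sketch of the wait-for-any-child loop that an init process runs. It is illustrative only, assuming Linux; it is not tini's implementation, and the helper names spawnUnwaited and reapAll are made up for this example.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"syscall"
)

// spawnUnwaited starts n short-lived children and deliberately never calls
// Wait on them, so each one stays a zombie after it exits.
func spawnUnwaited(n int) error {
	for i := 0; i < n; i++ {
		if err := exec.Command("true").Start(); err != nil {
			return err
		}
	}
	return nil
}

// reapAll blocks until every child of this process has exited and been
// collected, and returns how many children it reaped.
func reapAll() (int, error) {
	reaped := 0
	for {
		var status syscall.WaitStatus
		// A pid of -1 means "any child"; this is what clears zombies.
		pid, err := syscall.Wait4(-1, &status, 0, nil)
		switch {
		case errors.Is(err, syscall.EINTR):
			continue // interrupted by a signal, retry
		case errors.Is(err, syscall.ECHILD):
			return reaped, nil // no children left
		case err != nil:
			return reaped, err
		}
		if pid > 0 {
			reaped++
		}
	}
}

func main() {
	if err := spawnUnwaited(3); err != nil {
		fmt.Println("spawn failed:", err)
		return
	}
	n, err := reapAll()
	fmt.Println("reaped:", n, "error:", err)
}
```

Running such a loop inside the exporter itself would be risky, because a wait on any child can collect the exit status that an in-flight cmd.Wait or CombinedOutput call expects. That is one reason to keep the reaper in a separate PID 1 process.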