metrics: add tini to prevent zombie processes in metrics-exporter #3378

weirdwiz · 2025-07-18T07:06:56Z

The ocs-metrics-exporter creates zombie processes when executing ceph and rbd commands to collect metrics. While we use CombinedOutput() which should wait for process completion, zombies are still accumulating in some customer environments. Adding tini as a minimal init system (via the ENTRYPOINT) ensures all zombie processes are properly reaped regardless of how they're created.

https://issues.redhat.com/browse/DFBUGS-2875

openshift-ci · 2025-07-18T07:07:05Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: weirdwiz
Once this PR has been reviewed and has the lgtm label, please assign obnoxxx for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

weirdwiz · 2025-07-18T07:08:06Z

ptal @umangachapagain @aruniiird

weirdwiz · 2025-07-18T07:08:52Z

docker's --init flag also uses tini as an init process to handle zombies

The metrics-exporter spawns ceph and rbd commands to collect metrics. Without a proper init system, these child processes can become zombies in certain container environments. Adding tini as PID 1 ensures all zombie processes are properly reaped. Signed-off-by: Divyansh Kamboj <[email protected]>

weirdwiz · 2025-07-18T07:58:01Z

/cherry-pick release-4.18

openshift-cherrypick-robot · 2025-07-18T07:58:05Z

@weirdwiz: once the present PR merges, I will cherry-pick it on top of release-4.18 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.18 release-4.19 release-4.17 release-4.16

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

weirdwiz · 2025-07-18T07:58:28Z

/cherry-pick release-4.19

openshift-cherrypick-robot · 2025-07-18T07:58:31Z

@weirdwiz: once the present PR merges, I will cherry-pick it on top of release-4.19 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

weirdwiz · 2025-07-18T07:58:48Z

/cherry-pick release-4.16
/cherry-pick release-4.17

openshift-cherrypick-robot · 2025-07-18T07:58:51Z

@weirdwiz: once the present PR merges, I will cherry-pick it on top of release-4.16, release-4.17 in new PRs and assign them to you.

In response to this:

/cherry-pick release-4.16
/cherry-pick release-4.17

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

idryomov

While we use CombinedOutput() which should wait for process completion, zombies are still accumulating in some customer environments.

Do you have any details as to what those customer environments look like? I noticed that metrics-exporter sets a 30 second timeout for all execCommand() invocations -- is there evidence of those timeouts expiring?

func execCommand(command string, args []string, timeout int) ([]byte, error) {
	var cancel context.CancelFunc
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeout)*time.Second)
	defer cancel()
	cmd := exec.CommandContext(ctx, command, args...)

	output, err := cmd.CombinedOutput()
        ...

I'm asking because when WithTimeout is applied to CommandContext, SIGKILL is sent only to that particular process. If command happens to spawn child processes, they normally remain unaffected.

BlaineEXE · 2025-07-21T17:14:55Z

metrics/Dockerfile

+RUN dnf install -y tini && \
+    dnf clean all


The ceph containers built by RHCS don't have dnf available. They use microdnf instead, and tini isn't an installable package. This might break when using RHCS images unless the build team has a different dockerfile for this

As an alternative, you could consider installing tini-static (tini built as a static binary) in a new builder stage and then copying /usr/bin/tini-static to the final container from the build stage. That still would have some challenges coordinating with the build team tho

Before going through the motions with tini, I'd suggest taking a harder look at whether there is a need to have a process reaper in place at all. If the issue occurs only when the timeout expires, it would likely be easier to fix the respective code in execCommand().

leelavg · 2025-07-23T06:08:27Z

@weirdwiz how did you reproduce this issue? If the subprocess is becoming zombie and you are using tini only to reap them when they get re-parented, it maybe possible to just do cmd.Wait() on it (we can also disable sigchld but it may effect our whole parent process)?

As Ilya mentioned, if the subprocess is re-forking you may end up w/ orphan's. Not sure how this can be done in Go, usual way would be during exec do a setsid and kill the group (ie, negative pid) and wait, when done/timedout which will result in no zombie and no orphan.

So, if you can provide the steps on how you reproduced the issue & the leaks observed, it'll be very helpful.

thanks!

weirdwiz · 2025-07-24T13:31:50Z

@weirdwiz how did you reproduce this issue? If the subprocess is becoming zombie and you are using tini only to reap them when they get re-parented, it maybe possible to just do cmd.Wait() on it (we can also disable sigchld but it may effect our whole parent process)?

I was not able to reproduce, it completely in a dev env. And tested out the reap with manually creating zombies.

I also suspect like @idryomov that sub process is re-forking and we're exiting out, but there was not enough hard evidence in the logs (the process exited out with -1), so I thought that the safest way to make sure we don't run into it is to setup tini

As Ilya mentioned, if the subprocess is re-forking you may end up w/ orphan's. Not sure how this can be done in Go, usual way would be during exec do a setsid and kill the group (ie, negative pid) and wait, when done/timedout which will result in no zombie and no orphan.

i think we can set cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true} to also send SIGKILL to the reforked processes

weirdwiz · 2025-09-17T08:40:28Z

tested out reproducing the issue with adding delays in the network, also forcing issues with credentials (forcing -1 exit code)
but haven't seen defunct processes created in the system,
tini will clean up zombies, that happen to show up

@leelavg @idryomov where do you think we should proceed with this issue?

idryomov · 2025-09-17T10:19:13Z

@weirdwiz Have you reproduced the timeout expiry scenario specifically?

weirdwiz · 2025-09-18T07:35:36Z

@weirdwiz Have you reproduced the timeout expiry scenario specifically?

yes, i've reproduced that in a cluster, but that didn't lead to zombies getting created

weirdwiz force-pushed the defunct branch from dac28d1 to 2413509 Compare July 18, 2025 07:53

weirdwiz force-pushed the defunct branch from 2413509 to 10b8834 Compare July 18, 2025 07:54

idryomov reviewed Jul 21, 2025

View reviewed changes

BlaineEXE reviewed Jul 21, 2025

View reviewed changes

metrics: add tini to prevent zombie processes in metrics-exporter #3378

Are you sure you want to change the base?

metrics: add tini to prevent zombie processes in metrics-exporter #3378

Uh oh!

Conversation

weirdwiz commented Jul 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci bot commented Jul 18, 2025

Uh oh!

weirdwiz commented Jul 18, 2025

Uh oh!

weirdwiz commented Jul 18, 2025

Uh oh!

weirdwiz commented Jul 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-cherrypick-robot commented Jul 18, 2025

Uh oh!

weirdwiz commented Jul 18, 2025

Uh oh!

openshift-cherrypick-robot commented Jul 18, 2025

Uh oh!

weirdwiz commented Jul 18, 2025

Uh oh!

openshift-cherrypick-robot commented Jul 18, 2025

Uh oh!

idryomov left a comment

Choose a reason for hiding this comment

Uh oh!

BlaineEXE Jul 21, 2025

Choose a reason for hiding this comment

Uh oh!

BlaineEXE Jul 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

idryomov Jul 21, 2025

Choose a reason for hiding this comment

Uh oh!

leelavg commented Jul 23, 2025

Uh oh!

weirdwiz commented Jul 24, 2025

Uh oh!

weirdwiz commented Sep 17, 2025

Uh oh!

idryomov commented Sep 17, 2025

Uh oh!

weirdwiz commented Sep 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

weirdwiz commented Jul 18, 2025 •

edited

Loading

weirdwiz commented Jul 18, 2025 •

edited

Loading

BlaineEXE Jul 21, 2025 •

edited

Loading