Skip to content

Conversation

@fstagni
Copy link
Contributor

@fstagni fstagni commented Oct 29, 2025

This is a rewrite of few functions in subprocess that need very careful evaluation.

For reasons not understood (at least not by me) the logic of killing a pid and its sub-processes is not completing (not always, at least). I blame the very old logic used there, but of course I might be wrong. In any case this is a rewriting using psutil.

I had to change one test in Test_JobWrapper. I have not fully understood why, but I do not see how the previous code could work. cc @aldbr

BEGINRELEASENOTES

CHANGE: Subprocess: use psutil for killing processes

ENDRELEASENOTES

process.send_signal(signal.SIGTERM)
except psutil.NoSuchProcess:
return ([], [])
return psutil.wait_procs([process])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there a risk of waiting forever here?
Especially knowing that we had issue to finish the processes before:

From what I understand, Subprocess.killChild() logic fails to close the stdout/stderr pipes, so the ExecutionThread never returns and the JobWrapper spins forever.`

#8179

According to the documentation, setting a timeout could prevent that:

This function will return as soon as all processes terminate or when timeout (seconds) occurs. Differently from Process.wait() it will not raise TimeoutExpired if timeout occurs.`

https://psutil.readthedocs.io/en/latest/index.html#psutil.wait_procs

Could you try to add a timeout and see if it changes the behavior of the JobWrapper test for instance?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you introduce the timeout, it means that you can potentially have some alive threads, and here you would need to kill them for good.

Suggested change
return psutil.wait_procs([process])
terminated_gone, terminated_alive = psutil.wait_procs([process], timeout=timeout)
if terminated_alive:
self.log.info(f"Process {process.pid} did not terminate after SIGTERM, escalating to SIGKILL")
for p in terminated_alive:
try:
p.kill()
except psutil.NoSuchProcess:
pass
killed_gone, killed_alive = psutil.wait_procs(terminated_alive, timeout=5)
gone.extend(killed_gone)
return (gone, killed_alive)
return (gone, alive)

You could potentially print additional details if a terminated + killed process is still alive after the timeout.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then it looks like we are kind of back to what we had here actually: https://github.com/aldbr/DIRAC/blob/940f06c8dad164a50e12988a94ddd49e67d51aa2/src/DIRAC/Core/Utilities/Subprocess.py#L266-L295

May be we should print additional details to understand why certain processes are not killed as they should?

Copy link
Contributor

@aldbr aldbr Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Actually the issue was likely caused by the first if condition, which is supposed to be True if we use start_new_session=True)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did these changes but the JobWrapper test did not change.

@fstagni fstagni force-pushed the 90_use_psutil branch 2 times, most recently from ec23535 to 72f644d Compare October 30, 2025 16:59
for p in children:
g, a = self.__terminateProcess(p)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually here you could directly call psutil.wait_procs as it was done before, no?
This is shown in the psutil documentation (https://psutil.readthedocs.io/en/latest/index.html#psutil.wait_procs) and this was done like this before (I think the else block was correct but it was just not used by the JobWrapper.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants