-
Notifications
You must be signed in to change notification settings - Fork 183
[9.0] Subprocess: use psutil for killing processes #8359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: integration
Are you sure you want to change the base?
Conversation
c5358e5 to
c6b2e6a
Compare
| process.send_signal(signal.SIGTERM) | ||
| except psutil.NoSuchProcess: | ||
| return ([], []) | ||
| return psutil.wait_procs([process]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't there a risk of waiting forever here?
Especially knowing that we had issue to finish the processes before:
From what I understand, Subprocess.killChild() logic fails to close the stdout/stderr pipes, so the ExecutionThread never returns and the JobWrapper spins forever.`
According to the documentation, setting a timeout could prevent that:
This function will return as soon as all processes terminate or when timeout (seconds) occurs. Differently from Process.wait() it will not raise TimeoutExpired if timeout occurs.`
https://psutil.readthedocs.io/en/latest/index.html#psutil.wait_procs
Could you try to add a timeout and see if it changes the behavior of the JobWrapper test for instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you introduce the timeout, it means that you can potentially have some alive threads, and here you would need to kill them for good.
| return psutil.wait_procs([process]) | |
| terminated_gone, terminated_alive = psutil.wait_procs([process], timeout=timeout) | |
| if terminated_alive: | |
| self.log.info(f"Process {process.pid} did not terminate after SIGTERM, escalating to SIGKILL") | |
| for p in terminated_alive: | |
| try: | |
| p.kill() | |
| except psutil.NoSuchProcess: | |
| pass | |
| killed_gone, killed_alive = psutil.wait_procs(terminated_alive, timeout=5) | |
| gone.extend(killed_gone) | |
| return (gone, killed_alive) | |
| return (gone, alive) |
You could potentially print additional details if a terminated + killed process is still alive after the timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then it looks like we are kind of back to what we had here actually: https://github.com/aldbr/DIRAC/blob/940f06c8dad164a50e12988a94ddd49e67d51aa2/src/DIRAC/Core/Utilities/Subprocess.py#L266-L295
May be we should print additional details to understand why certain processes are not killed as they should?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Actually the issue was likely caused by the first if condition, which is supposed to be True if we use start_new_session=True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just did these changes but the JobWrapper test did not change.
c6b2e6a to
8414d02
Compare
ec23535 to
72f644d
Compare
| for p in children: | ||
| g, a = self.__terminateProcess(p) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually here you could directly call psutil.wait_procs as it was done before, no?
This is shown in the psutil documentation (https://psutil.readthedocs.io/en/latest/index.html#psutil.wait_procs) and this was done like this before (I think the else block was correct but it was just not used by the JobWrapper.
72f644d to
82e0985
Compare
This is a rewrite of few functions in subprocess that need very careful evaluation.
For reasons not understood (at least not by me) the logic of killing a pid and its sub-processes is not completing (not always, at least). I blame the very old logic used there, but of course I might be wrong. In any case this is a rewriting using
psutil.I had to change one test in Test_JobWrapper. I have not fully understood why, but I do not see how the previous code could work. cc @aldbr
BEGINRELEASENOTES
CHANGE: Subprocess: use psutil for killing processes
ENDRELEASENOTES