-
-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JENKINS-73824] Wait for Pipeline builds to complete before allowing their jobs to be deleted #9790
Conversation
… to add a regression test in core
Moving back to draft until I have time to check 8cebb0f more thoroughly. |
// I don't know why, but we have to keep interrupting | ||
entry.getKey().interrupt(Result.ABORTED); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this after some discussion in jenkinsci/workflow-job-plugin#468 (comment). The fact that we were repeatedly interrupting things here was a bit unusual. Other things in Jenkins that interrupt builds just call interrupt
a single time.
The extra interruptions were added in 047e849, but the test in that commit passes even without the extra interruptions. For Pipelines, which are the only jobs I know of where interrupt
may not kill the job promptly, repeatedly calling interrupt
is generally not going to help (barring things like poorly written try
/catch
blocks). You have to move to more severe methods such as doTerm
and doKill
(perhaps we should consider making Pipeline do this on its own) to guarantee interruption. Now that Pipelines are handled correctly and we don't always remove them from the iterator on the first iteration, the repeated interruptions every 50ms could produce a large number of timer tasks here for Pipelines that take a while to complete, whereas previously a single Pipeline would have been interrupted at most twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Pipelines, the main issue is that users are very easily able to inadvertently swallow interruptions just by adding a try
/catch
block to their Pipeline script. FlowInterruptedException.actualInterruption
makes it possible to avoid swallowing interruptions in trusted scripts and from Pipeline steps, but I would guess there are many steps that do not handle this correctly, and sandboxed scripts do not have access to this field by default.
For non-Pipelines, poorly written Builder
and Publisher
implementations may also swallow interruptions, but I have no idea how common this is.
What do other people think about this? I see 3 main options:
- Interrupt builds only once (as in this PR now) - if the build doesn't respond to interruption, someone will have to kill it some other way or retry deletion until the build finally dies. This is how interruption works if you try to cancel a build in the Jenkins UI.
- Interrupt builds repeatedly in a loop every 50ms until they die (as before this PR) - Best chance of interrupting the build, but is not aligned with other interruption mechanisms in Jenkins, and is still not guaranteed to cancel builds in all cases. In combination with the main fix here, this can lead to a large number of timer tasks being created when interrupting Pipeline builds, which seems undesirable. For example, if a build takes the full 15s to die, at 50ms per interrupt we will try to interrupt the build 300 times.
- Interrupt builds repeatedly, but at a much lower frequency, say once per second. This would make this code able to cancel builds that swallow a single interruption due to something simple like a
try
/catch
block in a Pipeline, but would avoid overloading the Pipeline interruption mechanism.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First option looks fine to me. Users will try to interrupt if they see nothing happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first option seems appropriate for purposes of this PR; probably WorkflowRun
(and/or CpsFlowExecution
) should be independently improved to enforce an exit from an actualInterruption
in a timely manner.
This PR is now ready for merge. We will merge it after approximately 24 hours if there is no negative feedback. /label ready-for-merge |
…their jobs to be deleted (jenkinsci#9790) * Wait for Pipelines to complete before allowing their jobs to be deleted * Create mock Job/Run classes that use AsynchronousExecution to be able to add a regression test in core * [JENKINS-73824] Do not repeatedly interrupt executables in ItemDeletion.cancelBuildsInProgress * [JENKINS-73824] Delete ItemDeletionTest based on jenkinsci#9790 (comment) (cherry picked from commit 4d7b993)
See JENKINS-73824. This might qualify as a major bug, although it has existed for a long time and is not a regression as far as I can see.
While investigating an issue with branch indexing for multibranch projects leaving build directories behind when deleting projects, I think I found a rather severe issue with job deletion in general for Pipelines. Since its introduction in #2789, the logic for cancelling ongoing builds and waiting for them to complete when deleting their parent job has checked
Thread.isAlive
rather thanExecutor.isActive
, which is not correct for asynchronous tasks such as the main Pipeline execution. See the Javadoc here. Oleg actually suggested changing toisActive
in the original PR here for other reasons.The result is that although ongoing Pipeline builds are interrupted,
ItemDeletion.cancelBuildsInProgress
does not wait for those builds to fully complete, which can at least lead to files being written back into the job directory that was just deleted. There are probably other more exotic issues possible for Pipelines that do not shut down quickly when interrupted.Testing done
Tested against jenkinsci/workflow-job-plugin#468 to check that it fixes the issue. I added a test here in 3e27ac6, but it required a lot of boilerplate and was not particularly realistic, so I removed it after discussion in #9790 (comment).
Proposed changelog entries
Proposed upgrade guidelines
N/A
Submitter checklist
Desired reviewers
@mention
Before the changes are marked as
ready-for-merge
:Maintainer checklist