Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-73824] Wait for Pipeline builds to complete before allowing their jobs to be deleted #9790

Merged
merged 4 commits into from
Oct 6, 2024

Conversation

dwnusbaum
Copy link
Member

@dwnusbaum dwnusbaum commented Sep 26, 2024

See JENKINS-73824. This might qualify as a major bug, although it has existed for a long time and is not a regression as far as I can see.

While investigating an issue with branch indexing for multibranch projects leaving build directories behind when deleting projects, I think I found a rather severe issue with job deletion in general for Pipelines. Since its introduction in #2789, the logic for cancelling ongoing builds and waiting for them to complete when deleting their parent job has checked Thread.isAlive rather than Executor.isActive, which is not correct for asynchronous tasks such as the main Pipeline execution. See the Javadoc here. Oleg actually suggested changing to isActive in the original PR here for other reasons.

The result is that although ongoing Pipeline builds are interrupted, ItemDeletion.cancelBuildsInProgress does not wait for those builds to fully complete, which can at least lead to files being written back into the job directory that was just deleted. There are probably other more exotic issues possible for Pipelines that do not shut down quickly when interrupted.

Testing done

Tested against jenkinsci/workflow-job-plugin#468 to check that it fixes the issue. I added a test here in 3e27ac6, but it required a lot of boilerplate and was not particularly realistic, so I removed it after discussion in #9790 (comment).

Proposed changelog entries

  • Wait for ongoing Pipeline builds to fully complete before allowing their parent job to be deleted.

Proposed upgrade guidelines

N/A

Submitter checklist

Preview Give feedback

Desired reviewers

@mention

Before the changes are marked as ready-for-merge:

Maintainer checklist

Preview Give feedback

@dwnusbaum dwnusbaum changed the title Wait for Pipelines to complete before allowing their jobs to be deleted Wait for Pipeline builds to complete before allowing their jobs to be deleted Sep 26, 2024
@dwnusbaum dwnusbaum added the bug For changelog: Minor bug. Will be listed after features label Sep 26, 2024
@dwnusbaum dwnusbaum changed the title Wait for Pipeline builds to complete before allowing their jobs to be deleted [JENKINS-73824] Wait for Pipeline builds to complete before allowing their jobs to be deleted Sep 26, 2024
@dwnusbaum dwnusbaum marked this pull request as ready for review September 26, 2024 20:05
@dwnusbaum dwnusbaum requested a review from jglick September 26, 2024 20:05
@dwnusbaum
Copy link
Member Author

Moving back to draft until I have time to check 8cebb0f more thoroughly.

@dwnusbaum dwnusbaum marked this pull request as draft September 26, 2024 22:17
Comment on lines -273 to -274
// I don't know why, but we have to keep interrupting
entry.getKey().interrupt(Result.ABORTED);
Copy link
Member Author

@dwnusbaum dwnusbaum Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this after some discussion in jenkinsci/workflow-job-plugin#468 (comment). The fact that we were repeatedly interrupting things here was a bit unusual. Other things in Jenkins that interrupt builds just call interrupt a single time.

The extra interruptions were added in 047e849, but the test in that commit passes even without the extra interruptions. For Pipelines, which are the only jobs I know of where interrupt may not kill the job promptly, repeatedly calling interrupt is generally not going to help (barring things like poorly written try/catch blocks). You have to move to more severe methods such as doTerm and doKill (perhaps we should consider making Pipeline do this on its own) to guarantee interruption. Now that Pipelines are handled correctly and we don't always remove them from the iterator on the first iteration, the repeated interruptions every 50ms could produce a large number of timer tasks here for Pipelines that take a while to complete, whereas previously a single Pipeline would have been interrupted at most twice.

Copy link
Member Author

@dwnusbaum dwnusbaum Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Pipelines, the main issue is that users are very easily able to inadvertently swallow interruptions just by adding a try/catch block to their Pipeline script. FlowInterruptedException.actualInterruption makes it possible to avoid swallowing interruptions in trusted scripts and from Pipeline steps, but I would guess there are many steps that do not handle this correctly, and sandboxed scripts do not have access to this field by default.

For non-Pipelines, poorly written Builder and Publisher implementations may also swallow interruptions, but I have no idea how common this is.

What do other people think about this? I see 3 main options:

  • Interrupt builds only once (as in this PR now) - if the build doesn't respond to interruption, someone will have to kill it some other way or retry deletion until the build finally dies. This is how interruption works if you try to cancel a build in the Jenkins UI.
  • Interrupt builds repeatedly in a loop every 50ms until they die (as before this PR) - Best chance of interrupting the build, but is not aligned with other interruption mechanisms in Jenkins, and is still not guaranteed to cancel builds in all cases. In combination with the main fix here, this can lead to a large number of timer tasks being created when interrupting Pipeline builds, which seems undesirable. For example, if a build takes the full 15s to die, at 50ms per interrupt we will try to interrupt the build 300 times.
  • Interrupt builds repeatedly, but at a much lower frequency, say once per second. This would make this code able to cancel builds that swallow a single interruption due to something simple like a try/catch block in a Pipeline, but would avoid overloading the Pipeline interruption mechanism.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First option looks fine to me. Users will try to interrupt if they see nothing happens.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first option seems appropriate for purposes of this PR; probably WorkflowRun (and/or CpsFlowExecution) should be independently improved to enforce an exit from an actualInterruption in a timely manner.

@dwnusbaum dwnusbaum marked this pull request as ready for review September 27, 2024 20:24
@dwnusbaum dwnusbaum requested a review from jglick September 27, 2024 20:31
@MarkEWaite
Copy link
Contributor

This PR is now ready for merge. We will merge it after approximately 24 hours if there is no negative feedback.

/label ready-for-merge

@comment-ops-bot comment-ops-bot bot added the ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback label Oct 5, 2024
@MarkEWaite MarkEWaite merged commit 4d7b993 into jenkinsci:master Oct 6, 2024
16 checks passed
@dwnusbaum dwnusbaum deleted the pipeline-job-deletion branch October 8, 2024 13:25
MarkEWaite pushed a commit to MarkEWaite/jenkins that referenced this pull request Oct 12, 2024
…their jobs to be deleted (jenkinsci#9790)

* Wait for Pipelines to complete before allowing their jobs to be deleted

* Create mock Job/Run classes that use AsynchronousExecution to be able to add a regression test in core

* [JENKINS-73824] Do not repeatedly interrupt executables in ItemDeletion.cancelBuildsInProgress

* [JENKINS-73824] Delete ItemDeletionTest based on jenkinsci#9790 (comment)

(cherry picked from commit 4d7b993)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug For changelog: Minor bug. Will be listed after features ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants