Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly cancel repository downloads #23837

Conversation

moroten
Copy link
Contributor

@moroten moroten commented Oct 1, 2024

When cancelling an asynchronous repository download task, an interrupt signal is sent to the download thread. This doesn't mean that the download stops immediately. Avoid restarting a download until the previous download has actually stopped, so that the new download is able to clean old data without crashing (on Windows).

Fixes #21773

@github-actions github-actions bot added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. awaiting-review PR is awaiting review from an assigned reviewer labels Oct 1, 2024
@moroten
Copy link
Contributor Author

moroten commented Oct 1, 2024

The profile below shows how this fix prevents the pending download Future to return until the cancellation is done.
cancelling-download-profile

@fmeum
Copy link
Collaborator

fmeum commented Oct 2, 2024

#21773 (comment)

@fmeum fmeum self-requested a review October 2, 2024 13:17
@fmeum
Copy link
Collaborator

fmeum commented Oct 10, 2024

@moroten I sent meroton#1 since that's more efficient than suggesting changes inline. Let me know what you think. You may also have to rebase this PR onto master to get CI unstuck.

moroten and others added 2 commits October 10, 2024 11:29
When cancelling an asynchronous repository download task, an interrupt
signal is sent to the download thread. This doesn't mean that the
download stops immediately. Avoid restarting a download until the
previous download has actually stopped, so that the new download is able
to clean old data without crashing (on Windows).

Fixes bazelbuild#21773
@moroten moroten force-pushed the pr/properly-cancel-repository-downloads branch from b3d15df to 1b42adc Compare October 10, 2024 09:29
@moroten
Copy link
Contributor Author

moroten commented Oct 10, 2024

@moroten I sent meroton#1 since that's more efficient than suggesting changes inline. Let me know what you think. You may also have to rebase this PR onto master to get CI unstuck.

Thank you @fmeum. After rebasing (and fixing a double space in a comment for you), I ran the test on bb-storage again as in #21773 and can see the new "Cancelling download " profiling point appear, which is great.

This version haven't been tested in production yet. Let me know if you would like that before merging, it may take a week or two.

Copy link
Collaborator

@fmeum fmeum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Wyverald Could you review this as well?

@fmeum
Copy link
Collaborator

fmeum commented Oct 10, 2024

This version haven't been tested in production yet. Let me know if you would like that before merging, it may take a week or two.

That would probably mean missing the Bazel 8.0.0 and 7.4.0 releases. I'm not sure whether a confirmed race is better or worse than an unconfirmed fix.

@meteorcloudy What do you think?

@moroten
Copy link
Contributor Author

moroten commented Oct 10, 2024

The first commit has been verified in production, it's just the second commit from today that has not been tested on scale.

@fmeum
Copy link
Collaborator

fmeum commented Oct 10, 2024

The only meaningful difference is that I made the cleanup uninterruptible. Since it should only be interrupted when the user interrupts the build, it shouldn't make a big difference - it just requires the user to explicitly choose to kill the server in order to interrupt the cleanup.

@meteorcloudy
Copy link
Member

Hmm, I don't have a strong opinion, sounds like this could be a nice-to-have and could wait for 8.1?

Copy link
Member

@Wyverald Wyverald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! I think this is fine to cherry-pick into 8.0.0rc2 or so, but probably not going to make it in 7.4.0.

@@ -138,6 +140,8 @@ public Future<Path> startDownload(
eventHandler,
clientEnv,
context);
} finally {
doneSignal.countDown();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could lead to a deadlock if the future is never scheduled before being canceled (that is, after submit() is called and returns, but before the Callable gets called).

WorkerSkyKeyComputeState works around this by using a ListeningExecutorService and using addListener on the returned Future (https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/skyframe/WorkerSkyKeyComputeState.java;drc=23e5ec157d582585bb9635b5c374ec21cc50b4ea;l=173). let's do the same here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point, but I verified, by both code inspection and empirically, that the Future.addListener also triggers its runnable too early. Therefore, I had to add a semaphore to get the ordering right, which my small scale experiment confirms work. (Well, I could not trigger the case where the task never starts, but the bug this is aimed to solve is gone.)

return executorService.submit(
String context,
CountDownLatch doneSignal) {
Semaphore doneSignaller = new Semaphore(1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of combining a semaphore and a latch, maybe we could use a single shared Phaser instead? A download can register at the start of the lambda and arriveAndDeregister in the finally block. Then close can wait on arriveAndAwaitAdvance. You can search for Phaser in the codebase to find another very similar usage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how to implement it using Phaser. We need three phases: Init (0), started (1) and done (2). Calling awaitAdvance(1) to wait for the done phase is not possible while in the init phase. @fmeum What is your idea?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may very well be missing something, but why do we need to track the init phase at all with a Phaser? If the task has never been scheduled by the executor, we also don't need to cancel it. If it has started, it will either have registered with the Phaser (and will thus be awaited with awaitAndAdvance) or not since it hasn't even reached the register call (you can check the return value to avoid registrations happening after the advance).

Just to clarify: With the Phaser approach, addListener also shouldn't be needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated with Phaser, which is much better. Check the diff for the two last commits together. Tested on my small reproduction but not in production.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we could even use a single Phaser per context, see https://github.com/fmeum/bazel/tree/23837-phaser. It doesn't give use the opportunity the log the time taken to wait for a particular download cancellation though.

@fmeum
Copy link
Collaborator

fmeum commented Oct 11, 2024

@bazel-io fork 8.0.0

return executorService.submit(
() -> {
if (downloadPhaser.register() < 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (downloadPhaser.register() < 0) {
if (downloadPhaser.register() != 0) {

looks more obviously correct to me, although I am not 100% sure it even makes a difference.

}
try (SilentCloseable c =
Profiler.instance().profile("Cancelling download " + outputPath)) {
downloadPhaser.awaitAdvance(downloadPhaser.arriveAndDeregister());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
downloadPhaser.awaitAdvance(downloadPhaser.arriveAndDeregister());
downloadPhaser.arriveAndAwaitAdvance();

should suffice as we don't care about deregistering (there won't be another phase).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code was written to use Phaser termination as the signalling method. I've now changed to to use phase=0 as signalling method according to the suggestion. The JSON profiles look very nice in the small test case.

public void close() {
// Call register() to ensure arriveAndDeregister() will terminate the
// downloadPhaser if the download has not started yet.
if (downloadPhaser.register() < 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (downloadPhaser.register() < 0) {
if (downloadPhaser.register() != 0) {

@fmeum
Copy link
Collaborator

fmeum commented Oct 12, 2024

@Wyverald Now that we have a pretty clean fix, should we consider cherry-picking this to 7.4.0 as well? Looking at the comment in the 8.0.0 cherry-pick issue, this would not just result in loud failures on Windows, but also in silent rebuilds on other platforms.

@meteorcloudy meteorcloudy added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Oct 14, 2024
@github-actions github-actions bot removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally label Oct 15, 2024
@moroten moroten deleted the pr/properly-cancel-repository-downloads branch October 15, 2024 20:16
bazel-io pushed a commit to bazel-io/bazel that referenced this pull request Oct 15, 2024
When cancelling an asynchronous repository download task, an interrupt signal is sent to the download thread. This doesn't mean that the download stops immediately. Avoid restarting a download until the previous download has actually stopped, so that the new download is able to clean old data without crashing (on Windows).

Fixes bazelbuild#21773

Closes bazelbuild#23837.

PiperOrigin-RevId: 686175953
Change-Id: I8d75f905b739d38b6cb430d5b5e84fda9a2d14e3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multithreaded download_and_extract (http_archive) buggy on retry
4 participants