
bazel spends a lot of time uploading events/cache in CI #26

Open
facundominguez opened this issue Mar 24, 2022 · 11 comments
@facundominguez
Member

Here is an example job where bazel completes successfully each time, yet it spends a lot of time at the end talking to buildbuddy.

Ideally, we wouldn't spend so long. We need to either fix our buildbuddy configuration or switch to using caching from github actions.

@aherrmann
Member

Here is an example job where bazel completes successfully each time, yet it spends a lot of time at the end talking to buildbuddy.

Looking at that run's logs, it doesn't look like it's fetching much from the cache; it looks more like it's building a lot from scratch and pushing it to the cache. E.g.

INFO: 339 processes: 215 internal, 124 linux-sandbox.

doesn't include any remote cache hit, and the BuildBuddy UI also lists no cache hits and only cache writes for that part of the pipeline.

Would you have expected more cache hits for this CI run?

@facundominguez
Member Author

Thanks for looking at this.

Would you have expected more cache hits for this CI run?

Yes, I'm puzzled why there are no cache hits.

Rerunning the job is fast enough and shows cache hits: https://github.com/tweag/gazelle_cabal/runs/5691354186?check_suite_focus=true

@aherrmann
Member

Yes, I'm puzzled why there are no cache hits.

How long before that did the last build run? Perhaps it was GC'ed on BuildBuddy's side?
Might be worth asking about on the BuildBuddy Slack.

@facundominguez
Member Author

facundominguez commented Mar 25, 2022

On March 21st we should have had these entries cached by this job.

However, on March 24th the same entries weren't there when this other job ran.

@aherrmann
Member

Do you know how often BB garbage collects?

According to Slack there doesn't seem to be a guaranteed period.

@siggisim

Posted this in Slack, but posting here as well:

Hey @facundominguez! Unfortunately we don't guarantee a cache TTL for our free-tier users. How frequently artifacts expire from the cache is based on an LRU policy, so if there is a period during which the artifacts aren't hit, it's likely that they will get evicted.

The exact duration depends on overall system load and isn't very straightforward to predict. We are working on some architectural changes that will allow us to offer longer / more predictable TTLs for both free-tier and paid users.

I'm not sure the title of that bug is exactly right. If you look at the timing profile of one of the invocations from the run - you can see very little time is spent uploading artifacts (on the order of a few seconds throughout the entire build): https://app.buildbuddy.io/invocation/494d73a3-30d7-4db6-8e58-b0c57f7c22ed#timing

Most of the time seems to be spent actually performing the build.

A couple recommendations I'd make to speed up that build:

  • Swap out instances of cloud.buildbuddy.io in your .bazelrc and replace them with remote.buildbuddy.io - this is our new geographically distributed endpoint with greatly improved cache upload/download performance.
  • Avoid large gaps in time without runs to avoid LRU evictions.
  • Try setting a higher --jobs= number in your .bazelrc. Right now Bazel is only running 2 tasks at a time (you can see this in the timing profile as well). Setting this to 8 or higher would likely improve parallelism quite a bit and reduce the build time (though Github actions runners aren't the beefiest machines). A sketch of these .bazelrc changes follows after this list.
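
A minimal sketch of what those .bazelrc changes might look like, assuming the existing configuration points --remote_cache and --bes_backend at cloud.buildbuddy.io (the flag values below are illustrative, not copied from the repo's actual .bazelrc):

# Point the remote cache and build event uploads at the geographically distributed endpoint
build --remote_cache=grpcs://remote.buildbuddy.io
build --bes_backend=grpcs://remote.buildbuddy.io
# Allow Bazel to run more actions concurrently on the CI runner
build --jobs=8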

@facundominguez
Member Author

I answered in slack and also here.

Thanks for the tips. I'll give them a try.

Just being able to see when caches were last discarded could help to discern if there are other configuration issues.

The title is due to observing the builds while they were running. Bazel spends a lot of time printing these four messages at the end of each invocation:

INFO: Build completed successfully, 339 total actions
INFO: Build completed successfully, 339 total actions
INFO: Streaming build results to: https://app.buildbuddy.io/invocation/934c08ea-3f21-4759-b868-e3a1ce6f01a1
INFO: Build completed successfully, 339 total actions

@siggisim

Hm, that invocation claims to have taken just 10 minutes total - where are you seeing how long it's taking to print those four messages?

Taking a look at the Github actions, it looks like there are also a ton of Bazel invocations that are being run back-to-back in serial rather than in parallel. There are about 9 Bazel invocations that are run back-to-back (taking ~10 min each).

Since you're running this many invocations back to back, it might also be worth setting this Bazel flag:

--keep_backend_build_event_connections_alive=false
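
In case it's useful, a hedged sketch of persisting that flag in .bazelrc (assuming the project's other build flags already live there; it could equally be passed on the command line):

# Drop build event service connections after each invocation
# instead of keeping them open for the next one
build --keep_backend_build_event_connections_alive=false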

@facundominguez
Member Author

Hm, that invocation claims to have taken just 10 minutes total - where are you seeing how long it's taking to print those four messages?

Here's an invocation which takes longer: https://github.com/tweag/gazelle_cabal/runs/5635411297?check_suite_focus=true#step:7:275

I see the time it takes to print those messages by looking at the console output while it is being produced by the ongoing job.

@facundominguez
Member Author

--keep_backend_build_event_connections_alive=false

Thanks, we already have this here.

@siggisim

I see the time it takes to print those messages by looking at the console output while it is being produced by the ongoing job.

I'd try using remote.buildbuddy.io instead of cache.buildbuddy.io to improve upload performance.

Separately, I think increasing the --jobs= flag (currently implicitly set to 2) and running the Bazel invocations in parallel instead of in serial would speed up this pipeline in general.

It also seems like the invocations from the different Github actions steps (i.e. "Build & test", "Test the example", "Test alternative dependencies") aren't hitting each other's cache.

For example, if you look at these two actions for the same target (@bazel_gazelle//language/proto:proto), run 25 minutes apart in separate invocations:

They have different digests, different input root digests, and different output digests.

To debug this, I'd recommend capturing --execution_log_json_file= from invocations in those 3 different steps and following the cache hit debugging instructions here to find the root cause:
https://bazel.build/docs/remote-execution-caching-debug
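
A hedged sketch of what capturing and comparing those logs could look like; the file paths and target patterns below are placeholders, not taken from the actual workflow:

# In each of the three CI steps, record the executed actions to a JSON log
bazel build //... --execution_log_json_file=/tmp/exec_build_and_test.json
bazel test //... --execution_log_json_file=/tmp/exec_example.json
# Then compare the entries for the same target across the logs, e.g.
# @bazel_gazelle//language/proto:proto, to find the first action whose inputs,
# environment, or command line differ between steps.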
