
Tracking issue for impact of 2024 MacOS fleet expansion #1981

Closed
fweikert opened this issue Jul 5, 2024 · 48 comments

@fweikert
Member

fweikert commented Jul 5, 2024

We've moved most of our MacOS CI workload from lab machines to a larger number of less powerful VMs (#1708).

Please reply to this issue if you've encountered any MacOS issues related to the migration.

@meteorcloudy
Member

FYI @keith , not sure the latest rules_apple failures are related to this or not: https://buildkite.com/bazel/rules-apple-darwin/builds/9311

@keith
Member

keith commented Jul 8, 2024

Yeah, they are. It looks like there was an Xcode update as part of this, which is good. Tracking the fix here: bazelbuild/rules_apple#2488

@Wyverald
Member

Wyverald commented Jul 8, 2024

there are some BEP-related errors in this presubmit that only happen on macOS and don't go away after retries: https://buildkite.com/bazel/bcr-presubmit/builds/6541#019092db-1572-49fe-b7e3-2f489764d67a

@ahumesky
Contributor

ahumesky commented Jul 9, 2024

It looks like rules_jvm_external is having trouble on the new macs with whatever version of java they have:
https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/3973#01909143-b698-405c-b925-02461d6dbdab
https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/3973#01909085-2f47-4ce2-a9f9-21637ede87eb

(09:35:35) ERROR: An error occurred during the fetch of repository 'rules_android_maven':
   Traceback (most recent call last):
	File "/private/var/tmp/_bazel_buildkite/e429569ac118389554140bb6d293e1ed/external/rules_jvm_external/coursier.bzl", line 985, column 38, in _coursier_fetch_impl
		dep_tree = make_coursier_dep_tree(
	File "/private/var/tmp/_bazel_buildkite/e429569ac118389554140bb6d293e1ed/external/rules_jvm_external/coursier.bzl", line 889, column 30, in make_coursier_dep_tree
		if parse_java_version(exec_result.stdout + exec_result.stderr) > 8:
	File "/private/var/tmp/_bazel_buildkite/e429569ac118389554140bb6d293e1ed/external/rules_jvm_external/private/java_utilities.bzl", line 48, column 29, in parse_java_version
		return get_major_version(first_line[i:j])
	File "/private/var/tmp/_bazel_buildkite/e429569ac118389554140bb6d293e1ed/external/rules_jvm_external/private/java_utilities.bzl", line 32, column 19, in get_major_version
		return int(java_version.split(".")[0])
Error in int: empty string

I can try updating the version of rules_jvm_external, it looks like the latest version doesn't have this parse_java_version function
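For context, the traceback above dies because `int("")` is called when the version string can't be extracted from the `java -version` output. The following is a hypothetical illustration (not the actual rules_jvm_external code) of parsing the major Java version defensively, handling both the legacy `1.8.0_292` and modern `17.0.11` schemes:

```python
import re
from typing import Optional

def parse_java_major_version(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output.

    Returns None when no version string is found, instead of
    crashing the way int("") does in the traceback above.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: "1.x" means Java x (e.g. "1.8.0" is Java 8).
    if major == 1 and match.group(2):
        return int(match.group(2))
    return major
```

With this shape, unexpected JVM banners on the new VMs would surface as a clear "could not determine Java version" error rather than `Error in int: empty string`.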

@ahumesky
Contributor

ahumesky commented Jul 9, 2024

I also noticed that the Mac arm64 tests on rules_android are failing with "Bad CPU type in executable" when they try to execute aapt2; I haven't looked into that further.
https://buildkite.com/bazel/rules-android/builds/2789#01908fb0-ef50-4143-b77d-c5e335830cc4

@Wyverald
Member

Wyverald commented Jul 9, 2024

hmm. Also seeing macOS jobs on Bazel 6.x stuck on Waiting for remote cache: 1 upload after a successful build (see https://buildkite.com/bazel/bcr-presubmit/builds/6573#019094af-bd7d-44a6-98f5-a9068585e48e). Bazel 7.x jobs are okay, so this might have been a bug in Bazel 6.x that has since been fixed.

@ahumesky
Contributor

ahumesky commented Jul 9, 2024

bazelbuild/rules_android#244 fixes the parse_java_version failure

@ahumesky
Contributor

ahumesky commented Jul 9, 2024

I also notice that the mac arm64 tests on rules_android are failing with "Bad CPU type in executable" when it tries to execute aapt2, I haven't looked into that further https://buildkite.com/bazel/rules-android/builds/2789#01908fb0-ef50-4143-b77d-c5e335830cc4

Ok so downloading the android sdk build tools for mac:
~/android-sdk$ REPO_OS_OVERRIDE=macosx ./cmdline-tools/7.0/bin/sdkmanager --sdk_root=$HOME/android-sdk-macosx --install "build-tools;30.0.3" "build-tools;35.0.0"

and then examining the aapt2 binary for the currently installed build-tools version 30.0.3:
~/android-sdk$ file ~/android-sdk-macosx/build-tools/30.0.3/aapt2
/usr/local/google/home/ahumesky/android-sdk-macosx/build-tools/30.0.3/aapt2: Mach-O 64-bit x86_64 executable, flags:<NOUNDEFS|DYLDLINK|TWOLEVEL|WEAK_DEFINES|BINDS_TO_WEAK|PIE|HAS_TLV_DESCRIPTORS>

then the latest version 35.0.0:
~/android-sdk$ file ~/android-sdk-macosx/build-tools/35.0.0/aapt2
/usr/local/google/home/ahumesky/android-sdk-macosx/build-tools/35.0.0/aapt2: Mach-O universal binary with 2 architectures: [x86_64:
- Mach-O 64-bit x86_64 executable, flags:<NOUNDEFS|DYLDLINK|TWOLEVEL|WEAK_DEFINES|BINDS_TO_WEAK|PIE|HAS_TLV_DESCRIPTORS>] [
- arm64:
- Mach-O 64-bit arm64 executable, flags:<NOUNDEFS|DYLDLINK|TWOLEVEL|WEAK_DEFINES|BINDS_TO_WEAK|PIE|HAS_TLV_DESCRIPTORS>]

The currently installed 30.0.3 ships only an x86_64 binary, while the latest version is a universal binary containing both x86_64 and arm64.

The arm64 tests were passing before, but I'm not sure how or why; I guess there was some x86_64 emulation going on that's not enabled on the new runners? In any case, we should just update the Android SDK build tools on the CI images (https://github.com/bazelbuild/continuous-integration/blob/master/macos/mac-android.sh etc.) to the latest build-tools 35.0.0.

@meteorcloudy
Member

there are some BEP-related errors in this presubmit that only happen on macOS and don't go away after retries: https://buildkite.com/bazel/bcr-presubmit/builds/6541#019092db-1572-49fe-b7e3-2f489764d67a

Probably unrelated to macOS migration, see bazelbuild/bazel-central-registry#2373 (comment)

@meteorcloudy
Member

meteorcloudy commented Jul 9, 2024

If your builds on Intel macOS frequently run into timeouts like bazelbuild/rules_go#3969 (comment) after this change, it is because the machines in the new fleet are less powerful than the previous ones (2-core Mac Mini vs. 20-core iMac Pro).

Consider applying the following changes:

  • Migrate to the Apple Silicon platform (macos_arm64); Intel macOS is a legacy platform anyway.
  • Add --local_test_jobs=2 to your test flags to reduce the number of tests running in parallel. (This is already done automatically by CI.)
  • Shard your task by adding shards:<N> to your presubmit.yml file. (Please don't set N too large, since we still have a limited number of machines available.)
  • Bump your test size to allow a longer timeout.
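As a sketch of the sharding option, a presubmit.yml task could look like the following (the task name and targets are placeholders; only the shards:<N> key is taken from the advice above):

```yaml
tasks:
  macos:
    platform: macos
    shards: 3            # split the test set across 3 machines
    test_targets:
      - "//..."
```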

@meteorcloudy
Member

hmm. Also seeing macOS jobs on Bazel 6.x stuck on Waiting for remote cache: 1 upload after a successful build (see https://buildkite.com/bazel/bcr-presubmit/builds/6573#019094af-bd7d-44a6-98f5-a9068585e48e).

This is caused by --experimental_remote_cache_async, and there is still a bug in Bazel 6.x. We can remove this flag from the default CI steps. @fweikert

@jsharpe
Contributor

jsharpe commented Jul 9, 2024

rules_foreign_cc is having issues with the new macOS runners too: https://buildkite.com/bazel/rules-foreign-cc/builds/5745#_

@meteorcloudy
Member

rules_foreign_cc is having issues with the new macOS runners too: https://buildkite.com/bazel/rules-foreign-cc/builds/5745#_

Probably mesonbuild/meson#12282 ?

@jsharpe
Contributor

jsharpe commented Jul 9, 2024

rules_foreign_cc is having issues with the new macOS runners too: buildkite.com/bazel/rules-foreign-cc/builds/5745_

Probably mesonbuild/meson#12282 ?

Looks like it, but the fix is probably something that the Apple toolchain should apply, as rules_foreign_cc can't really know the version of Xcode and when it's needed. Any opinions, @keith?

copybara-service bot pushed a commit to bazelbuild/rules_android that referenced this issue Jul 9, 2024
…sue on new mac CI machines. See bazelbuild/continuous-integration#1981 (comment)

Closes #244
Closes #242

PiperOrigin-RevId: 650660707
Change-Id: I3e93a893684786951e1f345ee9749ee62ec7049e
@keith
Member

keith commented Jul 9, 2024

Hrm, seems like the repo's files should deal with this if they need it? Theoretically you could know what the current Xcode version is if you wanted, but unless you're setting that version wrong today, I don't think that would help with downstream issues.

github-merge-queue bot pushed a commit to bazelbuild/bazel that referenced this issue Jul 10, 2024
- Upgraded Bazel to 7.2.1 to include remote cache fixes
- Backport changes to adapt
bazelbuild/continuous-integration#1981

---------

Co-authored-by: Googler <[email protected]>
@brentleyjones

WARNING: ignoring JAVA_TOOL_OPTIONS in environment.
--
  | Jul 10 18:56:13.449 ERROR bazelci_agent::artifact::upload: Failed to read /tmp/tmpft73_y33/test_bep.json
  |  
  | Caused by:
  | No such file or directory (os error 2)
  | Jul 10 18:56:13.451 ERROR bazelci_agent::artifact::upload: Failed to read /tmp/tmpft73_y33/test_bep.json
  |  
  | Caused by:
  | No such file or directory (os error 2)
  | Jul 10 18:56:13.451 ERROR bazelci_agent::artifact::upload: Failed to read /tmp/tmpft73_y33/test_bep.json
  |  
  | Caused by:
  | No such file or directory (os error 2)
  | Jul 10 18:56:13.451 ERROR bazelci_agent::artifact::upload: Failed to read /tmp/tmpft73_y33/test_bep.json
  |  
  | Caused by:
  | No such file or directory (os error 2)
  | Jul 10 18:56:13.451 ERROR bazelci_agent::artifact::upload: Failed to read /tmp/tmpft73_y33/test_bep.json
  |  
  | Caused by:
  | No such file or directory (os error 2)
  | Error: Failed to read /tmp/tmpft73_y33/test_bep.json
  |  
  | Caused by:
  | No such file or directory (os error 2)
  | (18:56:15) WARNING: Option 'experimental_remote_build_event_upload' is deprecated: Use --remote_build_event_upload instead

https://buildkite.com/bazel/bcr-presubmit/builds/6641#01909d90-e33f-49cf-ba58-fcd588073673/415-445

@lkassar-stripe

Any chance I could get some help with one of the macOS failures on my PR? It seems like it's probably related to this issue; please let me know if there are any configurations I should be changing here. This is my first contribution here, so any advice is welcome! bazelbuild/bazel-gazelle#1822

@keith
Member

keith commented Jul 11, 2024

Likely unrelated, I don't have access to retry it but hopefully someone can

@meteorcloudy
Member

@lkassar-stripe The timeout is likely caused by the Bazel binary being unable to reach the network in an IPv6-only environment.
See my comment here: bazelbuild/rules_go#3969 (comment)

Unfortunately, the test framework needs to inject --host_jvm_args=-Djava.net.preferIPv6Addresses=true as a startup option somewhere. If this is urgent, you can temporarily switch back to the macos_legacy or macos_arm64_legacy platform to use the old Mac fleet, but we will remove those in the near future.
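For reference, the flags in question (both appear elsewhere in this thread) can live in a bazelrc file, e.g. /private/etc/bazel.bazelrc on the Mac VMs or a project's own .bazelrc; this is a sketch, not the exact file shipped on the fleet:

```
# Prefer IPv6 on the v6-only CI network.
startup --host_jvm_args=-Djava.net.preferIPv6Addresses=true
build --jvmopt=-Djava.net.preferIPv6Addresses
```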

@meteorcloudy
Member

ERROR bazelci_agent::artifact::upload: Failed to read /tmp/tmpft73_y33/test_bep.json

@coeuvre Do you know what's happening?

I'm also seeing

Command '['/tmp/tmpft73_y33/bazelci-agent', 'artifact', 'upload', '--debug', '--delay=5', '--mode=buildkite', '--build_event_json_file=/tmp/tmpft73_y33/test_bep.json']' returned non-zero exit status 1.

from time to time.

@coeuvre
Member

coeuvre commented Jul 11, 2024

Hard to tell from this error message. Do you have link to the build which produced the error?

@meteorcloudy
Member

Yes, this one #1981 (comment)

@sluongng

@meteorcloudy I am having problems in Bazel.git as well.

Currently, to run a shell test on my macOS machine, I have to apply this patch:

diff --git a/src/test/shell/bazel/remote_helpers.sh b/src/test/shell/bazel/remote_helpers.sh
index c6eba4dcad..5cc3b3036a 100755
--- a/src/test/shell/bazel/remote_helpers.sh
+++ b/src/test/shell/bazel/remote_helpers.sh
@@ -16,7 +16,7 @@

 set -eu

-setup_localjdk_javabase
+# setup_localjdk_javabase

 # Serves $1 as a file on localhost:$nc_port.  Sets the following variables:
 #   * nc_port - the port nc is listening on.
diff --git a/src/test/shell/testenv.sh.tmpl b/src/test/shell/testenv.sh.tmpl
index 4058bcf74d..efa11a368d 100755
--- a/src/test/shell/testenv.sh.tmpl
+++ b/src/test/shell/testenv.sh.tmpl
@@ -390,11 +390,11 @@ EOF
     echo "startup --install_base=$TEST_INSTALL_BASE" >> $TEST_TMPDIR/bazelrc
   fi

-  if is_darwin; then
-    echo "Add flags to prefer ipv6 network"
-    echo "startup --host_jvm_args=-Djava.net.preferIPv6Addresses=true" >> $TEST_TMPDIR/bazelrc
-    echo "build --jvmopt=-Djava.net.preferIPv6Addresses" >> $TEST_TMPDIR/bazelrc
-  fi
+  # if is_darwin; then
+  #   echo "Add flags to prefer ipv6 network"
+  #   echo "startup --host_jvm_args=-Djava.net.preferIPv6Addresses=true" >> $TEST_TMPDIR/bazelrc
+  #   echo "build --jvmopt=-Djava.net.preferIPv6Addresses" >> $TEST_TMPDIR/bazelrc
+  # fi
 }

This is the test command I am running on my Mac: bazel test src/test/shell/bazel/remote:remote_execution_test

Without commenting out the ipv6 flags, I would get:

-- Test log: -----------------------------------------------------------
$TEST_TMPDIR defined: output root default is '/private/var/tmp/_bazel_sluongng/4528d0377c5d584786af4e2410173691/sandbox/darwin-sandbox/4912/execroot/_main/_tmp/ee9047f7e54192d1c96723be06c02001' and max_idle_secs default is '15'.
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Invocation ID: 71d0a02e-0619-44f8-abb8-383473c5d0ff
Computing main repo mapping:
Computing main repo mapping:
ERROR: Error computing the main repository mapping: Error accessing registry https://bcr.bazel.build/: Failed to fetch registry file https://bcr.bazel.build/modules/rules_proto/4.0.0/MODULE.bazel: No route to host
------------------------------------------------------------------------

@meteorcloudy
Member

Yeah, this is the same problem. We probably should also remove --nosystem_rc from https://cs.opensource.google/bazel/bazel/+/master:src/test/shell/bin/bazel;l=26 and let the Bazel binary inside the test pick up the ipv6 flag from /private/etc/bazel.bazelrc on the Mac VM, so that user builds are not affected.

/cc @fweikert

@sluongng

I wonder if it's a good opportunity to make this a platform constraint somehow and augment the tests with a select.

@fweikert
Member Author

Filed #2003 to track the Intel MacOS problem with test analytics (caused by this migration).

@dws

dws commented Jul 23, 2024

Possible MacOS issue here: https://buildkite.com/bazel/bazel-bazel-github-presubmit/builds/22496#0190dcaf-a704-424e-9441-dd7426239485
CI passed on a rerun.

ERROR: /Users/buildkite/builds/bk-macos-pln3-1zzn/bazel/bazel-bazel-github-presubmit/src/main/java/com/google/devtools/build/lib/bazel/BUILD:219:12: Building deploy jar src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar [for tool] failed: I/O exception during sandboxed execution: 19 errors during bulk transfer:

@sluongng

sluongng commented Jul 23, 2024

I'm also getting a new persistent issue on a specific macOS shard:

https://buildkite.com/bazel/bazel-bazel-github-presubmit/builds/22503#0190dede-b514-430d-9794-2bb5847c81fe/383-417

Error in download_and_extract: java.io.IOException: Error extracting /private/var/tmp/_bazel_buildkite/dc75035d57eb8cda9b01612d7ce9b862/external/rules_javatoolchains~remotejdk17_macos/temp5719727540427988246/zulu17.50.19-ca-jdk17.0.11-macosx_x64.tar.gz to /private/var/tmp/_bazel_buildkite/dc75035d57eb8cda9b01612d7ce9b862/external/rules_javatoolchains~remotejdk17_macos/temp5719727540427988246: Corrupted TAR archive.

I did try to push another commit to re-trigger CI for this PR but the same issue happened again.

@meteorcloudy
Member

And it happened on the same machine bk-macos-pln3-90qy (build 1, build 2). @fweikert Can we report such errors back to the internal team offering the VMs?

@FaBrand

FaBrand commented Jul 23, 2024

For my builds a few of the macOS ones failed with Exited with status -1 (agent lost)
https://buildkite.com/bazel/bazel-bazel-github-presubmit/builds/22506#0190dfc4-1964-461d-945a-8c3002019307

Some even in the workspace preparation Stage

Or in Waiting for agent

@meteorcloudy
Member

@FaBrand We are deploying an update to the VMs, sorry for the inconvenience.

@FaBrand

FaBrand commented Jul 23, 2024

@FaBrand We are deploying an update to the VMs, sorry for the inconvenience.

Thanks for the info, @meteorcloudy :) Do I have to do something to retrigger my checks?

@meteorcloudy
Member

There is already a retry job scheduled for each lost agent; they will run as soon as the VMs are back online.

@criemen

criemen commented Jul 23, 2024

I'm not sure if it's related to the new Mac CI workers, but as this issue is prominently featured on Buildkite, I'll report it here. If you think it's unrelated, I'm happy to report it as a separate issue elsewhere.

I'm hitting

(00:54:28) ERROR: /Users/buildkite/builds/bk-macos-pln3-yigd/bazel/bazel-bazel-github-presubmit/src/test/java/com/google/devtools/build/lib/skyframe/BUILD:1527:10: Building src/test/java/com/google/devtools/build/lib/skyframe/LocalRepositoryLookupFunctionTest.jar (1 source file) failed: Worker process returned an unparseable WorkResponse!
Did you try to print something to stdout? Workers aren't allowed to do this, as it breaks the protocol between Bazel and the worker process.
---8<---8<--- Start of response ---8<---8<---
Not UTF-8, printing as hex
02 18 01 02 18 03 02 18  02 02 18 04 02 18 06 02  |........ ........|
18 05 02 18 07 02 18 01  02 18 02 02 18 03 02 18  |........ ........|
04 02 18 06 02 18 05 02  18 07 02 18 08 02 18 09  |........ ........|
02 18 0A 02 18 0C 02 18  0B 02 18 0D 02 18 0E 02  |........ ........|
18 0F 02 18 11 02 18 10  02 18 12 02 18 13 02 18  |........ ........|
14 02 18 15 23 0A 23 20  41 20 66 61 74 61 6C 20  |....#.#  A fatal |
65 72 72 6F 72 20 68 61  73 20 62 65 65 6E 20 64  |error ha s been d|
65 74 65 63 74 65 64 20                           |etected          |
---8<---8<--- End of response ---8<---8<---
---8<---8<--- Exception details ---8<---8<---
java.io.IOException: Worker process for Javac has died
...

on https://buildkite.com/bazel/bazel-bazel-github-presubmit/builds/22520#0190e1c7-b7fe-4b6b-934c-624bee706f7d which is an exception I've never come across yet.

@FaBrand

FaBrand commented Jul 24, 2024

@meteorcloudy FYI There is one shard that wasn't retried again after 4 Agent losses:
https://buildkite.com/bazel/bazel-bazel-github-presubmit/builds/22506#0190e052-2f69-493f-bb0a-c11b6f93dcf1

@FaBrand

FaBrand commented Jul 24, 2024

@meteorcloudy FYI There is one shard that wasn't retried again after 4 Agent losses: https://buildkite.com/bazel/bazel-bazel-github-presubmit/builds/22506#0190e052-2f69-493f-bb0a-c11b6f93dcf1

I guess someone retriggered it, thank you dear anonymous helper 💚

@albertocavalcante

bazel/buildtools: macOS arm64

I'm facing an error on this build:

/bin/bash: external/jq_darwin_arm64/jq: Bad CPU type in executable

More info:

ERROR: /Users/buildkite/builds/bk-macos-arm64-cn9q/bazel/buildtools/buildifier/npm/BUILD.bazel:43:21: Jq buildifier/npm/package.json failed: (Exit 126): bash failed: error executing Jq command (from target //buildifier/npm:package)
--
  | (cd /private/var/tmp/_bazel_buildkite/4e34735c8e9a8b6c609838cd5f90f526/sandbox/darwin-sandbox/629/execroot/buildtools && \
  | exec env - \
  | /bin/bash -c 'external/jq_darwin_arm64/jq  '\''$ARGS.named.STAMP as $stamp\|.version = ($stamp.BUILD_SCM_VERSION // "0.0.0")'\'' '\''buildifier/npm/package.json'\'' > bazel-out/darwin_arm64-fastbuild/bin/buildifier/npm/package.json')
  | # Configuration: 05e9e48b199e39946b162f1e18d6c3760aa2e4131e967a19ba1e1be8890508b2
  | # Execution platform: @@internal_platforms_do_not_use//host:host
  |  
  | Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
  | /bin/bash: external/jq_darwin_arm64/jq: Bad CPU type in executable

@meteorcloudy
Member

We need the fix from bazelbuild/bazel#18444, which is only included in 7.3.0. We can pin Bazel version for buildtools to 7.3.0rc1. /cc @vladmos

@pat-jpnk

pat-jpnk commented Aug 8, 2024

Facing the following build error for MacOS, related to this merge request

@meteorcloudy
Member

CI infra problems for the new macOS fleet should have been addressed already; closing this one for now.

fweikert added a commit to fweikert/continuous-integration that referenced this issue Sep 10, 2024
fweikert added a commit to fweikert/continuous-integration that referenced this issue Sep 10, 2024
fweikert added a commit to fweikert/continuous-integration that referenced this issue Sep 10, 2024
fweikert added a commit that referenced this issue Sep 10, 2024