Outboard toolchains #641

Open
cormacrelf opened this issue Jan 28, 2024 · 5 comments

@cormacrelf
Contributor

cormacrelf commented Jan 28, 2024

The problem

  1. Remote builds need hermetic toolchains.
  2. Toolchains are much like regular build rules and can be built on-demand.1 Projects like rules_nixpkgs exploit this to great effect.
  3. Toolchains often need absolute paths. Nix especially so.2
  4. However, fixed, absolute paths are completely incompatible with RBE caching. If your toolchain is built by bazel/buck, but must be placed in /nix/store to execute properly, then it can't be tracked / uploaded / downloaded by an executor. So Nix toolchains only work on local machines. So much for hermeticity.

(4) makes things very impractical for Nix-based toolchains, and the only viable solution right now is not to build toolchains on-demand at all. Instead, you must describe every nix dependency you need in advance, package it all up into a Docker image (example), and re-upload the image every time you want to add a tool. When you have a few GB of dependencies, my experience with a similar pattern is that while you can do it, this slows down delivering tooling improvements to a snail's pace. (It is always feasible if you have a "developer platforms team" or similar at a big company. But the Nix project is your "developer platforms team" at a small company.)


Why should Native Link help?

I believe this is a pretty significant barrier to adopting remote builds. Anyone can use bazel and buck without RBE, it's child's play. But getting a remote-compatible toolchain together is pretty hard, and takes a lot of effort. At my employer, this has almost single-handedly blocked the adoption of Buck -- the benefits of RBE would take too much time investment to obtain, so the whole thing is not worth it. Anything you can offer to make this work would be a big differentiator.

How can Native Link help?

You're enterprising folks, and you seem to be using Nix yourselves. Do you have any ideas to make on-demand Nix toolchains work? I have two:

1. Absolute output_paths

Is there perhaps a way to bring /nix/store back into the cache? I can imagine a possible extension to the RBE protocol, to support absolute paths in action cache output paths, moderated by a whitelist of /nix/store for example. Tricky because all Nix actions would write into the same directory. If every Nix "store path" were a separate action in the bazel/buck build graph, and you had to provide the actual path in e.g. nix_package(name = ..., expr = "pkgs.coreutils", path = "/nix/store/hcil3fgcjav0y458ff4m98zgcqky00gk-coreutils-9.3"), then this would be doable, especially if you could autogenerate rules like this.
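
To make the whitelist idea concrete, here is a minimal sketch of the kind of check an executor could apply to output paths. This is purely illustrative: the constant and function names are invented, and nothing like this exists in the RBE protocol or Native Link today.

// Hypothetical sketch: an "allowed absolute prefixes" check for output paths.
const ALLOWED_ABSOLUTE_PREFIXES: &[&str] = &["/nix/store"];

fn output_path_is_allowed(path: &str) -> bool {
    // Relative paths stay inside the input root, as the protocol already requires.
    if !path.starts_with('/') {
        return true;
    }
    // Absolute paths are only accepted under a whitelisted prefix.
    ALLOWED_ABSOLUTE_PREFIXES.iter().any(|p| path.starts_with(p))
}

// output_path_is_allowed("bazel-out/bin/main.o")                                      -> true
// output_path_is_allowed("/nix/store/hcil3fgcjav0y458ff4m98zgcqky00gk-coreutils-9.3") -> true
// output_path_is_allowed("/usr/bin/clang")                                            -> false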

2. Resolver for symlinks to things in /nix/store

Alternatively, there could be special treatment of symlinks to things in /nix/store or any other configurable whitelisted path. That would not require modifying RBE clients for support, because they can all send symlinks already. It wouldn't even require changing rules_nixpkgs, because it already uses symlinks to /nix/store, placed in the bazel output. (I have still not released my version of rules_nixpkgs for buck, but it too works this way.)

The idea involves a symlink resolver, which would attempt to pre-resolve symlinks of some given pattern by just hitting a Nix remote cache. The resolver would run before actions that depend on that symlink execute. This could be configured like so:

  • a whitelist of paths like /nix/store to have a resolver run for, like { "/nix/store": ["./nix_resolver.sh"] }
  • a resolver, i.e. a command line program that is run with any paths to fetch as input. Nix users will want this to run e.g. nix-store --realize /nix/store/hcil3fgcjav0y458ff4m98zgcqky00gk-coreutils-9.3 for each command line argument (a sketch of one possible resolver follows this list). Obviously you can configure your own nix remote cache to hit, etc, in the shell script. The resolver can output a list of paths to be hardlinked alongside the rest of the action inputs. (Hardlink? Not sure. Although if Nix is going to be running on the worker nodes in order to resolve these things, might want to let Nix manage the /nix directory itself, copy resolved paths to the NL cache, and then expose those cached paths through some other means, like bind mounts. I'm thinking this would be handled by an environment variable provided to the execution script, and so people can choose to use docker volume mounts to make the paths available.)
  • a cache configuration for resolved paths. You probably don't want to hit a remote store every time you run an action with sh in it, when sh could be stored in the regular cache hierarchy and be treated like any other file.
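
To make the resolver contract concrete, here is a minimal sketch of what such a resolver could do, written as a small Rust binary purely for illustration (the configuration above imagines a shell script like ./nix_resolver.sh). The assumed contract: store paths come in as command line arguments, and the realized paths (including the transitive closure) go out on stdout, one per line.

use std::process::Command;

fn main() {
    // Store paths to resolve, e.g. "/nix/store/hcil3fgcjav0y458ff4m98zgcqky00gk-coreutils-9.3".
    let requested: Vec<String> = std::env::args().skip(1).collect();

    // Realise the requested paths from whatever substituters / remote caches
    // the worker's nix configuration points at.
    let status = Command::new("nix-store")
        .arg("--realise")
        .args(&requested)
        .status()
        .expect("failed to run nix-store --realise");
    assert!(status.success(), "nix-store --realise failed");

    // Emit the full runtime closure, not just the requested paths, so the
    // caller can cache/mount every dependency a package needs at run time.
    let closure = Command::new("nix-store")
        .args(["--query", "--requisites"])
        .args(&requested)
        .output()
        .expect("failed to run nix-store --query --requisites");
    print!("{}", String::from_utf8_lossy(&closure.stdout));
}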

This would play really nicely with local builds, because your local machine's rules_nixpkgs or equivalent will just be realizing nix paths from a nix cache, symlinking those paths into the bazel/buck output directory, and adding those symlinks as Nix GC roots so that nix store gc doesn't delete them. This will end up having the exact same effect when it comes to remote execution, just that the symlinks will be made to work a different way.

Footnotes

  1. Say you've got a repo with a frontend and a backend. A backend developer might never trigger the toolchain rules that download Node.js in order to build a JS bundle for the frontend.

  2. Installed nix packages live in /nix/store, and contain many absolute path references to other things in /nix/store. This is just a slightly more upfront incarnation of a problem that exists elsewhere; Nix just fails very very fast when it's in the wrong place. All dynamically linked libraries are referenced via absolute paths in ELF files, for example.

@aaronmondal
Member

aaronmondal commented Jan 29, 2024

@cormacrelf Thanks for this extensive issue ❤️

Instead, you must describe every nix dependency you need in advance, package it all up into a Docker image (example), and re-upload the image every time you want to add a tool. When you have a few GB of dependencies, my experience with a similar pattern is that while you can do it, this slows down delivering tooling improvements to a snail's pace.

In the LRE setup this is a desirable behavior. We're quite interested in effective scaling to zero which means that we want to reduce cold-start times for worker deployments as much as possible. Embedding the toolchain into the container rather than fetching it on-demand makes a big difference in wall-clock time as it's easier to control container pulls/pushes than fetches from external repos (e.g. rust nightly takes wayyy too long to fetch).

We're planning an important twist to the "standard" RBE container setup though: The nativelink Scheduler can dispatch execution jobs based on platform properties. This means that with a single remote worker endpoint we can make an arbitrary number of different toolchains and targets available and we can create configurations where those different workers all still operate against the same cache.

Personally, I think that slow updates/modifications to toolchain containers are by far the biggest hindrance with existing RBE setups. Also, the fact that 99% of RBE toolchains use generic paths like /usr/bin/clang doesn't make this any better. The risk of unintentionally mixing toolchains and things not failing immediately is IMO quite scary.

We plan to expand the LRE setup into language-specific containers. For instance, one container for Clang/C++, one container for Cuda (which would effectively be a superset of a clang container), a container for python, one for java etc etc.

Do you have any ideas to make on-demand Nix toolchains work?

  1. Absolute output_paths

This sounds very similar to what we intend to do. As of now, I think the missing piece is a more flexible (or I guess a more "specialized") variant of the rbe_configs_gen tool (https://github.com/bazelbuild/bazel-toolchains). Things that would have to be changed are:

  • More supported toolchains. At least C++, Python, Rust and Go should work (and we probably need java 🥲).
  • More flexibility for custom toolchains. It should be "easy" to add new autogen logic, considering that, when set up in a "clean" way, the toolchains for most rules all boil down to very similar *_toolchain targets like this. It seems reasonable to provide enough examples there, and the toolchain configs in rules_nixpkgs seem to be a good point of reference as well.
  • Try to make generation work without creating an intermediary container. Nix derivations probably have enough information already, and the container build/push-to-local/start/inspect that the current rbe_configs_gen tool does seems unnecessarily slow.
  • If used in conjunction with nixos-generators, it's possible to create cloud-vm images for "native" workers that don't make use of containerization at all.

IMO instead of fetching toolchains with bazel, users could use nix as the default environment and have Bazel "inherit" from that environment. This approach also has the benefit that it's much easier to create containers (pkgs.dockerTools) and OS images (nixos-generators) from nix derivations than from Bazel targets. (However, this does go against your footnote 1 as it's AFAIK not possible to fetch "part" of a flake).

I'm not familiar enough with Buck2 toolchains yet to tell whether the same is true for Buck2. We do intend to support Buck2 as a first-class citizen at some point though, so this is certainly an area where we're super interested in any information that might be relevant.

  2. Resolver for symlinks to things in /nix/store

I haven't played around with that idea. Initially I'm slightly sceptical whether this would interop nicely with e.g. container runtimes that mount certain paths, like the nvidia-container-toolkit CDI, which mounts specific locations that could become unresolvable if they are symlinks. I haven't investigated this approach enough yet to be able to make a good assessment here.

This would play really nicely with local builds [...]

As an additional datapoint, mirroring the nix-shell and the remote toolchains allows using the LRE toolchains without an actual remote. For instance, a nix flake combined with a trusted remote cache setup allows sharing each other's build artifacts between local builds on different (but architecturally similar) machines. No remote execution involved at all. This was super useful for a smaller team without the bandwidth to manage larger-scale cloud infra. For instance, perfect cache sharing between Gentoo, WSL2, NixOS, Ubuntu and nix-based containers is doable with this setup.

@cormacrelf
Contributor Author

cormacrelf commented Jan 30, 2024

Thanks for your detailed response!

Personally, I think that slow updates/modifications to toolchain containers are by far the biggest hindrance with existing RBE setups.

Too right. The equivalent of this task literally takes a couple of hours for our current setup, which is just a CI pipeline. For that entire time, my poor computer uses 100% CPU and/or thrashes the disk so hard I can't do anything else. I automated large swaths of it and I've still been putting another run off for two weeks.

Worker cold-start times

We're quite interested in effective scaling to zero which means that we want to reduce cold-start times for worker deployments as much as possible.

I figured this might be addressed by the outboard toolchains being cached by Native Link itself, not every worker independently. Both of my ideas involved exploiting the cache-busting properties of the Nix store, where the name of the store path incorporates the content. On a worker cold start, the symlink one would look like (in this file):

// in `fn download_to_directory`

// ambient
let mut outboard_cache = HashMap::new();

#[cfg(target_family = "unix")]
for symlink_node in directory.symlinks {
    // eg "/nix/store/hcil3fgcjav0y458ff4m98zgcqky00gk-coreutils-9.3"
    if symlink_node.target.starts_with("/nix/store") {
        get_outboard(symlink_node.target);
    }
}

fn get_outboard(target: &str) {
    let fake_outboard_action = Action::new().command(
        // double colons because that's an invalid path in any league
        ["::nativelink::outboard", target]
    );
    let digest = fake_outboard_action.digest();

    // Happy path = no action required, worker already has the action result in the hashmap
    let action_result = outboard_cache.entry(digest).or_insert_with(|| {
        if let Some(result) = action_cache.GetActionResult(digest) {
            // most worker cold starts = just fetch from the cache
            result
        } else {
            // this branch only taken when you add a dependency, basically
            let resolved = run_configured_resolver(target, fake_outboard_action);
            action_cache.UpdateActionResult(resolved.clone());
            resolved
        }
    });
    // use action_result.output_files & output_directories, but since they will be relative paths,
    // treat them relative to some root that we will bind-mount to /nix/store (or wherever else
    // the user wants to mount it).
    for dir in action_result.output_directories { // and paths, i guess
        download_to_directory(..., dir, "/tmp/some-dir-to-bind-mount-to-nix-store");
    }
}

As you can see, it would piggy-back on almost everything, including that individual components of toolchains that are rarely used (or have stopped being used) get evicted from caches like anything else. And since workers can download only the parts of the toolchains they actually need, worker cold starts have a much, much faster lower bound than downloading a big blob of Docker container with the kitchen sink.
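
For completeness, here is one possible (and deliberately simplified) shape of the run_configured_resolver helper referenced above, assuming the resolver contract from idea 2: the worker feeds the symlink target to the configured script and reads back realized paths, one per line, which it would then ingest into the CAS and record on the synthetic ActionResult.

use std::process::Command;

// Hypothetical: the real helper would also take the fake outboard action so it
// can build and upload a matching ActionResult; here we only collect the paths.
fn run_configured_resolver(target: &str, resolver_cmd: &str) -> Vec<String> {
    let output = Command::new(resolver_cmd)
        .arg(target)
        .output()
        .expect("resolver failed to start");
    assert!(output.status.success(), "resolver returned non-zero");

    // Each stdout line is a path (e.g. under /nix/store) to ingest into the
    // CAS and later expose to actions via hardlinks or bind mounts.
    String::from_utf8_lossy(&output.stdout)
        .lines()
        .map(str::to_owned)
        .collect()
}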

Since toolchains often comprise files of radically different sizes from the rest of your action outputs, like a few hundred MB of LLVM stuff or "we use a 46GB VDSO file and launch it as a test runner", I would think a separate action cache / store configuration might be necessary. Other than that it would fit right in.

rbe_configs_gen approach

Things that would have to change are:
More supported toolchains. At least C++, Python, Rust and Go should work (and we probably need java 🥲).

It may be useful for some people, but I am not all that interested in depending on a community project to add support for more languages. The first thing I did in Buck was write my own toolchains and rules for things that weren't supported. Being able to do that is really quite important. If it can let you write your own codegen, then sure.

[Most bazel toolchains have] similar *_toolchain targets like this.
[...] I'm not familiar enough with Buck2 toolchains yet to tell whether the same is true for Buck2.

The current recommended API boundary for the Buck toolchains is that you create your own version of the toolchain rule, and you create and return e.g. a RustToolchainInfo provider from that rule. The built-in toolchain rules (system_rust_toolchain) are basically for demo use only. The actual rule is going to be named something different and support different arguments in every Buck project; there are no open source rule definitions like ll_toolchain to build codegen around.

Other

interop nicely with e.g. container runtimes that mount certain paths, like the nvidia-container-toolkit CDI

Users would have control over the paths that are intercepted & resolved. While people could technically bind-mount at /usr/bin, I don't think anyone will, and if they choose to do something silly like that, it's not my problem. And if they do not register a resolver for /nvidia paths, then symlinks to /nvidia/... will not result in any additional mounting.

nix flake combined with, with a trusted remote cache setup [...] This was super useful for a smaller team without the bandwidth to manage a larger scale cloud infra.

I have looked into that; unfortunately, Buck2 doesn't have any form of isolation when you execute locally, so it can't detect when people accidentally or otherwise don't use Nix and immediately start poisoning caches. So a trusted cache with local execution with Buck2 is safe for a one-person team, but no more than that. RBE is the only way to enforce it at the moment (coincidentally, Facebook runs all their builds on RBE wherever possible).

@aaronmondal
Member

Too right.

This strengthens my belief that we really need to make sure that "ease of bumping dependencies" should be a major focus for whatever we end up implementing. Ultimately, it might even make sense to empirically test different approaches (docker vs lazy loading etc).

Worker cold-start times

FYI, we briefly considered a NixStore implementation. Such a store could be a wrapper around any arbitrary backend like the S3Store, a FilesystemStore or an InMemoryStore. We already have a bunch of other wrapper stores, and initially this seems like it's on the "easier" side of things we could implement.

However, something that is essentially a nix API wrapper could also be considered a potential feature drift. Personally, I'm quite open to the idea - I literally tried to merge the codebases for attic and NativeLink at some point lol.

@MarcusSorealheis @adam-singer @allada What are your opinions on this? I think a NixStore or similar seems reasonable and doesn't strike me as too much of a feature drift/maintenance burden. The attic codebase should give us a bunch of pointers to get this working.

Since toolchains often comprise files of radically different sizes from the rest of your action outputs, like a few hundred MB of LLVM stuff or "we use a 46GB VDSO file and launch it as a test runner", I would think a separate action cache / store configuration might be necessary. Other than that it would fit right in.

I believe @allada might have some valuable insights on this. WDYT: to what degree should toolchains be in container-land vs Starlark-land?

My current line of thought is:

  • The default for most projects is "one container per build"
  • What I currently think should be done is "one container per toolchain"
  • I believe what @cormacrelf is proposing is "one container per build with better lazy fetching of toolchains outside of the Bazel build graph with eviction etc handled by NativeLink" (please correct me if I'm wrong).

@cormacrelf Regarding the code example, I feel like I'm not grasping things in their entirety. Could you elaborate what the benefit of this is as opposed to using a rules_nixpkgs-like approach in Starlark-land?

rbe_configs_gen approach
[...] If it can let you write your own codegen, then sure.
[...] return e.g. a RustToolchainInfo provider from that rule.

Yes. A framework that easily allows creating custom codegen pipelines is what I had in mind. PoC-wise just a translation from nix derivations to RBE toolchain generators.

FYI I've also played with the idea of drawing inspiration from LLVM/MLIR, similar to how Mitosis does it, to create some sort of "NativeLink IR" for toolchains. @allada and I talked about this concept a bunch of times but the scope is daunting. Imagine a framework that allows creating RBE-compatible toolchains from some external source via some sort of pluggable to-proto and from-proto generators:

nix -> NLIR
Bazel + docker with preinstalled toolchain -> NLIR
CMake + docker with preinstalled toolchain -> NLIR
Make + docker with preinstalled toolchain -> NLIR

NLIR -> Bazel toolchains (generates rule that returns XxToolchainInfo provider)
NLIR -> Buck2 toolchains (generates rule that returns XxToolchainInfo provider)
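
Purely as an illustration of that concept, a toolchain IR plus pluggable generators might look something like the sketch below. Every name here (NlirToolchain, ToNlir, FromNlir) is invented for the example; nothing like this exists yet.

/// A hypothetical, deliberately tiny "NativeLink IR" for a toolchain.
pub struct NlirToolchain {
    pub name: String,
    /// Logical tool name -> concrete path, e.g. ("cc", "/nix/store/blah-clang-15.0.1/bin/clang").
    pub tools: Vec<(String, String)>,
    /// Environment variables the toolchain expects.
    pub env: Vec<(String, String)>,
}

/// Frontends: nix derivations, docker images with preinstalled toolchains, etc.
pub trait ToNlir {
    fn to_nlir(&self) -> NlirToolchain;
}

/// Backends: emit e.g. a Bazel or Buck2 rule body that returns the
/// appropriate XxToolchainInfo provider.
pub trait FromNlir {
    fn emit(toolchain: &NlirToolchain) -> String;
}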

Other
[...] I don't think anyone will, and if they choose to do something silly like that, it's not my problem.

CDI mounts specific executables. Not the entire /usr/bin, but e.g. /usr/bin/nvidia-smi. A major concern I have is out-of-the-box compatibility with clusters running the nvidia-gpu-operator which makes use of CDI configs under the hood. Recent developments in NixOS make me optimistic that this will be a non-issue though: https://github.com/NixOS/nixpkgs/pull/284507/files#diff-57782cd16247ea1d169fe691e68298764e647ae50a8dfbff60f8060e0a0bc24f

@MarcusSorealheis
Collaborator

@cormacrelf can you email me please? I do not intend to dox you or sell you.

@cormacrelf
Contributor Author

"one container per build with better lazy fetching of toolchains outside of the Bazel build graph with eviction etc handled by NativeLink (please correct me if I'm wrong).

✔️ although "one container per build" => maybe you mean "one container image per repo", which also describes the default for most projects, to my mind.

@cormacrelf Regarding the code example, I feel like I'm not grasping things in their entirety. Could you elaborate what the benefit of this is as opposed to using a rules_nixpkgs-like approach in Starlark-land?

It is meant to support a rules_nixpkgs approach in starlark land, which does not work on its own. It's not an alternative to rules_nixpkgs; rather, it is an alternative to pre-loading nix dependencies in a container image.

I think I know what you mean -- my pseudocode looks like it could be done in starlark! Just have a build rule fetch a nix package... right? But it cannot work on a remote, because of everything above about absolute paths. Again, for clarity:

  • If you ran a rules_nixpkgs package action ON the remote (which it does not allow you to do!) then the things Nix downloads would be gone before you ran the next action. The actions are all local-only for this reason.
  • Using local-only actions + RBE's caching APIs alone, you cannot get an RBE remote to place built nix packages in /nix/store, which means they can never be executed on a remote. Even if you resolved the symlinks locally and copied them to the bazel/buck output directory, you would get cryptic errors like file not found which you strace down to /nix/store/ld03l52xq2ssn4x0g5asypsxqls40497-glibc-2.37-8/lib64/ld-linux-x86-64.so.2 being missing. Because it is not there. I know because I've done it.

You can make rules_nixpkgs work on a remote by pre-building a container image that has every conceivable nix path already there, so that when you upload a symlink (made by a local-only nix build that rules_nixpkgs runs), it is still valid on the remote. If you wanted to do this part with starlark, starlark alone is not capable of ensuring that container image is uploaded before the remote executes anything -- your RBE docker image is at best a config flag you provide to bazel or buck, so you would need a bazel wrapper of some kind.

Bazel and especially buck wrappers are pretty tricky given that, at least in Buck, your toolchain rules and their dependencies are parametrized by both execution and target platforms. They are completely dynamic, and nix_build(pkg = "pkgs.libgcc") will happily fetch a different version of libgcc when you tell Buck to build for a different target CPU architecture. I know that Bazel is much less flexible about that. Buck is so flexible at constructing toolchains, and they're so blended with regular build targets, that I can hardly conceive of a way to enumerate all the possible nix paths to include in a container image if they were done this way.

What does the rules_nixpkgs approach give you in the output directory + on the remote?

For example, zlib. When you run the local-only rules_nixpkgs rules, in your output folder it dumps a bunch of symlinks, roughly like this, which is just the symlink output of nix-build on a simple nix expression that outputs pkgs.zlib.

bazel's/output/dir/some_hash/__zlib__/output -> /nix/store/8xgb8phqmfn9h971q7dg369h647i1aa0-zlib-1.3
bazel's/output/dir/some_hash/__zlib__/output-dev -> /nix/store/9a4ivnm2hwggi6qjg0gcpk05f5hsw5r5-zlib-1.3-dev
-- or the source --
bazel's/output/dir/some_hash/__zlib__/output-src -> /nix/store/blah-zlib-1.3.tar.gz

Obviously this goes beyond toolchain rules; there is no such thing as a "zlib toolchain". But rules_nixpkgs does exactly the same thing when you ask it to get clang, and then depend on that from your toolchain rules. In which case, you will have:

bazel's/output/dir/some_hash/__clang__/output -> /nix/store/blah-clang-15.0.1
bazel's/output/dir/some_hash/__compiler_rt__/output-dev -> /nix/store/blah-compiler-rt-15.0.1
bazel's/output/dir/some_hash/__glibc__/output-dev -> /nix/store/mrgib0s2ayr81xv1q84xsjg8ijybalq3-glibc-2.38-27-dev
etc...

And your toolchain will make executables that look like ${output}/bin/clang++ and ${output}/bin/ld.lld, and header include directories like ${output-dev}/include etc. The ${output} directories become output_symlinks from a previous action that a Native Link worker has to write out to actual symlinks at execution time. Since your toolchain will generally invoke clang as clang++ --rtlib ${compiler_rt} -idirafter ${glibc_dev}/include etc, a "build main.cxx to main.o" action will depend on a bunch of clang-related symlinks being present, and the worker will do this, roughly:

PREPARE DIR      bazel's/output/dir/some_other_hash/__my_target__
CREATE SYMLINK   bazel's/output/dir/some_hash/__clang__/output -> /nix/store/blah-clang-15.0.1
CREATE SYMLINK   bazel's/output/dir/some_hash/__compiler_rt__/output-dev -> /nix/store/blah-compiler-rt-15.0.1
LOAD FILE        main.cxx
EXECUTE          bazel's/output/dir/some_hash/__clang__/output/bin/clang++ main.cxx -o bazel's/output/dir/some_other_hash/__my_target__/main.o

Currently, that execute step fails when your container image does not have Nix clang and the symlink deref fails. Or if you just bumped clang to 16.0.0, and the container image you had built is still on 15.0.1. My proposal is that Native Link just runs a script that fetches the symlink target from a nix remote and caches its results, so your container images do not need to contain any Nix stuff in them, not even an installation of Nix. (You would need to install Nix on the workers themselves.)

If it helps, this is an outdated and fairly dodgy version of my Nix Clang toolchain rule for Buck2, which basically does what Nix's clang wrapper package does, but using clang-unwrapped, and relying entirely on buck's cxx rules instead of NIX_ environment variables. https://gist.github.com/cormacrelf/c28f400b87eb5a285435f94459fae48a

How specific to nix does the feature have to be?

However, something that is essentially a nix API wrapper could also be considered a potential feature drift.

To some extent, the resolver + cache approach would have to be designed to support nix, as each of those symlinks will usually depend on a bunch of other nix paths that aren't named in the one symlink. Most nix packages depend in some way on glibc, for instance, and you don't want 400 copies of glibc in the cache, or indeed 12 copies of the entirety of LLVM under other downstream deps' cache keys. Maybe you can always execute the resolver, which prints a list of paths (the transitive deps) to possibly fetch from cache. You can cache that step as well.

But I don't think you have to take it much further than that. It is plausible that building it in a generic way will allow people to come up with other creative ideas, like having the resolver directly mount some readonly network drive if the build tries to reference it, and emit no paths for NL to fetch.

@MarcusSorealheis MarcusSorealheis self-assigned this Feb 1, 2024