Building the toolchain with PGO #503
Conversation
Don't use tools from $PREFIX/bin automatically; this can be confusing and problematic. If running with $PREFIX/bin in path, the build at this stage gets reconfigured if rebuilding after installing clang. For multistage builds, install into separate toolchain directories and add the right one to $PATH before invoking build-llvm.sh. If one really wants to use the tools from $PREFIX for a new build into the same directory, one has to add that directory to the path manually.
When cross compiling with llvm-mingw, we already implicitly use clang and lld, but we don't need to specify these options as they are set implicitly via the llvm-mingw wrappers.
This augments the result of an earlier build with the tools required for building for macOS with the tools built by build-llvm.sh.
Passing --instrumented=IR or --instrumented=Frontend builds instrumented tool binaries. After running them and merging the output profile files, run build-llvm.sh with --profile=profile.profdata.
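A minimal sketch of how the flags from this commit message are meant to be combined; the install prefixes and the training step are illustrative placeholders, not part of the actual scripts:

```sh
# Build instrumented tool binaries (Frontend or IR instrumentation); nothing
# is meant to be installed at this point, the prefix is just a placeholder.
./build-llvm.sh --instrumented=Frontend /tmp/dummy-prefix
# ... run some training compilations with the instrumented tools,
# producing one or more .profraw files ...
llvm-profdata merge -output=profile.profdata *.profraw
# Rebuild, optimized according to the merged profile.
./build-llvm.sh --profile=profile.profdata /opt/llvm-mingw
```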
Use a makefile for running a set of commands in parallel.
As build-llvm now can create a larger variety of build dir names, enumerating all possible suffixes becomes problematic (especially as they don't necessarily appear in one canonical order). Instead, iterate over potential matches. If cross compiling, it is easy as we can require the directory to contain the expected suffix. If not cross compiling, we have no explicit suffix to look for, but we can check that the directory doesn't match common cross compilation triples.
Do the same as when cross compiling; disallow detecting dependencies outside of the given CMAKE_FIND_ROOT_PATH. Don't export LLVM_DIR as we no longer need to pass it implicitly that way. This makes the errors clearer, if LLVM isn't found where expected, instead of misdetecting it from a different installation of LLVM in the system. If lldb-mi would try to find other optional dependencies, they would no longer be found; this was already the case when cross compiling though.
Allow either doing one single stage at a time, or driving the full three-stage build with one command.
This makes a cross compiler available before doing the full second stage.
Use Frontend instrumentation rather than IR. This gives less total speedup, but makes the profile much more usable across other build targets. Build the full Linux toolchain in stage1; this avoids needing to rebuild runtimes for doing the profiling in stage2, and avoids needing to wait for stage3 to complete before cross building for Windows.
…olchain This makes the docker image contain a PGO optimized compiler, without needing to redo all of the PGO build stages in docker. This reverts commit 8f42112, and fixes building the multiplatform image in a simpler way. Instead of doing two separate builds with separate Dockerfiles, just do one multiplatform build of one Dockerfile, which docker runs serially for each of the included architectures, just packaging the prebuilt toolchains in the docker image.
For this case, it can be assumed that there is some distortion in the profdata. Maybe use --sparse when merging the profraw files, to avoid cold code being size optimized?
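For reference, a minimal example of what the suggested merge step would look like; the file names are placeholders:

```sh
# Merge raw profiles with --sparse, so functions that only have zero counts
# are left out of the .profdata instead of being recorded as provably cold.
llvm-profdata merge --sparse -output=profile.profdata profiles/*.profraw
```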
What a great write-up! I see several interesting discussion points here ;)

Have you done a comparison between FatLTO (aka FullLTO) vs ThinLTO for this use case? I don't have recent benchmarks, but luckily Google provided them in the original ThinLTO paper. How applicable they are nowadays I don't know, due to developments in the LTO field since the paper was published, so we should probably run these benchmarks once again. Whether it's worth it for LLVM here, I don't know either. On the one hand, we would most likely get a small performance improvement by enabling FatLTO instead of ThinLTO - based on the Google paper and my personal experience. On the other hand, FatLTO requires much more build time and memory during the build (especially when multiple binaries are being linked with FatLTO in parallel) - for large projects like LLVM that could be a problem. However, as you mentioned above, some kind of caching could mitigate this issue, but I am not aware enough of the limitations. This tradeoff is up to you ;) By the way, here is an article by @Kobzol about the same dilemma for the Rustc compiler - maybe it will be useful too. Highly recommended reading even if it's not directly about LLVM.
Yeah, it would be a good thing to do. Unfortunately, LLVM (Clang) doesn't have such functionality directly in the compiler, so we need to do a bit of trickery with the profiles.
This is a VERY interesting and useful observation - that at least for the LLVM project, Frontend (FE) PGO is more "cross-platform" than IR PGO. I never benchmarked them for such scenarios - thank you! Even if the official Clang documentation recommends using IR PGO by default, it seems like that's not always the right choice. By the way, maybe trying CSIR PGO would be worth it too.
Did you mean
Thank you for these tests too! I am always very cautious about reusing PGO profiles across different compiler versions, since who knows which LLVM internal changes between releases affect PGO profile quality, etc. It's just safer to generate the profiles per platform but cache them somewhere (and update them regularly if needed). However, here it seems it could be a good option.
Your choice regarding the training workload is great, especially the part about covering different target architectures and different optimization levels - LLVM definitely executes different optimization pipelines in these modes. Did you check the Clang PGO CMake build scripts in the LLVM repo? Even if you decide not to reuse them, maybe some kind of inspiration regarding other aspects could be found there: internal script structure, naming, etc.
Personally speaking, I don't expect huge performance wins from this, so I don't think it's worth spending time on right now. I propose postponing this idea, since it could easily be implemented later.
Not exactly related to your comment, but maybe you would be interested in this one: llvm/llvm-project#63486 . Here I tried to apply PGO to other LLVM parts. Whether these are important enough for you or not, I don't know :) Thank you once again for working on this!
If you want to squeeze a few percent more performance out of the Linux builds, you might want to consider using LLVM's BOLT. The Rust project has been doing it for years, and while there were sometimes problems with it, they could be solved by temporarily (until the next major release) disabling some of the BOLT optimisations. It's done in an analogous way to PGO and is basically a next stage.
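For context, a rough sketch of a typical BOLT flow (not something this PR implements); the binary name, training workload and optimization flags are illustrative:

```sh
# Record an execution profile of the already optimized binary
# (requires LBR-capable hardware for the branch sampling).
perf record -e cycles:u -j any,u -o perf.data -- ./clang -O2 -c sqlite3.c
# Convert the perf profile into BOLT's format.
perf2bolt -p perf.data -o perf.fdata ./clang
# Rewrite the binary with a better code layout based on the profile.
# (For full function reordering, the binary should be linked with --emit-relocs.)
llvm-bolt ./clang -o ./clang-bolt -data=perf.fdata \
  -reorder-blocks=ext-tsp -reorder-functions=hfsort
```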
Thanks for the pointer! I'm not entirely sure if this is necessary though; for the cases where things do differ between the profiling and target environments, the profile hashes do detect a mismatch, so in that case, the execution count from the profile should be entirely ignored. Therefore, if the profile does contain a zero count, it should be a valid one (based on what use cases are included in the training workload of course). So I think it may not be necessary to use that flag. (I haven't tried benchmarking to see if it would make any measurable difference though.)
I haven't checked recently, but I did try it out a couple of years ago, see #192 (comment). But I think those numbers may have been from before enabling …. Also, I remember hearing discussions within LLVM mentioning that ThinLTO and FullLTO use slightly different optimization pipelines, and that there's much less attention spent on the FullLTO pipeline.
I didn't spend time trying that so far. I would expect it to behave the same as regular IR PGO with respect to profile reuse: when there are differences between the targets, the generated IR differs, so the hashes in the profile don't match, and the profile data gets discarded.
Oh indeed, yes I meant
Yep, if possible it's of course best to have the profiling and target environments match as closely as possible. In the end I didn't use this though; all the PGO'd tools end up built with a stage1 Clang built from the same version.
I have looked at these, and at https://github.com/llvm/llvm-project/blob/llvmorg-20.1.7/llvm/utils/release/build_llvm_release.bat#L436-L447 as well. The former builds LLVMSupport, the latter builds one object file from Clang (which, as a dependency, probably ends up building a fair bit of LLVM). For my case (building for 4 architectures and 2 configurations each), I prefer sticking to as few source files as possible, while keeping the coverage as reasonable as possible.
Yeah, it's all up to what use cases one can add to the training script. And while more of course helps, I try to keep it as small as possible unless something measurably makes a difference. Your numbers for clangd do suggest around a 20% benefit for that case. But I presume the majority of the work is spent inside regular Clang core functions, which probably already are covered by the profile we make by compiling regular C++ code. So while clangd isn't included in the instrumentation and training, it probably benefits from the PGO optimization of the common core code anyway.
Yep, I'm aware of it, but not prioritizing it at the moment, as it only benefits one platform. (I haven't tried it out and familiarized myself with it yet. And even if it did support multiple platforms, I presume that it'd be like IR PGO, and would require specific profiles for each target?) If we wanted to spend a little bit more effort to make the Linux builds a little bit faster, we could do both IR and Frontend instrumentation/training, use the IR profile for the matching Linux build and the Frontend profile for the rest, gaining around 6% on that single build configuration. But for simplicity/uniformity (and fairness? :-) ), I didn't opt to do that, at this stage at least.
Haha, no worries there: BOLT supports only Linux, and primarily only x86-64. AFAIK, AArch64 is also supported, but I suppose it's much less tested compared to x86-64 :) I don't think these limitations will be resolved in the near future, since Meta is not much interested in investing in that; they developed BOLT mainly for optimizing their server fleet, which of course is mainly x86-64-based.
Hi, @mstorsjo! I have tested this patch on the Krita build on Windows. Here are the performance results:
Background
Currently, the llvm-mingw toolchains on Linux and macOS are built with the system default toolchains on those platforms (i.e. GCC on Linux and Apple Clang on macOS). This means that the performance of the toolchains depends on how well those preexisting compilers can optimize the compiler.
The llvm-mingw builds on Windows are built with llvm-mingw (with the just built toolchain, on Linux).
Potential speedup
Clang can be made to run significantly faster if compiled with profile guided optimization (PGO).
With PGO, one first builds the application to optimize (in this case, the compiler itself) with instrumentation enabled. This allows recording which codepaths are hot and which aren't, in typical use. With the instrumented application, one runs a few training workloads, to get a profile of those use cases.
This profile can then be passed as input to a second compile of the application, to optimize according to the information gathered in the profile.
The instrumentation can be done on a couple of different levels. When building LLVM, one can pass the parameter `LLVM_BUILD_INSTRUMENTED` to CMake, set to the value `Frontend` or `IR`. (There are also other alternatives such as `CSIR`, `CSSPGO` etc - see https://aaupov.github.io/blog/2023/07/09/pgo for more details.)

The Clang performance of different build configurations has been benchmarked with https://github.com/mstorsjo/llvm-project/commits/gha-clang-perf-pgo, with results at https://github.com/mstorsjo/llvm-project/actions/runs/15560431244/job/43827179458.
The benchmark is building Clang 20.1.6 with the Ubuntu 24.04 provided distro compilers, GCC 13 and Clang 18, benchmarking compiling sqlite.
So with such a build, we could make the compiler significantly faster.
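As a concrete illustration of the CMake side of this, here is a minimal sketch of an instrumented build followed by a profile-using build; the project list, build directory names and paths are illustrative, and only the PGO related options are shown:

```sh
# Stage A: configure and build an instrumented clang in its own build dir.
cmake -G Ninja -S llvm -B build-instrumented \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang;lld" \
  -DLLVM_BUILD_INSTRUMENTED=Frontend
ninja -C build-instrumented clang
# ... run training compilations with build-instrumented/bin/clang; the raw
# profiles typically end up under the build dir's profiles/ directory ...
llvm-profdata merge -output=profile.profdata build-instrumented/profiles/*.profraw
# Stage B: rebuild clang in a separate build dir, optimized with the profile.
cmake -G Ninja -S llvm -B build-pgo \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang;lld" \
  -DLLVM_PROFDATA_FILE=$(pwd)/profile.profdata
ninja -C build-pgo clang
```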
Note that enabling LTO on its own gives fairly modest speedups, while enabling it together with PGO gives quite notable speedups.
Building with LTO can make the build a fair bit slower, when linking multiple executables, as the same intermediate objects end up compiled as part of many executables. (The LTO cache may help a bit with this though.) But as long as the build is done with dylib enabled, the majority of the code is only linked into one executable/dylib, so an LTO build still takes roughly as long as a non-LTO build (only roughly 11% longer in one test). And as long as one uses ThinLTO instead of Full LTO, it can use multiple cores efficiently.
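For reference, the LLVM CMake options corresponding to the ThinLTO + dylib setup described above would look roughly like this (other options omitted):

```sh
# ThinLTO for the LTO step itself; the dylib options put most of the code
# into the libLLVM/libclang-cpp shared libraries, so it only gets LTO
# linked once instead of once per executable.
cmake -G Ninja -S llvm -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang;lld" \
  -DLLVM_ENABLE_LTO=Thin \
  -DLLVM_LINK_LLVM_DYLIB=ON \
  -DCLANG_LINK_CLANG_DYLIB=ON
```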
Potential drawbacks
While having a faster compiler always is nice, a build like this does come with some drawbacks. There's more potential for hitting various bugs in the toolchains (and when one does that, it becomes much harder to look into those bugs). And the whole build procedure takes longer. The full build procedure ends up taking 3 LLVM builds in series (with many builds happening in parallel), where it previously was 2 LLVM builds in series.
Feasibility of PGO
Building with PGO requires building LLVM twice. And if one wants the optimized build to be made with the current LLVM version (rather than a preexisting older toolchain), it needs to be built three times. This is somewhat unwieldy for how llvm-mingw is built.
The llvm-mingw releases consist of 7 different builds (4 different Windows builds, 2 different Linux builds and 1 universal build for macOS).
Additionally, the way llvm-mingw is built, it is very much focused on being cross compiled; all the Windows releases are built on Linux. (The aarch64 Linux releases are also currently cross compiled from x86_64 Linux, although that's possibly subject to change.) These toolchains can be built for architectures we can't execute at all in the build environment.
So doing PGO the usual way, by building natively where the toolchain should be executed, profiling and rebuilding with that profile, isn't really feasible here. (And even if it would be, we wouldn't want to do 2x-3x the amount of building, even if it is free on github.)
So ideally, we would do profiling once, and reuse that single profile for all the optimized builds for all platforms. As far as I know, it is not really documented whether this is supported and to what extent it works.
I did a benchmark where I built an instrumented Clang on Linux, gathered a profile with it, applied that profile to a cross build for Windows, and benchmarked the performance of the result: https://github.com/mstorsjo/llvm-project/commits/gha-clang-cross-pgo and https://github.com/mstorsjo/llvm-project/actions/runs/15584009523/job/43903006091
(Compared with the previous table, the baseline here is a build with Clang, not GCC, so the relative speedup numbers are slightly smaller than above.)
First we can note that the regular and LTO cases give roughly the same absolute performance as in the benchmark above (which has to be checked before comparing other numbers, when benchmarks are made on a randomly assigned CI runner).

Here we can note that Frontend instrumentation gives roughly the same speedup as in the previous benchmark on Linux. IR instrumentation does give some speedup, but it performs worse than Frontend. In absolute terms, the Frontend PGO case performs only very marginally worse than in the non-cross case.

So with that, a Frontend instrumentation profile seems very reasonable to use as the single profile for all the optimized builds.

Another unusual way of stretching the PGO build is doing the instrumentation/profiling with a different version of the compiler than the one used for the optimized build. This could potentially allow skipping one stage of builds entirely, e.g. doing the profiling with an older distro provided Clang, while using the profile for cross building for Windows with the current Clang, or for building for macOS with an older Apple Clang.
For benchmarks of this scenario, see https://github.com/mstorsjo/llvm-project/commits/gha-clang-perf-pgo-mismatch and https://github.com/mstorsjo/llvm-project/actions/runs/15584439612/job/43903093163.
Here I do single-stage builds with the distro compilers from Ubuntu 22.04 (GCC 11 and Clang 14), and profiling with Clang 14 and optimized builds with Clang 20.
So once again, we see that Frontend instrumentation works with almost full efficiency when mixing compiler versions, while IR instrumentation fails to work in such a situation.

Another minor detail to note is that Frontend instrumentation takes a little bit longer to build than IR instrumentation, around 11%.

So overall, Frontend instrumentation seems very usable if we want to reuse profiles for different targets, or across compiler versions. It doesn't achieve the same performance as IR instrumentation, but the difference between the two isn't very big (around 6%). But for doing a single build for the current environment, IR instrumentation does give the best results out of these two alternatives.
Training
For doing the profiling, I've picked a small set of test compiles. I compile sqlite (one single C translation unit with lots of opportunities for the compiler to optimize), one large testcase from the libc++ testsuite (exercising compiling C++ templates) and one small C++ testcase from my testsuite.
Because compilation with/without optimizations can run different code generation codepaths, I profile compilation both with and without optimization.
And as compilation for different architectures runs different target specific code, I test compile these samples for all the supported architectures.
As part of compiling the samples, I also do linking, to provide some data on which codepaths in the linkers are hot. (One could consider sampling LTO compilation as well, in case that hits some otherwise unused codepaths, but I'm not sure how big a difference it would make.)
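A minimal sketch of what such a training loop could look like; the sample file names and the `$BIN` variable are illustrative, and the detail of how the instrumented clang gets combined with the stage 1 headers/libs (described under "Putting it all together" below) is glossed over:

```sh
# Assumed: $BIN points at an instrumented clang/lld set up to find the
# stage 1 mingw headers and libraries (hypothetical variable).
mkdir -p profiles
export LLVM_PROFILE_FILE=profiles/train-%p.profraw   # one .profraw per process
for arch in i686 x86_64 armv7 aarch64; do
    for opt in -O0 -O2; do
        # Plain C compile without linking (e.g. the sqlite amalgamation).
        $BIN/clang --target=$arch-w64-mingw32 $opt -c sqlite3.c -o /dev/null
        # A C++ testcase, compiled and linked, to also exercise the linker.
        $BIN/clang++ --target=$arch-w64-mingw32 $opt test.cpp -o test-$arch.exe
    done
done
# Merge the raw profiles into one profile for the optimized builds.
llvm-profdata merge -output=profile.profdata profiles/*.profraw
```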
Putting it all together
In practice, I've settled on always doing the PGO builds with a "stage 1" build of the current version of Clang, even if it would be possible to do it with a distro provided older version of Clang. Using a current build as stage 1 compiler makes the various builds more consistent, and it turned out that skipping the first stage didn't reduce the total end-to-end build time of the toolchain anyway.
The proposed setup first builds a full (almost) llvm-mingw toolchain on Linux, like before. It skips components like LLDB and clang-tools-extra which aren't needed at this point, but it does build all the mingw runtimes, allowing using this toolchain for cross compilation for Windows.
Secondly, the first stage compiler is used to build Clang with instrumentation. This is used, together with the headers/libs from the first stage, to test compile for all the various architectures, to gather the profile.

Finally, the profile and the stage 1 toolchain are used to build the final toolchains, both for Linux and Windows. On macOS, we first do a plain LLVM (no mingw runtimes) build, wait for the profile from Linux, then do a full build including runtimes, with that profile.
I did consider doing the profiling and Linux builds with a distro provided older Clang; however, that didn't turn out to make the end to end time of the build any shorter, and it does come with some drawbacks. If we don't have a preexisting llvm-mingw toolchain, we can't do the profiling/training by compiling for all architectures; we'd have to limit ourselves to compilation for the native Linux system. And once we do have that profile, we still need to wait for the second stage (PGO build) to have an actually usable compiler for cross compiling the Windows toolchains anyway. (We could build the mingw runtimes with the instrumented compiler, if we wanted to, but the instrumentation makes the compiler too slow to practically use for cross compiling the whole toolchain.)
In addition to building with PGO, this series of changes also enables building with ThinLTO.
Build script changes
This branch adds support in the lower level build scripts (primarily `build-llvm.sh`) for doing the various stages of PGO. These stages can also be run from `build-all.sh`. Finally there's a toplevel flag for doing all stages in one invocation (for locally doing a fully optimized PGO build).

For the changes to the build scripts, I'd be happy to take suggestions on how to adjust the interfaces, to make them as sensible and flexible as possible. CC @cjacek @alvinhochun @mati865 @jeremyd2019 @zamazan4ik @Andarwinux @longnguyen2004.
To do a PGO build of the toolchain locally, one can now do `./build-all.sh --full-lto /opt/llvm-mingw-stage1 /opt/llvm-mingw-pgo`. As the build produces two separate toolchains, it now requires providing two destination paths. This isn't entirely ideal, as it makes it harder to optionally pass the `--full-lto` flag. (I'm not entirely satisfied with the name `--full-lto` either, suggestions welcome.)

This internally calls `build-all.sh` three times: first `./build-all.sh --stage1 /opt/llvm-mingw-stage1`, for a build with mingw runtimes, without LLDB and such. Then `./build-all.sh --profile /opt/llvm-mingw-stage1`, which builds and generates `profile.profdata` - using the toolchain in that directory - but doesn't install anything into that directory. Then finally `./build-all.sh --pgo /opt/llvm-mingw-stage1 /opt/llvm-mingw-pgo`, for the final, PGO optimized build step.

Internally, `./build-all.sh --profile` calls `./build-all.sh --instrumented /tmp/dummy-prefix`, as `build-llvm.sh` takes a parameter for a destination to install into, even if we don't install anything at this point. This is mildly ugly. The differing option names, `--instrumented` vs `--profile`, aren't very pretty either, but `--instrumented` is more precise for what that `build-llvm.sh` step does, while `--profile` is a better high level name.

All in all, this PR fixes #383.