Building the toolchain with PGO #503

Open · wants to merge 20 commits into master

Conversation

mstorsjo
Owner

Background

Currently, the llvm-mingw toolchains on Linux and macOS are built with the system default toolchains on those platforms (i.e. GCC on Linux and Apple Clang on macOS). This means that the performance of the toolchains depends on how well those preexisting compilers can optimize the compiler.

The llvm-mingw builds for Windows, on the other hand, are built with llvm-mingw itself (with the just-built toolchain, cross compiling from Linux).

Potential speedup

Clang can be optimized to run significantly faster if it is compiled with profile-guided optimization (PGO).

With PGO, one first builds the application to optimize (in this case, the compiler itself) with instrumentation enabled. This allows recording which codepaths are hot and which aren't, in typical use. With the instrumented application, one runs a few training workloads, to get a profile of those use cases.

This profile can then be passed as input to a second compile of the application, to optimize according to the information gathered in the profile.
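At the level of plain compiler flags, the cycle looks roughly like the sketch below (a minimal illustration, not the actual setup used here; file names are placeholders, and -fprofile-instr-generate/-use corresponds to Frontend instrumentation while -fprofile-generate/-use corresponds to IR instrumentation):

```sh
# 1. Build the application with instrumentation (Frontend variant shown)
clang -O2 -fprofile-instr-generate app.c -o app
# 2. Run a training workload; the instrumented binary writes a raw profile
LLVM_PROFILE_FILE=app.profraw ./app typical-input
# 3. Merge the raw profile(s) into an indexed profile
llvm-profdata merge -o app.profdata app.profraw
# 4. Rebuild, optimizing according to the gathered profile
clang -O2 -fprofile-instr-use=app.profdata app.c -o app
```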

The instrumentation can be done at a couple of different levels. When building LLVM, one can pass the CMake parameter LLVM_BUILD_INSTRUMENTED, set to the value Frontend or IR. (There are also other alternatives such as CSIR, CSSPGO etc - see https://aaupov.github.io/blog/2023/07/09/pgo for more details.)
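For LLVM itself, the generic two-configuration flow documented upstream looks roughly like this (a sketch with placeholder directory names; the build scripts in this PR drive the same steps differently):

```sh
# Configuration A: build an instrumented Clang
cmake -S llvm -B build-instrumented -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang;lld" \
  -DLLVM_BUILD_INSTRUMENTED=Frontend
ninja -C build-instrumented clang

# ...run training compiles with build-instrumented/bin/clang, which writes raw
# profiles (by default under the instrumented build directory), then merge them:
llvm-profdata merge -o profile.profdata build-instrumented/profiles/*.profraw

# Configuration B: build the optimized Clang with the merged profile
cmake -S llvm -B build-pgo -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang;lld" \
  -DLLVM_PROFDATA_FILE=$PWD/profile.profdata
ninja -C build-pgo
```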

The Clang performance of different build configurations has been benchmarked with https://github.com/mstorsjo/llvm-project/commits/gha-clang-perf-pgo, with results at https://github.com/mstorsjo/llvm-project/actions/runs/15560431244/job/43827179458.

The benchmark builds Clang 20.1.6 with the distro compilers provided by Ubuntu 24.04 (GCC 13 and Clang 18), and then measures the time to compile sqlite with the resulting Clang.

Build configuration       time (s)   speedup
GCC                       20.541     0%
Clang                     20.213     1%
Clang, LTO                18.535     10%
Clang, PGO(Frontend)      15.758     30%
Clang, PGO(IR)            15.040     36%
Clang, LTO+PGO(Frontend)  14.625     40%
Clang, LTO+PGO(IR)        13.753     49%

So with such a build, we could make the compiler significantly faster.

Note that enabling LTO on its own gives fairly modest speedups, while enabling it together with PGO gives quite notable speedups.

Building with LTO can make the build a fair bit slower, when linking multiple executables, as the same intermediate objects end up compiled as part of many executables. (The LTO cache may help a bit with this though.) But as long as the build is done with dylib enabled, the majority of the code is only linked into one executable/dylib, so an LTO build still takes roughly as long as a non-LTO build (only roughly 11% longer in one test). And as long as one uses ThinLTO instead of Full LTO, it can use multiple cores efficiently.
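For reference, the CMake switches corresponding to this kind of configuration (ThinLTO together with the shared libLLVM/libclang dylibs) are roughly the following; a sketch, not necessarily the exact flags used by the build scripts in this branch:

```sh
cmake -S llvm -B build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_LTO=Thin \
  -DLLVM_LINK_LLVM_DYLIB=ON \
  -DCLANG_LINK_CLANG_DYLIB=ON
```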

Potential drawbacks

While having a faster compiler is always nice, a build like this does come with some drawbacks. There's more potential for hitting various bugs in the toolchains (and when one does, it becomes much harder to look into those bugs). And the whole build procedure takes longer: it now requires 3 LLVM builds in series (with many builds happening in parallel), where it previously required 2.

Feasibility of PGO

Building with PGO requires building LLVM twice. And if one wants the optimized build to be made with the current LLVM version (rather than a preexisting older toolchain), it needs to be built three times. This is somewhat unwieldy for how llvm-mingw is built.

The llvm-mingw releases consist of 7 different builds (4 different Windows builds, 2 different Linux builds and 1 universal build for macOS).

Additionally, the way llvm-mingw is built, it is very much focused on being cross compiled; all the Windows releases are built on Linux. (The aarch64 Linux releases are also currently cross compiled from x86_64 Linux, although that's possibly subject to change.) These toolchains can be built for architectures we can't execute at all in the build environment.

So doing PGO the usual way, by building natively where the toolchain should be executed, profiling and rebuilding with that profile, isn't really feasible here. (And even if it would be, we wouldn't want to do 2x-3x the amount of building, even if it is free on github.)

So ideally, we would do profiling once, and reuse that single profile for all the optimized builds for all platforms. As far as I know, it is not really documented whether this is supported and to what extent it works.

I did a benchmark where I built an instrumented Clang on Linux, gathered a profile with it, applied that profile to a cross build for Windows, and benchmarked the result: https://github.com/mstorsjo/llvm-project/commits/gha-clang-cross-pgo and https://github.com/mstorsjo/llvm-project/actions/runs/15584009523/job/43903006091

Build configuration  time (s)   speedup
Regular              20.265     0%
LTO                  18.834     7%
PGO(Frontend)        16.140     25%
PGO(IR)              17.494     15%
LTO+PGO(Frontend)    14.991     35%
LTO+PGO(IR)          16.347     23%

(Compared with the previous table, the baseline here is a build with Clang, not GCC, so the relative speedup numbers are slightly smaller than above.)

First we can note that the regular and LTO cases give roughly the same absolute performance as in the benchmark above (which has to be checked before comparing other numbers, when benchmarks are made on a randomly assigned CI runner).

Here we can note that Frontend instrumentation gives roughly the same speedup as in the previous benchmark on Linux. IR instrumentation does give some speedup, but performs worse than Frontend. In absolute terms, the Frontend PGO case performs only marginally worse than in the non-cross case.

So with that, a Frontend instrumentation profile seems very reasonable to use as the single profile for all the optimized builds.

Another unusual way of stretching the PGO build is if the instrumentation/profiling is done with a different version of the compiler than the one used for the optimized build. This could potentially allow skipping one stage of builds entirely, e.g. doing the profiling with an older distro-provided Clang, while using the resulting profile when cross building for Windows with the current Clang, or when building for macOS with an older Apple Clang.

For benchmarks of this scenario, see https://github.com/mstorsjo/llvm-project/commits/gha-clang-perf-pgo-mismatch and https://github.com/mstorsjo/llvm-project/actions/runs/15584439612/job/43903093163.

Here I do single-stage builds with the distro compilers from Ubuntu 22.04 (GCC 11 and Clang 14), and then do the profiling with Clang 14 and the optimized builds with Clang 20.

Build configuration             time (s)   speedup
GCC 11                          19.664     0%
Clang 14                        20.008     -1%
Clang 14, LTO                   18.547     6%
Clang 14+20, PGO(Frontend)      15.444     27%
Clang 14+20, PGO(IR)            17.872     10%
Clang 14+20, LTO+PGO(Frontend)  14.468     35%
Clang 14+20, LTO+PGO(IR)        16.754     17%

So once again, we see that Frontend instrumentation works with almost full efficiency when mixing compiler versions, while IR instrumentation loses most of its benefit in such a situation.

Another minor detail to note is that building with Frontend instrumentation takes a little bit longer than with IR instrumentation, around 11%.

So overall, Frontend instrumentation seems very usable if we want to reuse profiles for different targets, or across compiler versions. It doesn't achieve the same performance as IR instrumentation, but the difference between the two isn't very big (around 6%). For a single build for the current environment, though, IR instrumentation gives the best results of these two alternatives.

Training

For doing the profiling, I've picked a small set of test compiles. I compile sqlite (one single C translation unit with lots of opportunities for the compiler to optimize), one large testcase from the libc++ testsuite (exercising compiling C++ templates) and one small C++ testcase from my testsuite.

Because compilation with/without optimizations can run different code generation codepaths, I profile compilation both with and without optimization.

And as compilation for different architectures runs different target-specific code, I test compile these samples for all the supported architectures.

As part of compiling the samples, I also do linking, to provide some data on what codepaths in the linkers are hot. (One could consider sampling LTO compilation as well, if that hits some otherwise unused codepaths, but I'm not sure how big a difference it would make.)
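Put together, the training pass amounts to something like the following sketch; the paths, sample file names and the architecture list are placeholders, not the actual scripts in this branch:

```sh
# Each invocation of the instrumented compiler writes out a raw profile;
# use a %p pattern so successive runs don't overwrite each other.
mkdir -p profiles
export LLVM_PROFILE_FILE="$PWD/profiles/train-%p.profraw"
TRAIN_BIN=/path/to/instrumented-toolchain/bin

for arch in i686 x86_64 armv7 aarch64; do
  for opt in -O0 -O2; do
    # C sample (sqlite, compile only) and C++ samples (compiled and linked)
    "$TRAIN_BIN/$arch-w64-mingw32-clang"   $opt -c sqlite3.c -o sqlite3-$arch.o
    "$TRAIN_BIN/$arch-w64-mingw32-clang++" $opt libcxx-test.cpp -o libcxx-test-$arch.exe
    "$TRAIN_BIN/$arch-w64-mingw32-clang++" $opt small-test.cpp  -o small-test-$arch.exe
  done
done

# Merge all raw profiles into the single profile used for the optimized builds.
llvm-profdata merge -o profile.profdata profiles/*.profraw
```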

Putting it all together

In practice, I've settled on always doing the PGO builds with a "stage 1" build of the current version of Clang, even if it would be possible to do it with a distro provided older version of Clang. Using a current build as stage 1 compiler makes the various builds more consistent, and it turned out that skipping the first stage didn't reduce the total end-to-end build time of the toolchain anyway.

The proposed setup first builds a full (almost) llvm-mingw toolchain on Linux, like before. It skips components like LLDB and clang-tools-extra which aren't needed at this point, but it does build all the mingw runtimes, allowing this toolchain to be used for cross compiling for Windows.

Secondly, the first-stage compiler is used to build Clang with instrumentation. This instrumented Clang is then used, together with the headers/libs from the first stage, to test compile for all the supported architectures, gathering the profile.

Finally, at this point, the profile and the stage 1 toolchain are used to build the final toolchains, both for Linux and Windows. On macOS, we first do a plain LLVM (no mingw runtimes) build, wait for the profile from Linux, then do a full build including runtimes, with that profile.

I did consider doing the profiling and Linux builds with a distro-provided older Clang, but that didn't turn out to make the end-to-end time of the build any shorter, and it does come with some drawbacks. If we don't have a preexisting llvm-mingw toolchain, we can't do the profiling/training by compiling for all architectures; we'd have to limit ourselves to compiling for the native Linux system. And once we do have that profile, we still need to wait for the second stage (the PGO build) to have an actually usable compiler for cross compiling the Windows toolchains anyway. (We could build the mingw runtimes with the instrumented compiler if we wanted to, but the instrumentation makes the compiler too slow to practically use for cross compiling the whole toolchain.)

In addition to building with PGO, this series of changes also enables building with ThinLTO.

Build script changes

This branch adds support in the lower-level build scripts (primarily build-llvm.sh) for doing the various stages of PGO. These stages can also be run from build-all.sh. Finally, there's a top-level flag for doing all stages in one invocation (for locally doing a fully optimized PGO build).

For the changes to the build scripts, I'd be happy to take suggestions on how to adjust the interfaces, to make them as sensible and flexible as possible. CC @cjacek @alvinhochun @mati865 @jeremyd2019 @zamazan4ik @Andarwinux @longnguyen2004.

To do a PGO build of the toolchain locally, one can now do ./build-all.sh --full-lto /opt/llvm-mingw-stage1 /opt/llvm-mingw-pgo. As the build produces two separate toolchains, it now requires providing two destination paths. This isn't entirely ideal, as it makes it harder to optionally pass the --full-lto flag. (I'm not entirely satisfied with the name --full-lto either, suggestions welcome.)

This calls build-all.sh three times internally; first ./build-all.sh --stage1 /opt/llvm-mingw-stage1, for a build with mingw runtimes, without LLDB and such. Then ./build-all.sh --profile /opt/llvm-mingw-stage1, which builds and generates profile.profdata - using the toolchain in that directory - but doesn't install anything into that directory. Then finally ./build-all.sh --pgo /opt/llvm-mingw-stage1 /opt/llvm-mingw-pgo, for the final, PGO optimized build step.
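Spelled out, the three individual stages behind the all-in-one flag are:

```sh
./build-all.sh --stage1  /opt/llvm-mingw-stage1                      # full toolchain with mingw runtimes, no LLDB etc.
./build-all.sh --profile /opt/llvm-mingw-stage1                      # instrumented build + training, produces profile.profdata
./build-all.sh --pgo     /opt/llvm-mingw-stage1 /opt/llvm-mingw-pgo  # final, PGO optimized build
```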

Internally, ./build-all.sh --profile calls ./build-all.sh --instrumented /tmp/dummy-prefix, as build-llvm.sh takes a parameter for a destination to install into, even though we don't install anything at this point. This is mildly ugly. The differing option names (--instrumented vs --profile) aren't very pretty either, but --instrumented is more precise about what that build-llvm.sh step does, while --profile is a better high-level name.

All in all, this PR fixes #383.

mstorsjo added 20 commits June 11, 2025 16:48
Don't use tools from $PREFIX/bin automatically; this can be
confusing and problematic. If running with $PREFIX/bin in path,
the build at this stage gets reconfigured if rebuilding after
installing clang.

For multistage builds, install into separate toolchain
directories and add the right one to $PATH before invoking
build-llvm.sh. If one really wants to use the tools from $PREFIX
for a new build into the same directory, one has to add that
directory to the path manually.
When cross compiling with llvm-mingw, we already implicitly use
clang and lld, but we don't need to specify these options as they
are set implicitly via the llvm-mingw wrappers.
This augments the result of an earlier build, with tools required
for building for macOS with the tools built with build-llvm.sh.
Passing --instrumented=IR or --instrumented=Frontend builds
instrumented tool binaries. After running them and merging the
output profile files, run build-llvm.sh with
--profile=profile.profdata.
Use a makefile for running a set of commands in parallel.
As build-llvm now can create a larger variety of build dir names,
enumerating all possible suffixes becomes problematic (especially
as they don't necessarily appear in one canonical order).

Instead, iterate over potential matches. If cross compiling, it
is easy as we can require the directory to contain the expected
suffix. If not cross compiling, we have no explicit suffix to look
for, but we can check that the directory doesn't match common
cross compilation triples.
Do the same as when cross compiling; disallow detecting dependencies
outside of the given CMAKE_FIND_ROOT_PATH. Don't export LLVM_DIR
as we no longer need to pass it implicitly that way.

This makes the errors clearer, if LLVM isn't found where expected,
instead of misdetecting it from a different installation of LLVM
in the system.

If lldb-mi would try to find other optional dependencies, they would
no longer be found; this was already the case when cross compiling
though.
Allow both doing one single stage at a time, or driving the
full build with three stages all in one command.
This makes a cross compiler available before doing the full second
stage.
Using Frontend instrumentation rather than IR. This gives less
total speedup, but makes the profile much more usable across
other build targets.

Build the full Linux toolchain in stage1; this avoids needing to
rebuild runtimes for doing the profiling in stage2, and avoids
needing to wait for stage3 to complete before cross building
for Windows.
…olchain

This makes the docker image contain a PGO optimized compiler,
without needing to redo all of the PGO build stages in docker.

This reverts commit 8f42112,
and fixes building the multiplatform image in a simpler way.

Instead of doing two separate builds, with separate Dockerfiles,
just do one multiplatform build of one Dockerfile, which
docker runs serially, for each one of the included architectures,
just packaging the prebuilt toolchains in the docker image.
@Andarwinux
Contributor

Building with PGO requires building LLVM twice. And if one wants the optimized build to be made with the current LLVM version (rather than a preexisting older toolchain), it needs to be built three times. This is somewhat unwieldy for how llvm-mingw is built.

The llvm-mingw releases consist of 7 different builds (4 different Windows builds, 2 different Linux builds and 1 universal build for macOS).

Additionally, the way llvm-mingw is built, it is very much focused on being cross compiled; all the Windows releases are built on Linux. (The aarch64 Linux releases are also currently cross compiled from x86_64 Linux, although that's possibly subject to change.) These toolchains can be built for architectures we can't execute at all in the build environment.

So doing PGO the usual way, by building natively where the toolchain should be executed, profiling and rebuilding with that profile, isn't really feasible here. (And even if it would be, we wouldn't want to do 2x-3x the amount of building, even if it is free on github.)

So ideally, we would do profiling once, and reuse that single profile for all the optimized builds for all platforms. As far as I know, it is not really documented whether this is supported and to what extent it works.

For this case, it can be assumed that there is some distortion in the profdata. Maybe use --sparse to merge the profraw to avoid cold code being size optimized?
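For reference, the suggested merge step would look roughly like this (a sketch; --sparse omits functions whose counts are all zero from the merged profile, so the optimized build treats them as having no profile data rather than as cold):

```sh
llvm-profdata merge --sparse -o profile.profdata profiles/*.profraw
```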

@zamazan4ik

zamazan4ik commented Jun 15, 2025

What a great write-up!

I see several interesting discussion points here, so here we go ;)

In addition to building with PGO, this series of changes also enables building with ThinLTO.

Have you done a comparison between FatLTO (aka FullLTO) vs ThinLTO for this use case? I don't have recent benchmarks, but luckily Google provided them in the original ThinLTO paper. How relevant they are nowadays I don't know, given the developments in the LTO field since the paper was published, so we could probably run these benchmarks once again. Whether it's worth it for LLVM here, I don't know either. On the one hand, we most likely would get a small performance improvement by enabling FatLTO instead of ThinLTO, based on the Google paper and my personal experience. On the other hand, FatLTO requires much more build time and memory during the build (especially when multiple binaries are linked in parallel with FatLTO); for a large project like LLVM that could be a problem. However, as you mentioned above, some kind of caching could mitigate this issue, but I am not aware enough of its limitations. This tradeoff is up to you ;) By the way, here is an article by @Kobzol about the same dilemma for the Rustc compiler - maybe it will be useful too. Highly recommended reading even if it's not directly about LLVM.

For this case, it can be assumed that there is some distortion in the profdata. Maybe use llvm/llvm-project#63024 (comment) to merge the profraw to avoid cold code being size optimized?

Yeah, it would be a good thing to do. Unfortunately, LLVM (Clang) doesn't have such functionality directly in the compiler, so we need to do a bit of trickery with the profiles.

So with that, a Frontend instrumentation profile seems very reasonable to use as the single profile for all the optimized builds.

This is a VERY interesting and useful observation: at least for the LLVM project, Frontend (FE) PGO is more "cross-platform" than IR PGO. I never benchmarked them in such scenarios - thank you! Even though the official Clang documentation recommends using IR PGO by default, it seems that's not always the best choice. By the way, maybe trying CSIR PGO would be worth it too.

I'm not entirely satisfied with the name --full-lto either, suggestions welcome

Did you mean --full-pgo here? Because in your current PR I see only --full-pgo instead of --full-lto. FullLTO is definitely not a good choice :D (at least because that name is already taken - it's another name for FatLTO, huh).

Another unusual way of stretching the PGO build is if the instrumentation/profiling is done with a different version of the compiler than the one used for the optimized build.

Thank you for such tests too! I am always very cautious about reusing PGO profiles across different compiler versions, since who knows which LLVM internal changes between releases affect PGO profile quality, etc. It's just safer to generate the profiles per platform but cache them somewhere (and update them regularly if needed). However, here it seems it could be a good option.

Training ...

Your choice regarding the training workload is great, especially the part with covering different target architectures and different optimization levels - LLVM definitely executes different optimization pipelines in these modes.

Did you check these Clang PGO CMake build scripts in the LLVM repo? Even if you decide not to reuse them, maybe some kind of inspiration regarding other aspects could be found there: scripts internal structure, naming, etc.

One could consider sampling LTO compilation as well, if that hits some otherwise unused codepaths as well, but I'm not sure how big difference it would make

Personally speaking, I don't expect huge performance wins from this, so I don't think it's worth spending time on right now. I propose postponing this idea, since it could easily be implemented later.

It skips components like LLDB and clang-tools-extra which aren't needed at this point

Not exactly related to your comment but maybe you would be interested in this one: llvm/llvm-project#63486 . Here I tried to apply PGO to other LLVM parts. Are these important enough for you or not - I don't know :)

Thank you once again for working on it!

@mati865
Contributor

mati865 commented Jun 15, 2025

If you want to squeeze a few percent more performance out of Linux builds, you might want to consider using LLVM's BOLT. The Rust language has been doing it for years, and while there were sometimes problems with it, they could be solved by temporarily (until the next major release) disabling some of the BOLT optimisations.

It's done in an analogous way to PGO and is basically a next stage.

@mstorsjo
Owner Author

So ideally, we would do profiling once, and reuse that single profile for all the optimized builds for all platforms. As far as I know, it is not really documented whether this is supported and to what extent it works.

For this case, it can be assumed that there is some distortion in the profdata. Maybe use --sparse to merge the profraw to avoid cold code being size optimized?

Thanks for the pointer! I'm not entirely sure if this is necessary though; for the cases where things do differ between the profiling and target environments, the profile hashes do detect a mismatch, so in that case, the execution count from the profile should be entirely ignored. Therefore, if the profile does contain a zero count, it should be a valid one (based on what use cases are included in the training workload of course). So I think it may not be necessary to use that flag. (I haven't tried benchmarking to see if it would make any measurable difference though.)

@mstorsjo
Owner Author

In addition to building with PGO, this series of changes also enables building with ThinLTO.

Have you done a comparison between FatLTO (aka FullLTO) vs ThinLTO for this use case? I don't have recent benchmarks, but luckily Google provided them in the original ThinLTO paper. How relevant they are nowadays I don't know, given the developments in the LTO field since the paper was published, so we could probably run these benchmarks once again. Whether it's worth it for LLVM here, I don't know either.

I haven't checked recently, but I did try it out a couple of years ago, see #192 (comment). But I think those numbers may have been from before enabling LLVM_LINK_LLVM_DYLIB, as the time difference between ThinLTO and a regular build shouldn't be all that dramatic. Anyway, back then the performance difference between FullLTO and ThinLTO was around 1%, so it's probably not worth it. Especially when building on machines with more than a few cores, the difference in build time is staggering.

Also, I remember hearing discussions within LLVM mentioning that ThinLTO and FullLTO use slightly different optimization pipelines, and that much less attention is spent on the FullLTO pipeline.

So with that, a Frontend instrumentation profile seems very reasonable to use as the single profile for all the optimized builds.

This is a VERY interesting and useful observation: at least for the LLVM project, Frontend (FE) PGO is more "cross-platform" than IR PGO. I never benchmarked them in such scenarios - thank you! Even though the official Clang documentation recommends using IR PGO by default, it seems that's not always the best choice. By the way, maybe trying CSIR PGO would be worth it too.

I haven't spent time on trying that so far. I would expect it to behave the same as regular IR PGO with respect to profile reuse: when there are differences between the targets, the generated IR differs, so the hashes in the profile don't match, and the profile data gets discarded.

I'm not entirely satisfied with the name --full-lto either, suggestions welcome

Did you mean --full-pgo here? Because in your current PR I see only --full-pgo instead of --full-lto. FullLTO is definitely not a good choice :D (at least because that name is already taken - it's another name for FatLTO, huh).

Oh indeed, yes I meant --full-pgo.

Another unusual way of stretching the PGO build is if the instrumentation/profiling is done with a different version of the compiler than the one used for the optimized build.

Thank you for such tests too! I am always very cautious about reusing PGO profiles across different compiler versions, since who knows which LLVM internal changes between releases affect PGO profile quality, etc. It's just safer to generate the profiles per platform but cache them somewhere (and update them regularly if needed). However, here it seems it could be a good option.

Yep, it's of course best to have the profiling and target environments match as closely as possible. In the end I didn't end up using this though; all the PGO'd tools end up built with a stage 1 Clang of the same version.

Did you check these Clang PGO CMake build scripts in the LLVM repo? Even if you decide not to reuse them, maybe some kind of inspiration regarding other aspects could be found there: scripts internal structure, naming, etc.

I have looked at these, and at https://github.com/llvm/llvm-project/blob/llvmorg-20.1.7/llvm/utils/release/build_llvm_release.bat#L436-L447 as well. The former builds LLVMSupport, the latter builds one object file from Clang (although as a dependency that probably ends up building a fair bit of LLVM). For my case (building for 4 architectures and 2 configurations each), I prefer sticking to as few source files as possible, while getting as reasonable coverage as possible.

It skips components like LLDB and clang-tools-extra which aren't needed at this point

Not exactly related to your comment but maybe you would be interested in this one: llvm/llvm-project#63486 . Here I tried to apply PGO to other LLVM parts. Are these important enough for you or not - I don't know :)

Yeah, it all comes down to what use cases one can add to the training script. And while more coverage is of course better, I try to keep it as small as possible unless it measurably makes a difference. Your numbers for clangd do show an improvement of around 20% for that case. But I presume the majority of the work is spent inside regular Clang core functions, which are probably already covered by the profile we make by compiling regular C++ code. So while clangd isn't included in the instrumentation and training, it probably benefits from the PGO optimization of the common core code anyway.

@mstorsjo
Owner Author

If you want to squeeze a few percent more performance out of Linux builds, you might want to consider using LLVM's BOLT. The Rust language has been doing it for years, and while there were sometimes problems with it, they could be solved by temporarily (until the next major release) disabling some of the BOLT optimisations.

It's done in an analogous way to PGO and is basically a next stage.

Yep, I'm aware of it, but I'm not prioritizing it at the moment, as it only benefits one platform. (I haven't tried it out and familiarized myself with it yet. And even if it did support multiple platforms, I presume that it'd be like the IR PGO, and would require specific profiles for each target?)

If we wanted to spend a little more effort to get the Linux builds a little bit faster, we could do both IR and Frontend instrumentation/training, use the IR profile for the matching Linux build, and the Frontend profile for the rest, gaining around 6% on that single build configuration. But for simplicity/uniformity (and fairness? :-) ), I didn't opt to do that, at this stage at least.

@zamazan4ik

I presume that it'd be like the IR PGO, and would require specific profiles for each target

Haha, no worries there: BOLT supports only Linux, and primarily only x86-64. AFAIK, AArch64 is also supported, but I suppose it's much less tested compared to x86-64 :) I don't think these limitations will be resolved in the near future, since Meta is not much interested in investing in that; they developed BOLT mainly for optimizing their server fleet, which of course is mainly x86-64-based.

@dimula73

Hi, @mstorsjo!

I have tested this patch on the Krita build on Windows. Here are the performance results.

Legend:
  clang-20-pgo -- the build from this MR
  clang-20     -- the release build of Clang 20.1.6
  clang-18     -- the release build of Clang 18.1.8 used in Krita

target                           clang-18   clang-20   clang-20-pgo   speedup (pgo/no-pgo)
kritaimage                       1028s      947s       796s           16%
kritaimage+PCH                   ~600s      557s       472s           15%
full krita with unittests + PCH  4403s      3845s      3183s          17%
