Add parameters for controlling CArena defragmentation #4479
Conversation
The issue appears to be that on Frontier, GPU-aware MPI does not work well with defragmentation of the comms arena. This does not seem to be an issue on Perlmutter. If GPU-aware MPI is not used, defragmentation or not does not seem to matter. |
Is there a reproducer that can be shared? I would like to test it on LUMI |
You need to clone https://github.com/AMReX-Astro/Castro and https://github.com/AMReX-Astro/Microphysics. Then you can use https://github.com/WeiqunZhang/amrex-benchmarks/tree/main/castro/frontier as a template. |
For AMD arch, you can either define an environment variable |
What I have observed on Perlmutter is interesting and puzzling. I ran the Castro test on 32 Perlmutter nodes. #4451 made "Run time without initialization" decrease from 16.9 to 14.0. The time difference came entirely from the difference in ParallelCopy according to TinyProfiler. So #4451's defragmentation made parallel communication faster on Perlmutter, but slower on Frontier. The test ran for 10 steps. Before #4451, the times per step on Perlmutter are shown below
After #4451,
The difference comes from Step 2. I also looked at the Coarse TimeStep times on Frontier. Before #4451
After #4451,
There was nothing special about Step 2 on Frontier. |
What I have previously observed on the JUWELS Booster system is that it takes a long time for the MPI library (Open MPI) to register a piece of GPU memory. This registration works per CUDA allocation. My solution was to use a large arena init size so that there is only one allocation, set environment variables to force registration of the full allocation, and add a dummy MPI call during init so that all ranks register memory at the same time.
|
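A minimal sketch of that workaround, assuming the standard `amrex.the_arena_init_size` runtime parameter and a plain `MPI_Sendrecv_replace` as the dummy call (the site-specific environment variables mentioned above are not reproduced here):

```cpp
#include <AMReX.H>
#include <AMReX_Arena.H>
#include <AMReX_ParallelDescriptor.H>
#include <AMReX_ParmParse.H>
#include <mpi.h>

int main (int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    // One big up-front device allocation (8 GiB here; adjust to the node) so
    // the MPI library only ever sees a single piece of GPU memory to register.
    // Equivalent inputs-file line: amrex.the_arena_init_size = 8589934592
    amrex::Initialize(argc, argv, true, MPI_COMM_WORLD, [] () {
        amrex::ParmParse pp("amrex");
        pp.add("the_arena_init_size", 8L*1024*1024*1024);
    });

    {
        // Dummy exchange right after init: every rank touches GPU-aware MPI
        // once, so the slow memory registration happens for all ranks at the
        // same time instead of inside the first real communication call.
        void* buf = amrex::The_Arena()->alloc(sizeof(int));
        int peer = (amrex::ParallelDescriptor::MyProc() + 1)
                   % amrex::ParallelDescriptor::NProcs();
        MPI_Sendrecv_replace(buf, 1, MPI_INT, peer, 0, peer, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        amrex::The_Arena()->free(buf);
    }

    // ... application ...

    amrex::Finalize();
    MPI_Finalize();
    return 0;
}
```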
On Perlmutter we often need to set the initial arena size to 0 or else we crash. |
Crash from out of memory or from GPU-aware MPI? |
Out of memory. |
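For reference, a sketch of that setting, again assuming the `amrex.the_arena_init_size` runtime parameter (setting it to 0 should skip the large up-front allocation and let the arena grow on demand):

```cpp
// Equivalent to putting "amrex.the_arena_init_size = 0" in the inputs file or
// on the command line; the arena then allocates lazily, avoiding the up-front
// out-of-memory failure described above.
amrex::Initialize(argc, argv, true, MPI_COMM_WORLD, [] () {
    amrex::ParmParse pp("amrex");
    pp.add("the_arena_init_size", 0);
});
```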
I did some tests on LUMI, but I never got a 20% performance difference with anything I tested, only 7%. The environment variables |
The defragmentation in #4451 has caused a performance regression for some applications on Frontier. For example, a Castro test took a 20% performance hit. The issue appears to be that GPU-aware MPI on Frontier does not work well with defragmenting the comms arena. In this PR, we introduce a number of runtime parameters and disable defragmentation for the comms arena by default.
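For illustration only: the actual parameter names are defined in this PR's diff and are not reproduced here; `the_comms_arena_defragment` below is a hypothetical stand-in showing how such a per-arena switch is typically read via ParmParse:

```cpp
#include <AMReX_ParmParse.H>

// Hypothetical per-arena defragmentation switch; the real names come from the
// PR and may differ. With the assumed default of false, the comms arena is not
// defragmented unless the user turns it back on, e.g. with
//   amrex.the_comms_arena_defragment = 1
bool comms_arena_defragment = false;

void read_arena_defrag_params ()
{
    amrex::ParmParse pp("amrex");
    pp.query("the_comms_arena_defragment", comms_arena_defragment);
}
```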