
Add parameters for controlling CArena defragmentation #4479


Merged
3 commits merged into AMReX-Codes:development on Jun 11, 2025

Conversation

@WeiqunZhang (Member) commented May 22, 2025

The defragmentation in #4451 has caused a performance regression for some applications on Frontier. For example, a Castro test had a 20% performance hit. The issue appears to be that the GPU-aware MPI on Frontier does not work well with defragmenting the comms arena. In this PR, we introduce a number of runtime parameters and disable defragmentation for the comms arena by default.
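For anyone who wants to experiment, a minimal sketch of setting such a toggle at runtime (the parameter name below is an assumption for illustration only; see the diff for the exact names this PR introduces):

amrex.the_comms_arena_try_defragment = 0   # hypothetical parameter name; keeps comms-arena defragmentation off (the new default)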

@WeiqunZhang (Member Author)

The issue appears to be with the_comms_arena.

@WeiqunZhang (Member Author)

On Frontier, GPU-aware MPI does not work well with defragmentation of the comms arena. This does not seem to be an issue on Perlmutter. If GPU-aware MPI is not used, it does not seem to matter whether defragmentation is enabled.

@WeiqunZhang marked this pull request as ready for review on May 22, 2025 01:52
@AlexanderSinn (Member)

Is there a reproducer that can be shared? I would like to test it on LUMI.

@WeiqunZhang (Member Author)

For the AMD architecture, you can either define the environment variable AMREX_AMD_ARCH or pass AMD_ARCH=gfx90a to make.
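For example, a sketch of a build for MI250X (gfx90a) with the GNU make system (the exact make line depends on the test's GNUmakefile; USE_HIP=TRUE is shown in case it is not already set there):

export AMREX_AMD_ARCH=gfx90a   # picked up by the AMReX make system
make -j 8 USE_HIP=TRUE AMD_ARCH=gfx90a   # or pass the architecture directly on the make line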

@WeiqunZhang (Member Author)

What I have observed on Perlmutter is interesting and puzzling. I ran the Castro test on 32 Perlmutter nodes. #4451 made "Run time without initialization" decrease from 16.9 to 14.0. According to TinyProfiler, the difference came entirely from ParallelCopy. So #4451's defragmentation made parallel communication faster on Perlmutter, but slower on Frontier. The test ran for 10 steps. Before #4451, the times per step on Perlmutter were:

[STEP 1] Coarse TimeStep time: 1.439480167
[STEP 2] Coarse TimeStep time: 4.257303924
[STEP 3] Coarse TimeStep time: 1.566155951
[STEP 4] Coarse TimeStep time: 1.238451829
[STEP 5] Coarse TimeStep time: 1.4742941
[STEP 6] Coarse TimeStep time: 1.306413112
[STEP 7] Coarse TimeStep time: 1.485467906
[STEP 8] Coarse TimeStep time: 1.319416153
[STEP 9] Coarse TimeStep time: 1.476840293
[STEP 10] Coarse TimeStep time: 1.304275601

After #4451,

[STEP 1] Coarse TimeStep time: 1.439000445
[STEP 2] Coarse TimeStep time: 1.266900538
[STEP 3] Coarse TimeStep time: 1.557451678
[STEP 4] Coarse TimeStep time: 1.236646927
[STEP 5] Coarse TimeStep time: 1.482054124
[STEP 6] Coarse TimeStep time: 1.320941828
[STEP 7] Coarse TimeStep time: 1.488473038
[STEP 8] Coarse TimeStep time: 1.33447745
[STEP 9] Coarse TimeStep time: 1.47504206
[STEP 10] Coarse TimeStep time: 1.318964848

The difference comes from Step 2.

I also looked at the Coarse TimeStep times on Frontier. Before #4451:

[STEP 1] Coarse TimeStep time: 4.136739239
[STEP 2] Coarse TimeStep time: 2.851410225
[STEP 3] Coarse TimeStep time: 3.259291452
[STEP 4] Coarse TimeStep time: 2.778606403
[STEP 5] Coarse TimeStep time: 3.112101691
[STEP 6] Coarse TimeStep time: 2.875779648
[STEP 7] Coarse TimeStep time: 3.14495843
[STEP 8] Coarse TimeStep time: 2.895178096
[STEP 9] Coarse TimeStep time: 3.138442613
[STEP 10] Coarse TimeStep time: 2.87714859

After #4451,

[STEP 1] Coarse TimeStep time: 3.94065505
[STEP 2] Coarse TimeStep time: 3.289660751
[STEP 3] Coarse TimeStep time: 3.718597346
[STEP 4] Coarse TimeStep time: 3.434589869
[STEP 5] Coarse TimeStep time: 3.70037222
[STEP 6] Coarse TimeStep time: 3.43899539
[STEP 7] Coarse TimeStep time: 3.678903421
[STEP 8] Coarse TimeStep time: 3.433382475
[STEP 9] Coarse TimeStep time: 3.705777435
[STEP 10] Coarse TimeStep time: 3.483618459

There was nothing special about Step 2 on Frontier.

@zingale @AlexanderSinn

@AlexanderSinn (Member) commented May 23, 2025

What I have previously observed on the JUWELS Booster system is that it takes a long time for the MPI library (OpenMPI) to register a piece of GPU memory. This registration can be done per CUDA allocation. My solution was to use a large arena init size so that there is only one allocation, set environment variables to force registration of the full allocation, and add a dummy MPI call in the init so that all ranks can register memory at the same time:

export UCX_CUDA_COPY_REG_WHOLE_ALLOC=on
export UCX_CUDA_COPY_MAX_REG_RATIO=0
export UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda
export UCX_MEMTYPE_CACHE=n
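The large-init-size part of this corresponds to the AMReX arena init-size runtime parameter; a sketch with a placeholder value (pick something large enough that the arena never has to grow):

amrex.the_arena_init_size = 17179869184   # placeholder value (16 GiB) reserved up front so only one allocation is made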

@zingale (Member) commented May 23, 2025

On Perlmutter we often need to set the initial arena size to 0, or else we crash.

@AlexanderSinn (Member)

Crash from out of memory, or from GPU-aware MPI?

@zingale (Member) commented May 23, 2025

out of memory

@AlexanderSinn (Member)

I did some tests on LUMI, but I never saw a 20% performance difference across everything I tested, only about 7%. The environment variables MPICH_GPU_EAGER_DEVICE_MEM=1 and FI_MR_CACHE_MONITOR=memhooks seemed interesting, but I am not sure whether they improved performance in the end. Setting the comms arena init size to 3 GB, so that it never needs extra allocations, seemed to work. Considering this, it is good to have the option to disable defragmentation, as it gives more things to try when chasing performance.
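Collected in one place, the settings mentioned above look roughly like this (the comms-arena init-size parameter name is an assumption on my part; 3 GB is the value from the test above):

export MPICH_GPU_EAGER_DEVICE_MEM=1
export FI_MR_CACHE_MONITOR=memhooks
# assumed parameter name for the comms arena init size, set in the inputs file or on the command line
amrex.the_comms_arena_init_size = 3221225472   # roughly the 3 GB mentioned above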

@atmyers merged commit 2d47d39 into AMReX-Codes:development on Jun 11, 2025
75 checks passed