
Add parameters for controlling CArena defragmentation #4479


Merged
3 commits merged into AMReX-Codes:development on Jun 11, 2025

Conversation

@WeiqunZhang (Member) commented May 22, 2025

The defragmentation in #4451 has caused a performance regression for some applications on Frontier. For example, a Castro test had a 20% performance hit. The issue appears to be that the GPU-aware MPI on Frontier does not work well with defragmenting the comms arena. In this PR, we introduce a number of runtime parameters and disable defragmentation for the comms arena by default.
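For anyone who wants to experiment, a minimal sketch of setting such a toggle at runtime (the parameter name below is an assumption for illustration only; see the diff for the exact names this PR introduces):

amrex.the_comms_arena_try_defragment = 0   # hypothetical parameter name; keeps comms-arena defragmentation off (the new default)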

@WeiqunZhang (Member Author)

The issue appears to be with the_comms_arena.

@WeiqunZhang (Member Author)

On Frontier, GPU-aware MPI does not work well with defragmentation of the comms arena. This does not seem to be an issue on Perlmutter. If GPU-aware MPI is not used, it does not seem to matter whether defragmentation is enabled.

@WeiqunZhang marked this pull request as ready for review on May 22, 2025 01:52
@AlexanderSinn (Member)

Is there a reproducer that can be shared? I would like to test it on LUMI.

@WeiqunZhang (Member Author)

For the AMD architecture, you can either define the environment variable AMREX_AMD_ARCH or pass AMD_ARCH=gfx90a to make.
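For example, a sketch of a build for MI250X (gfx90a) with the GNU make system (the exact make line depends on the test's GNUmakefile; USE_HIP=TRUE is shown in case it is not already set there):

export AMREX_AMD_ARCH=gfx90a   # picked up by the AMReX make system
make -j 8 USE_HIP=TRUE AMD_ARCH=gfx90a   # or pass the architecture directly on the make line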

@WeiqunZhang (Member Author)

What I have observed on Perlmutter is interesting and puzzling. I ran the Castro test on 32 Perlmutter nodes. #4451 made "Run time without initialization" decrease from 16.9 to 14.0. According to TinyProfiler, the difference came entirely from ParallelCopy. So #4451's defragmentation made parallel communication faster on Perlmutter, but slower on Frontier. The test ran for 10 steps. Before #4451, the times per step on Perlmutter were:

[STEP 1] Coarse TimeStep time: 1.439480167
[STEP 2] Coarse TimeStep time: 4.257303924
[STEP 3] Coarse TimeStep time: 1.566155951
[STEP 4] Coarse TimeStep time: 1.238451829
[STEP 5] Coarse TimeStep time: 1.4742941
[STEP 6] Coarse TimeStep time: 1.306413112
[STEP 7] Coarse TimeStep time: 1.485467906
[STEP 8] Coarse TimeStep time: 1.319416153
[STEP 9] Coarse TimeStep time: 1.476840293
[STEP 10] Coarse TimeStep time: 1.304275601

After #4451,

[STEP 1] Coarse TimeStep time: 1.439000445
[STEP 2] Coarse TimeStep time: 1.266900538
[STEP 3] Coarse TimeStep time: 1.557451678
[STEP 4] Coarse TimeStep time: 1.236646927
[STEP 5] Coarse TimeStep time: 1.482054124
[STEP 6] Coarse TimeStep time: 1.320941828
[STEP 7] Coarse TimeStep time: 1.488473038
[STEP 8] Coarse TimeStep time: 1.33447745
[STEP 9] Coarse TimeStep time: 1.47504206
[STEP 10] Coarse TimeStep time: 1.318964848

The difference comes from Step 2.

I also looked at the Coarse TimeStep times on Frontier. Before #4451:

[STEP 1] Coarse TimeStep time: 4.136739239
[STEP 2] Coarse TimeStep time: 2.851410225
[STEP 3] Coarse TimeStep time: 3.259291452
[STEP 4] Coarse TimeStep time: 2.778606403
[STEP 5] Coarse TimeStep time: 3.112101691
[STEP 6] Coarse TimeStep time: 2.875779648
[STEP 7] Coarse TimeStep time: 3.14495843
[STEP 8] Coarse TimeStep time: 2.895178096
[STEP 9] Coarse TimeStep time: 3.138442613
[STEP 10] Coarse TimeStep time: 2.87714859

After #4451,

[STEP 1] Coarse TimeStep time: 3.94065505
[STEP 2] Coarse TimeStep time: 3.289660751
[STEP 3] Coarse TimeStep time: 3.718597346
[STEP 4] Coarse TimeStep time: 3.434589869
[STEP 5] Coarse TimeStep time: 3.70037222
[STEP 6] Coarse TimeStep time: 3.43899539
[STEP 7] Coarse TimeStep time: 3.678903421
[STEP 8] Coarse TimeStep time: 3.433382475
[STEP 9] Coarse TimeStep time: 3.705777435
[STEP 10] Coarse TimeStep time: 3.483618459

There was nothing special about Step 2 on Frontier.

@zingale @AlexanderSinn

@AlexanderSinn (Member) commented May 23, 2025

What I have previously observed on the JUWELS Booster system is that it takes a long time for the MPI library (OpenMPI) to register a piece of GPU memory. This registration can be done per CUDA allocation. My solution was to use a large arena init size so that there is only one allocation, set environment variables to force registration of the full allocation, and add a dummy MPI call in the init so that all ranks can register memory at the same time:

export UCX_CUDA_COPY_REG_WHOLE_ALLOC=on
export UCX_CUDA_COPY_MAX_REG_RATIO=0
export UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda
export UCX_MEMTYPE_CACHE=n
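The large-init-size part of this corresponds to the AMReX arena init-size runtime parameter; a sketch with a placeholder value (pick something large enough that the arena never has to grow):

amrex.the_arena_init_size = 17179869184   # placeholder value (16 GiB) reserved up front so only one allocation is made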

@zingale (Member) commented May 23, 2025

On Perlmutter we often need to set the initial arena size to 0, or else we crash.

@AlexanderSinn (Member)

Crash from out of memory, or from GPU-aware MPI?

@zingale (Member) commented May 23, 2025

out of memory

@AlexanderSinn (Member)

I did some tests on LUMI, but I never saw a 20% performance difference across everything I tested, only about 7%. The environment variables MPICH_GPU_EAGER_DEVICE_MEM=1 and FI_MR_CACHE_MONITOR=memhooks seemed interesting, but I am not sure whether they improved performance in the end. Setting the comms arena init size to 3 GB, so that it never needs extra allocations, seemed to work. Considering this, it is good to have the option to disable defragmentation, as it gives more things to try when chasing performance.
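Collected in one place, the settings mentioned above look roughly like this (the comms-arena init-size parameter name is an assumption on my part; 3 GB is the value from the test above):

export MPICH_GPU_EAGER_DEVICE_MEM=1
export FI_MR_CACHE_MONITOR=memhooks
# assumed parameter name for the comms arena init size, set in the inputs file or on the command line
amrex.the_comms_arena_init_size = 3221225472   # roughly the 3 GB mentioned above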

@atmyers merged commit 2d47d39 into AMReX-Codes:development on Jun 11, 2025
75 checks passed