Realm Gather Copy Hang #1802

Open
lightsighter opened this issue Dec 6, 2024 · 2 comments
Labels: bug, Realm

Comments

@lightsighter
Contributor

lightsighter commented Dec 6, 2024

I see non-deterministic hangs when running Pennant C++ with gather copies used to initialize the mesh on GPUs. Logs with -level dma=1 show that there are always more started copies than completed copies at the point where we hang. If I turn off the gather copies (toggle the -DENABLE_GATHER_COPIES flag in the Makefile), the hangs go away. I've eliminated the most common source of hangs by setting -gex:obcount to a very large value of 8192.

mebauer@sapling2:~/pennant-legion$ grep -rI 'started' dma_*log | wc
    213    2769   31155
mebauer@sapling2:~/pennant-legion$ grep -rI 'completed' dma_*log | wc
    185    2405   27185
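
The first column of each wc output is the line count, so the logs above show 213 started copies against 185 completed ones, i.e. 28 copies still outstanding across all ranks. A rough helper for breaking that gap down per log file might look like the sketch below. It assumes, as the grep commands above do, that every copy logs one line containing "started" and one containing "completed"; the actual log format is not shown here, and report_gap is a hypothetical name.

```shell
#!/bin/bash
# Sketch only: counts matching lines per file, mirroring the grep | wc above.
# Assumption: one "started" line and one "completed" line per copy.

# report_gap FILE... -> one summary line per log file
report_gap() {
  local f started completed
  for f in "$@"; do
    # grep -c prints the number of matching lines (0 if none match)
    started=$(grep -c -- 'started' "$f")
    completed=$(grep -c -- 'completed' "$f")
    echo "$f started=$started completed=$completed outstanding=$((started - completed))"
  done
}

# Example usage from the directory holding the logs:
#   report_gap dma_*log
```

A non-zero "outstanding" value for a given file would point at the rank whose copies never finished.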

To reproduce the issue on sapling, download the current master branch of both Legion and Pennant and build with the default options in the Makefile. You'll need all 4 GPU nodes with 4 GPUs/node. Use the following script to submit to sbatch from the root of the Pennant C++ repo:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --time=00:30:00

root_dir="$PWD"

export LD_LIBRARY_PATH="$PWD"

export GASNET_PHYSMEM_MAX=16G

export REALM_BACKTRACE=1

ulimit -S -c 0 # disable core dumps

export LEGION_DEFAULT_ARGS="-ll:gpu 4 -ll:util 2 -ll:bgwork 2 -ll:csize 15000 -ll:fsize 14000 -ll:zsize 1024 -ll:rsize 512 -ll:gsize 0 -gex:obcount 8192 -lg:prof 1 -lg:prof_logfile $root_dir/prof_%.log -level dma=1,xplan=1,newdma=1 -logfile dma_%.log"

srun -n 4 -N 4 --ntasks-per-node 1 --cpu_bind none "$root_dir/pennant" -f "$root_dir"/test/leblanc/leblanc.pnt -n 16

Submit that script with sbatch -n 4 -N 4 --exclusive <script_name>. It doesn't always hang, but it hung in ~80% of runs for me.

@apryakhin
Contributor

Tentative patch in-progress for that one:

We shake loose some of the timing by deferring the completion of a transfer descriptor (xd), and we end up re-inserting the same xd at the front, thereby leaving out the xds that actually need to be completed first. I still need to run more tests to see whether that's a valid fix.
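
To make the failure mode concrete, here is a toy model of that kind of starvation. It is not Realm code and not the patch. It assumes a single work queue holding three hypothetical items A, B and C, where A cannot complete until B and C have been serviced. Re-inserting A at the front means B and C are never reached; re-inserting it at the back lets the queue drain. The names run_queue, "front" and "back" are made up for the illustration.

```shell
#!/bin/bash
# Toy model only: illustrates starvation from front re-insertion.
# Assumption: A can only complete after B and C have been serviced.

# run_queue POLICY -> prints the service order and whether the queue drained
run_queue() {
  local policy="$1"
  local queue=(A B C)
  local done_b=0 done_c=0 order="" step head
  for step in 1 2 3 4 5 6 7 8; do        # bounded so the "hang" terminates
    [ "${#queue[@]}" -eq 0 ] && break
    head="${queue[0]}"
    queue=("${queue[@]:1}")               # pop the front item
    order+="$head"
    case "$head" in
      B) done_b=1 ;;
      C) done_c=1 ;;
      A)
        if [ "$done_b" -eq 1 ] && [ "$done_c" -eq 1 ]; then
          :                               # dependencies met, A completes
        elif [ "$policy" = "front" ]; then
          queue=("A" "${queue[@]}")       # deferred A jumps ahead of B and C
        else
          queue=("${queue[@]}" "A")       # deferred A waits behind B and C
        fi
        ;;
    esac
  done
  if [ "${#queue[@]}" -eq 0 ]; then
    echo "$order drained"
  else
    echo "$order stuck"
  fi
}

run_queue front   # prints: AAAAAAAA stuck
run_queue back    # prints: ABCA drained
```

In the "front" case the started work is never matched by completions, which is at least consistent with the started/completed gap seen in the logs.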

@lightsighter
Contributor Author

I believe that the patch fixes the hang. I see the crash from #1803 now non-deterministically, but it looks the same as what is already described in #1803.
