I see non-deterministic hangs when running Pennant C++ when using gather copies to initialize the mesh on GPUs. Logs with -level dma=1 show that there are always more started copies than there are completed copies at the point that we hang. If I turn off the gather copies then the hangs go away (toggle the -DENABLE_GATHER_COPIES flag in the Makefile). I've eliminated the most common source of hangs by setting -gex:objcount to a very large value of 8192.
To reproduce the issue on sapling, download the current master branch of both Legion and Pennant and build with the default options in the Makefile. You'll need all 4 GPU nodes with 4 GPUs/node. Use the following script to submit to sbatch from the root of the Pennant C++ repo:
We shake loose some of the timing by deferring the completion of a transfer descriptor (xd), and we end up re-inserting the same xd at the front, thereby leaving out the ones that actually need to be completed first. I still need to run more tests to see if that's a valid fix.
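To make the suspected failure mode concrete, here is a minimal toy sketch of how re-inserting a deferred item at the front of a work queue can starve the item it is waiting on. This is not Realm's code: `ToyXd`, `Requeue`, and `run_queue` are hypothetical names invented for illustration, and the single-worker loop is an assumption about the general shape of the problem, not a description of the actual DMA scheduler.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical stand-in for a transfer descriptor; NOT Realm's actual type.
struct ToyXd {
  int id;
  int blocked_on;  // id of the xd that must complete first, or -1 if none
};

// Where a deferred xd is put back when it cannot complete yet.
enum class Requeue { Front, Back };

// Single-worker loop: pops the head of the queue, completes it if its
// dependency is done, otherwise re-inserts it according to `policy`.
// Stops after `max_steps` iterations so a starved queue cannot spin forever.
// Returns the ids in the order they completed.
std::vector<int> run_queue(std::deque<ToyXd> queue, Requeue policy,
                           std::size_t max_steps) {
  std::vector<int> completed;
  auto is_done = [&completed](int id) {
    for (int c : completed) {
      if (c == id) return true;
    }
    return false;
  };

  for (std::size_t step = 0; step < max_steps && !queue.empty(); ++step) {
    ToyXd xd = queue.front();
    queue.pop_front();
    if (xd.blocked_on < 0 || is_done(xd.blocked_on)) {
      completed.push_back(xd.id);
    } else if (policy == Requeue::Front) {
      // The same xd is picked again on the next step, so the xd it is
      // waiting on never reaches the head of the queue.
      queue.push_front(xd);
    } else {
      // Deferred xd goes behind the others, letting its dependency run.
      queue.push_back(xd);
    }
  }
  return completed;
}
```

With a queue holding xd 0 (blocked on xd 1) followed by xd 1, the `Requeue::Front` policy completes nothing within the step budget, while `Requeue::Back` completes xd 1 and then xd 0. In this toy model that matches the symptom of more started copies than completed ones.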
I believe that the patch fixes the hang. I see the crash from #1803 now non-deterministically, but it looks the same as what is already described in #1803.
To submit the reproduction script mentioned in the issue description, use sbatch -n 4 -N 4 --exclusive <script_name>. It doesn't always hang, but it was hanging ~80% of runs for me.