Perlmutter/Frontier Test Failure using CUDA aware MPI #102

Open
Angelyr opened this issue Jun 9, 2024 · 4 comments
Labels: bug (Something isn't working), cuda, mpi

Comments

@Angelyr
Collaborator

Angelyr commented Jun 9, 2024

warp_test_parallel & rc_field_testp are failing with the following error:

(GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 148
MPICH ERROR [Rank 0] [job id 26603385.1] [Sat Jun  8 16:52:33 2024] [nid001264] - Abort(942257666) (rank 0 in comm 0): Fatal error in PMPI_Isend: Invalid count, error stack:
PMPI_Isend(161)......................: MPI_Isend(buf=0x3200fb5c0, count=9000, MPI_DOUBLE, dest=1, tag=42, comm=0xc4000004, request=0x3bbfdf4) failed
MPID_Isend(584)......................: 
MPIDI_isend_unsafe(136)..............: 
MPIDI_SHM_mpi_isend(323).............: 
MPIDI_CRAY_Common_lmt_isend(84)......: 
MPIDI_CRAY_Common_lmt_export_mem(103): 
(unknown)(): Invalid count

aborting job:
Fatal error in PMPI_Isend: Invalid count, error stack:
PMPI_Isend(161)......................: MPI_Isend(buf=0x3200fb5c0, count=9000, MPI_DOUBLE, dest=1, tag=42, comm=0xc4000004, request=0x3bbfdf4) failed
MPID_Isend(584)......................: 
MPIDI_isend_unsafe(136)..............: 
MPIDI_SHM_mpi_isend(323).............: 
MPIDI_CRAY_Common_lmt_isend(84)......: 
MPIDI_CRAY_Common_lmt_export_mem(103): 
(unknown)(): Invalid count

Both tests output valid results for a while before failing. Follow these instructions to reproduce:

environment script:

export root=$PWD
module load PrgEnv-gnu
module load cudatoolkit
module load cmake

export kk=$root/build-kokkos/install   # This is where kokkos will be (or is) installed
export oh=$root/build-omega_h/install  # This is where omega_h will be (or is) installed
export CMAKE_PREFIX_PATH=$kk:$kk/lib64/cmake:$oh:$CMAKE_PREFIX_PATH
export MPICH_CXX=$root/kokkos/bin/nvcc_wrapper

export SLURM_CPU_BIND="cores"

install script:

#kokkos
rm ${kk%%install} -rf
rm kokkos -rf
git clone -b 4.2.00 https://github.com/kokkos/kokkos.git
mkdir -p $kk
cmake -S kokkos -B ${kk%%install} \
  -DCMAKE_INSTALL_PREFIX=$kk \
  -DCMAKE_BUILD_TYPE="Release" \
  -DCMAKE_CXX_COMPILER=$root/kokkos/bin/nvcc_wrapper \
  -DKokkos_ARCH_AMPERE80=ON \
  -DKokkos_ENABLE_SERIAL=ON \
  -DKokkos_ENABLE_OPENMP=off \
  -DKokkos_ENABLE_CUDA=on \
  -DKokkos_ENABLE_CUDA_LAMBDA=on \
  -DKokkos_ENABLE_DEBUG=off
cmake --build ${kk%%install} -j 24 --target install

#omegah
rm ${oh%%install} -rf
rm omega_h -rf
mkdir -p $oh
git clone https://github.com/SCOREC/omega_h.git
cd omega_h && git checkout $1 && cd -  # $1: the omega_h branch or commit to build
cmake -S omega_h -B ${oh%%install} \
  -DCMAKE_INSTALL_PREFIX=$oh \
  -DCMAKE_BUILD_TYPE=Release \
  -DBUILD_SHARED_LIBS=off \
  -DOmega_h_USE_Kokkos=ON \
  -DOmega_h_USE_CUDA=on \
  -DOmega_h_CUDA_ARCH=80 \
  -DOmega_h_USE_MPI=on  \
  -DMPIEXEC_EXECUTABLE=srun \
  -DBUILD_TESTING=on  \
  -DCMAKE_C_COMPILER=cc \
  -DCMAKE_CXX_COMPILER=CC \
  -DOmega_h_USE_CUDA_AWARE_MPI=ON \
  -DKokkos_PREFIX=$kk/lib64/cmake
cmake --build ${oh%%install} -j 24 --target install
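
The two scripts above can be run as follows (a hypothetical invocation, assuming they are saved as env.sh and install.sh; the first argument to install.sh is the omega_h branch or commit to build):

source env.sh
./install.sh <omega_h-branch-or-commit>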

allocation script:

salloc --time 00:20:00 --constraint gpu --qos=interactive --nodes=1 --ntasks-per-node=40 --cpus-per-task=1 --gpus=1 --account=XXX
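
With that allocation, the failing tests can be run from the omega_h build tree, for example (a sketch, assuming the build directory from the install script above; MPICH_GPU_SUPPORT_ENABLED=1 turns on GPU-aware MPI in Cray MPICH):

export MPICH_GPU_SUPPORT_ENABLED=1
cd ${oh%%install}
ctest -R 'warp_test_parallel|rc_field_testp' --output-on-failure
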
@cwsmith added the bug (Something isn't working), mpi, and cuda labels on Jun 10, 2024
@cwsmith changed the title from "Perlmutter/Frontier Test Failure" to "Perlmutter/Frontier Test Failure using CUDA aware MPI" on Jun 14, 2024
@CKegel

CKegel commented Jun 14, 2024

I ran into the same error; below you'll find my stack trace.

#1  0x00007f60890553e5 in abort () from /lib64/libc.so.6
#2  0x00007f608a094fed in MPID_Abort.cold () from /opt/cray/pe/lib64/libmpi_gnu_123.so.12
#3  0x00007f608ba37fe8 in MPIR_Handle_fatal_error () from /opt/cray/pe/lib64/libmpi_gnu_123.so.12
#4  0x00007f608ba38113 in MPIR_Err_return_comm () from /opt/cray/pe/lib64/libmpi_gnu_123.so.12
#5  0x00007f608a7111e2 in PMPI_Isend () from /opt/cray/pe/lib64/libmpi_gnu_123.so.12
#6  0x000000000047b3df in Omega_h::Neighbor_ialltoallv (sources=..., destinations=..., width=1, sendbuf=0x624290a80, sdispls=0x3477240, sendtype=1275069445, recvbuf=0x6244e9680, rdispls=0x3477380, recvtype=1275069445, comm=-2080374779, 
    sendbuf_size=307480, recvbuf_size=307250) at /pscratch/sd/c/ckegel/Dist-Sync-Testing/debug/omega_h/src/Omega_h_comm.cpp:350
#7  0x000000000047db3b in Omega_h::Comm::alltoallv<int> (this=0x30c7540, sendbuf_dev=..., sdispls_dev=..., rdispls_dev=..., width=1) at /pscratch/sd/c/ckegel/Dist-Sync-Testing/debug/omega_h/src/Omega_h_comm.cpp:552
#8  0x0000000000775217 in Omega_h::Dist::exch<int> (this=0x1a7f630, data=..., width=1) at /pscratch/sd/c/ckegel/Dist-Sync-Testing/debug/omega_h/src/Omega_h_dist.cpp:118
#9  0x00000000007713f4 in Omega_h::Dist::set_dest_idxs (this=0x1a7f630, fitems2rroots=..., nrroots=307480) at /pscratch/sd/c/ckegel/Dist-Sync-Testing/debug/omega_h/src/Omega_h_dist.cpp:78
#10 0x0000000000770af6 in Omega_h::Dist::Dist (this=0x1a7f630, comm_in=..., fitems2rroots=..., nrroots=307480) at /pscratch/sd/c/ckegel/Dist-Sync-Testing/debug/omega_h/src/Omega_h_dist.cpp:23
#11 0x00000000004ce196 in std::_Construct<Omega_h::Dist, std::shared_ptr<Omega_h::Comm>&, Omega_h::Remotes&, int> (__p=0x1a7f630) at /usr/include/c++/12/bits/stl_construct.h:119
#12 0x00000000004c9dc4 in std::allocator_traits<std::allocator<void> >::construct<Omega_h::Dist, std::shared_ptr<Omega_h::Comm>&, Omega_h::Remotes&, int> (__p=0x1a7f630) at /usr/include/c++/12/bits/alloc_traits.h:635
#13 0x00000000004c7faa in std::_Sp_counted_ptr_inplace<Omega_h::Dist, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<std::shared_ptr<Omega_h::Comm>&, Omega_h::Remotes&, int> (this=0x1a7f620, __a=...)
    at /usr/include/c++/12/bits/shared_ptr_base.h:604
#14 0x00000000004c62fe in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<Omega_h::Dist, std::allocator<void>, std::shared_ptr<Omega_h::Comm>&, Omega_h::Remotes&, int> (this=0x7ffd863d7698, __p=@0x7ffd863d7690: 0x0, 
    __a=...) at /usr/include/c++/12/bits/shared_ptr_base.h:971
#15 0x00000000004c3de0 in std::__shared_ptr<Omega_h::Dist, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<void>, std::shared_ptr<Omega_h::Comm>&, Omega_h::Remotes&, int> (this=0x7ffd863d7690, __tag=...)
    at /usr/include/c++/12/bits/shared_ptr_base.h:1711
#16 0x00000000004c2135 in std::shared_ptr<Omega_h::Dist>::shared_ptr<std::allocator<void>, std::shared_ptr<Omega_h::Comm>&, Omega_h::Remotes&, int> (this=0x7ffd863d7690, __tag=...) at /usr/include/c++/12/bits/shared_ptr.h:463
#17 0x00000000004bfc26 in std::make_shared<Omega_h::Dist, std::shared_ptr<Omega_h::Comm>&, Omega_h::Remotes&, int> () at /usr/include/c++/12/bits/shared_ptr.h:1009
#18 0x00000000004ad949 in Omega_h::Mesh::ask_dist (this=0x7ffd863d7bd0, ent_dim=0) at /pscratch/sd/c/ckegel/Dist-Sync-Testing/debug/omega_h/src/Omega_h_mesh.cpp:551
#19 0x00000000004a89d1 in Omega_h::Mesh::set_comm (this=0x7ffd863d7bd0, new_comm=...) at /pscratch/sd/c/ckegel/Dist-Sync-Testing/debug/omega_h/src/Omega_h_mesh.cpp:69
#20 0x0000000000487177 in Omega_h::binary::read (path=..., comm=..., mesh=0x7ffd863d7bd0, strict=false) at /pscratch/sd/c/ckegel/Dist-Sync-Testing/debug/omega_h/src/Omega_h_file.cpp:598
#21 0x000000000040d6c6 in main (argc=3, argv=0x7ffd863d8c88) at /pscratch/sd/c/ckegel/Dist-Sync-Testing/meshFieldsDist/testDist.cpp:21

I am running this test (https://github.com/SCOREC/meshFieldsDist/blob/performance_testing/testDist.cpp) with CUDA-aware MPI enabled and submitting my job with the following script:

#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --constraint=gpu
#SBATCH --qos=debug
#SBATCH --account=XXXXX
#SBATCH --gpus-per-node=4
# set up for problem & define any environment variables here
export MPICH_GPU_SUPPORT_ENABLED=1
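
For reference, a launch line to pair with that header might look like the following (a sketch only; the testDist arguments are placeholders, not the actual inputs used):

srun -n 4 --gpus-per-node=4 ./testDist <input-mesh.osh> <second-argument>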

Note: I've noticed this only occurs for meshes over approximately 600k elements.

@CKegel

CKegel commented Jun 14, 2024

I also tried using CUDA_LAUNCH_BLOCKING and observed the same behavior.

@CKegel

CKegel commented Jul 1, 2024

@Angelyr - after interfacing with NERSC support, it appears that the Kokkos option Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC is known to break CUDA-aware MPI on Perlmutter. Try building Kokkos with Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF. I have verified this fix with both the warp test and the meshFields code I have been working on.
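
For anyone else hitting this, the Kokkos configure from the install script above needs only the one extra flag (a sketch; everything except the added -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF line is unchanged from the original configure):

cmake -S kokkos -B ${kk%%install} \
  -DCMAKE_INSTALL_PREFIX=$kk \
  -DCMAKE_BUILD_TYPE="Release" \
  -DCMAKE_CXX_COMPILER=$root/kokkos/bin/nvcc_wrapper \
  -DKokkos_ARCH_AMPERE80=ON \
  -DKokkos_ENABLE_SERIAL=ON \
  -DKokkos_ENABLE_OPENMP=off \
  -DKokkos_ENABLE_CUDA=on \
  -DKokkos_ENABLE_CUDA_LAMBDA=on \
  -DKokkos_ENABLE_DEBUG=off \
  -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF
cmake --build ${kk%%install} -j 24 --target install

Omega_h (and anything else linking against this Kokkos install) will likely need to be rebuilt afterwards.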

@cwsmith

cwsmith commented Jul 1, 2024

Note: there was a significant performance penalty with Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF relative to the configuration with CUDA-aware MPI disabled in Omega_h and Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=ON. We are checking whether enabling the Omega_h memory pool will recover the lost performance.
