
analyse intel_transport_recv.h at line 1160: cma_read_nbytes == size assert #612

Open
lslusarczyk opened this issue Oct 31, 2023 · 7 comments

@lslusarczyk
Contributor

Update: this is a bug in Intel MPI, tracked in Jira: https://jira.devtools.intel.com/browse/IMPI-4619

When running on devcloud:
ctest -R mhp-sycl-sort-tests-3

on branch https://github.com/lslusarczyk/distributed-ranges/tree/mateusz_sort_expose_mpi_assert

we hit:

Assertion failed in file ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_recv.h at line 1160: cma_read_nbytes == size
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x14c1a5a7236c]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x14c1a5429131]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(+0xb22e38) [0x14c1a5922e38]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(+0xb1fa41) [0x14c1a591fa41]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(+0xb1cd4d) [0x14c1a591cd4d]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(+0x2f58b4) [0x14c1a50f58b4]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(PMPI_Wait+0x41f) [0x14c1a56816af]
./mhp-tests() [0x5c863d]
./mhp-tests() [0x58e124]
./mhp-tests() [0x6cdd0c]
./mhp-tests() [0x75676c]
./mhp-tests() [0x7374c5]
./mhp-tests() [0x738b33]
./mhp-tests() [0x73974f]
./mhp-tests() [0x74df0f]
./mhp-tests() [0x74cfcb]
./mhp-tests() [0x472f7f]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x14c1a3ce3d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x14c1a3ce3e40]
./mhp-tests() [0x46f005]
Abort(1) on node 0: Internal error

Some links to useful Intel MPI documentation, tips, and hacks:

Intel® MPI for GPU Clusters - article
https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-2/intel-mpi-for-gpu-clusters.html

Environment variables that influence how GPU support works:

https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-10/gpu-support.html
https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-10/gpu-buffers-support.html
https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-10/gpu-pinning.html

Still, I found a tip for solving the problem here:
https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/intel-mpi-error-line-1334-cma-read-nbytes-size/m-p/1329220

export I_MPI_SHM_CMA=0 helped in some cases (though the behaviour does not seem fully deterministic; it may depend on which devcloud node is assigned for execution).
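
For reference, a minimal sketch of re-running the failing test from above with this workaround (assuming the same branch and build tree):

export I_MPI_SHM_CMA=0   # disable the CMA path in the shm transport
ctest -R mhp-sycl-sort-tests-3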

People had similar problems in the past:
https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/Intel-oneAPI-2021-4-SHM-Issue/m-p/1324805

When setting the env vars to:

export I_MPI_FABRICS=shm
export I_MPI_SHM_CMA=0
export I_MPI_OFFLOAD=1

You may also encounter:

Assertion failed in file ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_send.h at line 2012: FALSE
...

Still, the simple solution of copying memory from device to host is counterproductive, as IMPI supports GPU-to-GPU communication
(see https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-10/gpu-buffers-support.html#SECTION_3F5D70BDEFF84E3A84325A319BA53536).

@lslusarczyk
Contributor Author

Blocked: unable to install MPI 2021.11. We don't know where to get it from.

@rscohn2
Member

rscohn2 commented Oct 31, 2023

Blocked: unable to install MPI 2021.11. We don't know where to get it from.

It's here: http://anpfclxlin02.an.intel.com/rscohn1/

@mateuszpn
Contributor

The problem with the assert in intel_transport_send.h at line 2012 is solved in IMPI 2021.11 (tested on devcloud, with IMPI 2021.11 installed in the home dir).

@rscohn2
Member

rscohn2 commented Nov 7, 2023

2021.11 will be published on 11/17

@mateuszpn
Contributor

mateuszpn commented Nov 7, 2023

I_MPI_OFFLOAD=0 mpirun -n 2 ./build/benchmarks/gbench/mhp/mhp-bench --sycl --benchmark_filter=Sort_DR

fails with:

Assertion failed in file ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_recv.h at line 1175: cma_read_nbytes == size

However, with I_MPI_OFFLOAD=1 (which should be used with IMPI on GPU), execution of the Sort benchmark is successful (devcloud, single server, multi-GPU; IMPI 2021.11 private install).
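
For clarity, the corresponding passing invocation (the same command as above with only the offload flag changed) would be:

I_MPI_OFFLOAD=1 mpirun -n 2 ./build/benchmarks/gbench/mhp/mhp-bench --sycl --benchmark_filter=Sort_DR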

@mateuszpn mateuszpn moved this from 🏗 In progress to 👀 In review in Distributed-Ranges Project Nov 7, 2023
@rscohn2
Member

rscohn2 commented Nov 7, 2023

This is how I set I_MPI_OFFLOAD for the device memory tests:

function(add_mhp_offload_ctest test_name name processes)
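
The snippet is cut off at the signature in the issue; a minimal sketch of what such a CMake helper could look like (the body below is an assumption for illustration, not the actual distributed-ranges code):

function(add_mhp_offload_ctest test_name name processes)
  # Illustrative body only: the real helper presumably also uses ${name}
  # to select which test case to run. Here we just register an
  # MPI-launched test and force I_MPI_OFFLOAD=1 in its environment.
  add_test(NAME ${test_name}
           COMMAND mpirun -n ${processes} ./mhp-tests)
  set_tests_properties(${test_name} PROPERTIES
           ENVIRONMENT "I_MPI_OFFLOAD=1")
endfunction()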

@rscohn2
Member

rscohn2 commented Nov 7, 2023

I was told that for the 2021.11 release we can set I_MPI_OFFLOAD=1 all the time and it will not cause an error. I will get rid of this function and set I_MPI_OFFLOAD=1 in the CI script.
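
A minimal sketch of what that could look like in a shell-based CI script (the script name and surrounding steps are hypothetical):

# somewhere in the CI test script, before running the tests
export I_MPI_OFFLOAD=1
ctest --output-on-failure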

@lslusarczyk lslusarczyk moved this from 👀 In review to ✅ Done in Distributed-Ranges Project Nov 8, 2023