
analyse intel_transport_recv.h at line 1160: cma_read_nbytes == size assert #612

Open
lslusarczyk opened this issue Oct 31, 2023 · 7 comments

@lslusarczyk
Contributor

Update: this is a bug in Intel MPI, tracked in Jira: https://jira.devtools.intel.com/browse/IMPI-4619

When running on devcloud:
ctest -R mhp-sycl-sort-tests-3

on branch https://github.com/lslusarczyk/distributed-ranges/tree/mateusz_sort_expose_mpi_assert

we hit:

Assertion failed in file ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_recv.h at line 1160: cma_read_nbytes == size
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x14c1a5a7236c]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x14c1a5429131]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(+0xb22e38) [0x14c1a5922e38]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(+0xb1fa41) [0x14c1a591fa41]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(+0xb1cd4d) [0x14c1a591cd4d]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(+0x2f58b4) [0x14c1a50f58b4]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(PMPI_Wait+0x41f) [0x14c1a56816af]
./mhp-tests() [0x5c863d]
./mhp-tests() [0x58e124]
./mhp-tests() [0x6cdd0c]
./mhp-tests() [0x75676c]
./mhp-tests() [0x7374c5]
./mhp-tests() [0x738b33]
./mhp-tests() [0x73974f]
./mhp-tests() [0x74df0f]
./mhp-tests() [0x74cfcb]
./mhp-tests() [0x472f7f]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x14c1a3ce3d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x14c1a3ce3e40]
./mhp-tests() [0x46f005]
Abort(1) on node 0: Internal error

Some links to useful Intel MPI documentation, tips, and hacks:

Intel® MPI for GPU Clusters - article
https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-2/intel-mpi-for-gpu-clusters.html

Environment variables that influence how GPU support works:

https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-10/gpu-support.html
https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-10/gpu-buffers-support.html
https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-10/gpu-pinning.html

Still, I found a tip for solving the problem here:
https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/intel-mpi-error-line-1334-cma-read-nbytes-size/m-p/1329220

export I_MPI_SHM_CMA=0 helped in some cases (though the behaviour does not seem fully deterministic; it may depend on which devcloud node is assigned for execution).
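
For reference, a minimal sketch of re-running the failing test from above with this workaround (assuming the same branch and build tree):

export I_MPI_SHM_CMA=0   # disable the CMA path in the shm transport
ctest -R mhp-sycl-sort-tests-3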

People had similar problems in the past:
https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/Intel-oneAPI-2021-4-SHM-Issue/m-p/1324805

When setting the env vars to:

export I_MPI_FABRICS=shm
export I_MPI_SHM_CMA=0
export I_MPI_OFFLOAD=1

You may also encounter:

Assertion failed in file ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_send.h at line 2012: FALSE
...

Still, the simple solution of copying memory from device to host is counterproductive, as IMPI supports GPU-to-GPU communication
(see https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-10/gpu-buffers-support.html#SECTION_3F5D70BDEFF84E3A84325A319BA53536).

@lslusarczyk
Contributor Author

Blocked: unable to install MPI 2021.11. We don't know where to get it from.

@rscohn2
Member

rscohn2 commented Oct 31, 2023

Blocked: unable to install MPI 2021.11. We don't know where to get it from.

It's here: http://anpfclxlin02.an.intel.com/rscohn1/

@mateuszpn
Contributor

The problem with the assert in intel_transport_send.h at line 2012 is solved in IMPI 2021.11 (tested on devcloud, with IMPI 2021.11 installed in the home dir).

@rscohn2
Member

rscohn2 commented Nov 7, 2023

2021.11 will be published on 11/17

@mateuszpn
Contributor

mateuszpn commented Nov 7, 2023

I_MPI_OFFLOAD=0 mpirun -n 2 ./build/benchmarks/gbench/mhp/mhp-bench --sycl --benchmark_filter=Sort_DR

fails with:

Assertion failed in file ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_recv.h at line 1175: cma_read_nbytes == size

However, with I_MPI_OFFLOAD=1 (which should be used with IMPI on GPU), execution of the Sort benchmark is successful (devcloud, single server, multi-GPU; IMPI 2021.11 private install).
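
For clarity, the corresponding passing invocation (the same command as above with only the offload flag changed) would be:

I_MPI_OFFLOAD=1 mpirun -n 2 ./build/benchmarks/gbench/mhp/mhp-bench --sycl --benchmark_filter=Sort_DR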

@mateuszpn mateuszpn moved this from 🏗 In progress to 👀 In review in Distributed-Ranges Project Nov 7, 2023
@rscohn2
Member

rscohn2 commented Nov 7, 2023

This is how I set I_MPI_OFFLOAD for the device memory tests:

function(add_mhp_offload_ctest test_name name processes)
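
The snippet is cut off at the signature in the issue; a minimal sketch of what such a CMake helper could look like (the body below is an assumption for illustration, not the actual distributed-ranges code):

function(add_mhp_offload_ctest test_name name processes)
  # Illustrative body only: the real helper presumably also uses ${name}
  # to select which test case to run. Here we just register an
  # MPI-launched test and force I_MPI_OFFLOAD=1 in its environment.
  add_test(NAME ${test_name}
           COMMAND mpirun -n ${processes} ./mhp-tests)
  set_tests_properties(${test_name} PROPERTIES
           ENVIRONMENT "I_MPI_OFFLOAD=1")
endfunction()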

@rscohn2
Member

rscohn2 commented Nov 7, 2023

I was told that for the 2021.11 release we can set I_MPI_OFFLOAD=1 all the time and it will not cause an error. I will get rid of this function and set I_MPI_OFFLOAD=1 in the CI script.
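
A minimal sketch of what that could look like in a shell-based CI script (the script name and surrounding steps are hypothetical):

# somewhere in the CI test script, before running the tests
export I_MPI_OFFLOAD=1
ctest --output-on-failure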

@lslusarczyk lslusarczyk moved this from 👀 In review to ✅ Done in Distributed-Ranges Project Nov 8, 2023