Segfault when choosing ofi btl in One-Sided mode #13068

Open

angainor opened this issue Jan 28, 2025 · 0 comments

angainor commented Jan 28, 2025

This is probably not so important, but it kept me stuck debugging for some time, so here it is: OpenMPI 5.0.6 segfaults in an MPI_Send operation when using the ofi btl with btl_ofi_mode=0 (the default, One-Sided only). It should report an error instead.

mpirun -mca pml ob1 -mca btl ofi,self -mca opal_common_ofi_provider_include "cxi" -prtemca ras_base_launch_orted_on_hn 1 -np 2 -map-by node ./mpitest
0/2 nid006934
rank 0 send nid006934
1/2 nid006935
[nid006934:44477] *** Process received signal ***
[nid006934:44477] Signal: Segmentation fault (11)
[nid006934:44477] Signal code: Address not mapped (1)
[nid006934:44477] Failing at address: 0xc8
[nid006934:44477] [ 0] /lib64/libpthread.so.0(+0x16910)[0x1515640f4910]
[nid006934:44477] [ 1] /users/makrotki/software/openmpi-lnx/lib/libmpi.so.40(+0x281e16)[0x1515643a1e16]
[nid006934:44477] [ 2] /users/makrotki/software/openmpi-lnx/lib/libmpi.so.40(mca_pml_ob1_send+0x5e1)[0x1515643a4291]
[nid006934:44477] [ 3] /users/makrotki/software/openmpi-lnx/lib/libmpi.so.40(MPI_Send+0x123)[0x1515641f4f23]

The segfault occurs in pml_ob1_isend.c, in the following code:

    if( NULL == bml_btl || NULL == bml_btl->btl->btl_sendi)
        return OMPI_ERR_NOT_AVAILABLE;

The reason is that bml_btl->btl is NULL, but it is dereferenced via bml_btl->btl->btl_sendi. I added a check here (|| NULL == bml_btl->btl) so the code no longer segfaults, but then it hangs waiting for completion.

Running the same command with OpenMPI 4.1.7 yields a proper error message:

mpirun -mca pml ob1 -mca btl ofi,self -mca opal_common_ofi_provider_include "cxi" -np 2 -map-by node ./mpitest
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[37854,1],0]) is on host: nid006934
  Process 2 ([[37854,1],1]) is on host: nid006935
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------

The code works when I specify btl_ofi_mode=1 or btl_ofi_mode=2.
