This is probably not so important, but it got me stuck for some time debugging, so here it is: Open MPI 5.0.6 segfaults in MPI_Send when using the ofi BTL with btl_ofi_mode=0 (the default, one-sided only). It should report an error instead.
The reason is that bml_btl->btl is NULL, but it is dereferenced in bml_btl->btl->btl_sendi. I added a check there (|| NULL == bml_btl->btl) so the code no longer segfaults, but then it hangs waiting for completion.
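To make the kind of guard I mean concrete, here is a minimal, self-contained sketch. It is not the actual Open MPI code: the mock_* types, the return codes, and try_send_inline are hypothetical stand-ins, and only the field names btl and btl_sendi come from the report above. It shows the idea of returning an explicit "unreachable" error when btl is NULL instead of dereferencing it.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real Open MPI structures.
 * Only the field names `btl` and `btl_sendi` are taken from the report. */
typedef struct mock_btl {
    /* Optional "send immediate" entry point; may be NULL. */
    int (*btl_sendi)(const void *buf, size_t len);
} mock_btl_t;

typedef struct mock_bml_btl {
    mock_btl_t *btl; /* NULL when no usable send path was set up */
} mock_bml_btl_t;

/* Hypothetical return codes, for illustration only. */
enum { SEND_OK = 0, SEND_UNREACHABLE = -1, SEND_NO_INLINE = -2 };

/* Check every pointer on the path before calling through it, and
 * report an error to the caller instead of crashing. */
static int try_send_inline(const mock_bml_btl_t *bml_btl,
                           const void *buf, size_t len)
{
    if (NULL == bml_btl || NULL == bml_btl->btl) {
        return SEND_UNREACHABLE; /* no BTL to send with */
    }
    if (NULL == bml_btl->btl->btl_sendi) {
        return SEND_NO_INLINE;   /* BTL exists but has no sendi */
    }
    return bml_btl->btl->btl_sendi(buf, len);
}

/* Dummy sendi implementation that always succeeds. */
static int fake_sendi(const void *buf, size_t len)
{
    (void)buf;
    (void)len;
    return SEND_OK;
}
```

A guard like this only avoids the crash. As noted above, the real code then hangs waiting for completion, so the error would presumably need to be reported earlier, the way 4.1.7 does.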
Running the same with 4.1.7 yields a valid error message:
mpirun -mca pml ob1 -mca btl ofi,self -mca opal_common_ofi_provider_include "cxi" -np 2 -map-by node ./mpitest
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[37854,1],0]) is on host: nid006934
Process 2 ([[37854,1],1]) is on host: nid006935
BTLs attempted: self
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
The code works when I specify btl_ofi_mode=1 or btl_ofi_mode=2.
angainor changed the title from "Segfault when choosing ofi btl on One-Sided mode" to "Segfault when choosing ofi btl in One-Sided mode" on Jan 28, 2025
The segfault occurs in file pml_ob1_isend.c.