
Segfault sometimes when building with CUDA support #13044

Open
G-Ragghianti opened this issue Jan 16, 2025 · 3 comments


G-Ragghianti commented Jan 16, 2025

Background information

What version of Open MPI are you using?

  • v5.0.6
  • v5.0.0

Describe how Open MPI was installed

Source install in docker:

FROM rockylinux:9

RUN dnf -y group install development
COPY cuda.repo /etc/yum.repos.d/
RUN dnf -y install epel-release gfortran
RUN dnf -y module install nvidia-driver:555-dkms
RUN dnf -y install cuda-12.5.0
WORKDIR /tmp
RUN curl https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.6.tar.bz2 | tar -jx
RUN cd openmpi* && ./configure --enable-cuda --enable-debug && make -j && make install
RUN echo -e "#include <mpi.h>\n int main(void) { MPI_Init(NULL, NULL); MPI_Finalize(); }" > /tmp/test.c
USER nobody
CMD mpicc -o /tmp/test /tmp/test.c && \
    for i in {0..99}; do \
       gdb -ex "set confirm off" -ex r -ex q --args mpirun -n 8 /tmp/test; \
    done

Please describe the system on which you are running

  • Operating system/version: Docker with rockylinux:9, kernel 6.11.9-100.fc39.x86_64
  • Computer hardware: x86_64 Intel Core i9, no NVIDIA GPU
  • Network type: loopback only

Details of the problem

Segfault occurs roughly 5% of the time when running a minimal test program (only MPI_Init and MPI_Finalize). It happens when building with CUDA support on a machine without NVIDIA hardware, and appears to occur only when hwloc is built with CUDA support while OMPI itself is built without it. So far I have been unable to get a useful stack trace, for reasons unclear to me; help with this is appreciated. The Dockerfile to reproduce is above. Example output:

$ docker build -t test .
$ docker run -it --rm test
GNU gdb (Rocky Linux) 14.2-3.el9
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from mpirun...
Starting program: /usr/local/bin/mpirun -n 8 /tmp/test
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
process 625 is executing new program: /usr/local/bin/prte
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-125.el9_5.1.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Detaching after vfork from child process 628]
[Detaching after fork from child process 630]
[Detaching after fork from child process 632]
[Detaching after fork from child process 634]
[Detaching after fork from child process 636]
[New Thread 0x7f69273ed640 (LWP 638)]
[Detaching after vfork from child process 639]
[Detaching after fork from child process 641]
[Detaching after fork from child process 643]
[Detaching after fork from child process 645]
[Detaching after fork from child process 647]
[Detaching after vfork from child process 649]
[Detaching after vfork from child process 651]
[Detaching after vfork from child process 653]
[New Thread 0x7f6928ef3640 (LWP 655)]
[New Thread 0x7f69286f2640 (LWP 656)]
[Detaching after fork from child process 657]
[Detaching after fork from child process 658]
[Detaching after fork from child process 659]
[Detaching after fork from child process 660]
[Detaching after fork from child process 661]
[Detaching after fork from child process 662]
[Detaching after fork from child process 663]
[Detaching after fork from child process 665]

Thread 2 "cuda00002c0000c" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f69273ed640 (LWP 638)]
0x00007f692762f64f in ?? ()
G-Ragghianti (Author) commented:

@abouteiller


csdewa commented Jan 20, 2025

To resolve the segmentation fault with Open MPI in Docker, ensure proper compatibility between CUDA and hwloc builds. Specifically:

  • Avoid building hwloc with CUDA support if the hardware doesn't have NVIDIA GPUs.
  • Rebuild Open MPI with consistent --enable-cuda and --enable-debug flags matching your system setup.
  • Add debugging symbols and use GDB to investigate further.

For more insights, share detailed GDB backtraces and verify all dependencies are correctly installed.

Best regards,
Nekopoi

G-Ragghianti (Author) commented:

Thank you for the reply. I understand that building without CUDA support will avoid the problem, but in this case the Open MPI installation is for a heterogeneous cluster and will be used on all systems, whether or not they have NVIDIA hardware. It seems like this should work, and it does 95% of the time.

I have built with --enable-debug and compiled everything in the software stack with "-g", but gdb still does not show a detailed backtrace with symbols. What do I need to do to get this information?
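One avenue for the missing symbols, sketched from the gdb session above (illustrative commands, not verified against this reproducer): the faulting frame `0x00007f692762f64f in ?? ()` may simply sit in a library that shipped without debuginfo, in which case mapping the address to its shared object is more telling than rebuilding with -g:

```
(gdb) set confirm off
(gdb) run
# ... Thread 2 "cuda00002c0000c" received signal SIGSEGV ...
(gdb) thread apply all bt       # backtrace every thread, not just the current one
(gdb) info sharedlibrary        # find which loaded object owns 0x7f692762f64f
(gdb) info proc mappings        # fallback: raw address-space map if no library matches
```

If `info sharedlibrary` attributes the address to a library built outside this stack, installing its debuginfo package (as the earlier `dnf debuginfo-install` hint suggests for glibc) is the next step.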
