Background information

What version of Open MPI are you using?
v5.0.6
v5.0.0

Describe how Open MPI was installed
Source install in Docker:
FROM rockylinux:9
RUN dnf -y group install development
COPY cuda.repo /etc/yum.repos.d/
RUN dnf -y install epel-release gfortran
RUN dnf -y module install nvidia-driver:555-dkms
RUN dnf -y install cuda-12.5.0
WORKDIR /tmp
RUN curl https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.6.tar.bz2 | tar -jx
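# NOTE: Open MPI's configure takes --with-cuda=<path>; --enable-cuda appears to be
# unrecognized, which would match the observation below that OMPI itself ends up
# without CUDA support while the embedded hwloc still picks up the CUDA toolkit.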
RUN cd openmpi* && ./configure --enable-cuda --enable-debug && make -j && make install
RUN echo -e "#include <mpi.h>\n int main(void) { MPI_Init(NULL, NULL); MPI_Finalize(); }" > /tmp/test.c
USER nobody
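# Build the test and run it repeatedly under gdb; the crash reproduces roughly
# 5% of the time, so 100 iterations almost always hit it at least once.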
CMD mpicc -o /tmp/test /tmp/test.c && \
for i in {0..99}; do \
gdb -ex "set confirm off" -ex r -ex q --args mpirun -n 8 /tmp/test; \
done
Please describe the system on which you are running
Operating system/version: Docker with rockylinux:9, kernel 6.11.9-100.fc39.x86_64
Computer hardware: x86_64 Intel Core i9, no NVIDIA GPU
Network type: loopback only
Details of the problem
A segfault occurs about 5% of the time when running a minimal test program (only MPI_Init and MPI_Finalize). It happens when building with CUDA support on a machine without NVIDIA hardware, and appears to require that hwloc be built with CUDA support while OMPI itself is built without it. So far I have been unable to get a useful stack trace; help with that would be appreciated. The Dockerfile to reproduce is above. Example output:
$ docker build -t test .
$ docker run -it --rm test
GNU gdb (Rocky Linux) 14.2-3.el9
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty"for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration"for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from mpirun...
Starting program: /usr/local/bin/mpirun -n 8 /tmp/test
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
process 625 is executing new program: /usr/local/bin/prte
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-125.el9_5.1.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Detaching after vfork from child process 628]
[Detaching after fork from child process 630]
[Detaching after fork from child process 632]
[Detaching after fork from child process 634]
[Detaching after fork from child process 636]
[New Thread 0x7f69273ed640 (LWP 638)]
[Detaching after vfork from child process 639]
[Detaching after fork from child process 641]
[Detaching after fork from child process 643]
[Detaching after fork from child process 645]
[Detaching after fork from child process 647]
[Detaching after vfork from child process 649]
[Detaching after vfork from child process 651]
[Detaching after vfork from child process 653]
[New Thread 0x7f6928ef3640 (LWP 655)]
[New Thread 0x7f69286f2640 (LWP 656)]
[Detaching after fork from child process 657]
[Detaching after fork from child process 658]
[Detaching after fork from child process 659]
[Detaching after fork from child process 660]
[Detaching after fork from child process 661]
[Detaching after fork from child process 662]
[Detaching after fork from child process 663]
[Detaching after fork from child process 665]
Thread 2 "cuda00002c0000c" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f69273ed640 (LWP 638)]
0x00007f692762f64f in ?? ()
Thank you for the reply. I understand that building without CUDA support will avoid the problem, but in this case the Open MPI installation is for a heterogeneous cluster and is meant to be used on all systems, whether they have NVIDIA hardware or not. It seems like this should work, and it does 95% of the time.
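If the hwloc hypothesis is right, a hedged workaround that may be worth trying on GPU-less nodes (untested here) is to blacklist hwloc's CUDA plugin at runtime rather than at build time: hwloc(7) documents an HWLOC_COMPONENTS environment variable where a leading minus excludes a component by name. A minimal sketch, assuming the crash really is in hwloc's cuda component:

# Untested sketch: exclude hwloc's "cuda" component at runtime so the
# same CUDA-enabled build can run on nodes without NVIDIA hardware.
# HWLOC_COMPONENTS is documented in hwloc(7); "-name" blacklists a component.
export HWLOC_COMPONENTS=-cuda
mpirun -n 8 /tmp/test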
I have built it with --enable-debug and used -g throughout the software stack, but I still do not get a detailed backtrace with symbols. What do I need to do to get this information?
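One detail worth checking: the gdb invocation in the Dockerfile runs r and then immediately q, so gdb quits at the SIGSEGV without ever being asked to print a stack. A sketch that requests backtraces explicitly, plus the debuginfo install that gdb itself suggests in the log above (package version copied from that log; adjust to match your system):

# Install the glibc debuginfo that gdb asked for in the log above
# (may require dnf-plugins-core and the debuginfo repositories).
dnf -y debuginfo-install glibc-2.34-125.el9_5.1.x86_64

# Batch mode: run, then dump every thread's stack once the SIGSEGV stops
# the program. The fault is in a thread of mpirun/prte itself, so no
# fork-following settings should be needed for this particular crash.
gdb --batch -ex "set confirm off" -ex r \
    -ex "thread apply all bt full" \
    --args mpirun -n 8 /tmp/test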