
Encountered "named symbol not found" error when I tried to run PMF on RTX4060 #809

Open
himcraft opened this issue Jun 4, 2024 · 2 comments



himcraft commented Jun 4, 2024

Hello. Recently I wanted to run the PMF case on my laptop GPU. I changed USE_CUDA to TRUE in the GNUmakefile and recompiled following the instructions in the documentation. However, when I run ./PeleC3d.gnu.CUDA.ex pmf-lidryer-cvode.inp, it prints:

Initializing AMReX (23.12-8-g43d71da32fa4)...
Initializing CUDA...
CUDA initialized with 1 device.
amrex::Abort::0::GPU last error detected in file /home/himcraft/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 885: named symbol not found !!!
SIGABRT
See Backtrace.0 file for details

The contents of Backtrace.0 are:

Host Name: himcraft
=== If no file names and line numbers are shown below, one can run
            addr2line -Cpfie my_exefile my_line_address
    to convert `my_line_address` (e.g., 0x4a6b) into file name and line number.
    Or one can use amrex/Tools/Backtrace/parse_bt.py.

=== Please note that the line number reported by addr2line may not be accurate.
    One can use
            readelf -wl my_exefile | grep my_line_address'
    to find out the offset for that line.

 0: ./PeleC3d.gnu.CUDA.ex(+0x27f0e0) [0x55603e1a90e0]
    amrex::BLBackTrace::print_backtrace_info(_IO_FILE*) at /usr/include/x86_64-linux-gnu/bits/unistd.h:349
 (inlined by) amrex::BLBackTrace::print_backtrace_info(_IO_FILE*) at /home/himcraft/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_BLBackTrace.cpp:199

 1: ./PeleC3d.gnu.CUDA.ex(+0x280f25) [0x55603e1aaf25]
    amrex::BLBackTrace::handler(int) at /home/himcraft/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_BLBackTrace.cpp:99

 2: ./PeleC3d.gnu.CUDA.ex(+0x141001) [0x55603e06b001]
    std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_is_local() const at /usr/include/c++/9/bits/basic_string.h:226
 (inlined by) std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_dispose() at /usr/include/c++/9/bits/basic_string.h:235
 (inlined by) std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() at /usr/include/c++/9/bits/basic_string.h:662
 (inlined by) amrex::Gpu::ErrorCheck(char const*, int) at /home/himcraft/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuError.H:54
 (inlined by) std::enable_if<amrex::MaybeDeviceRunnable<__nv_dl_wrapper_t<__nv_dl_tag<void (*)(unsigned long), &(anonymous namespace)::ResizeRandomSeed, 1u>, unsigned long, curandStateXORWOW*>, void>::value, void>::type amrex::ParallelFor<256, int, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(unsigned long), &(anonymous namespace)::ResizeRandomSeed, 1u>, unsigned long, curandStateXORWOW*>, void>(amrex::Gpu::KernelInfo const&, int, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(unsigned long), &(anonymous namespace)::ResizeRandomSeed, 1u>, unsigned long, curandStateXORWOW*>&&) at /home/himcraft/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:885
 (inlined by) void amrex::ParallelFor<int, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(unsigned long), &(anonymous namespace)::ResizeRandomSeed, 1u>, unsigned long, curandStateXORWOW*>, void>(int, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(unsigned long), &(anonymous namespace)::ResizeRandomSeed, 1u>, unsigned long, curandStateXORWOW*>&&) at /home/himcraft/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:1457
 (inlined by) (anonymous namespace)::ResizeRandomSeed(unsigned long) at /home/himcraft/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_Random.cpp:60

 3: ./PeleC3d.gnu.CUDA.ex(+0x105894) [0x55603e02f894]
    amrex::Initialize(int&, char**&, bool, int, std::function<void ()> const&, std::ostream&, std::ostream&, void (*)(char const*)) at /home/himcraft/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX.cpp:628

 4: ./PeleC3d.gnu.CUDA.ex(+0x48492) [0x55603df72492]
    std::_Function_base::~_Function_base() at /usr/include/c++/9/bits/std_function.h:259
 (inlined by) std::function<void ()>::~function() at /usr/include/c++/9/bits/std_function.h:369
 (inlined by) main at /home/himcraft/PeleC/Exec/RegTests/PMF/../../../Source/main.cpp:58

 5: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f1dac95c083]

 6: ./PeleC3d.gnu.CUDA.ex(+0x4a13e) [0x55603df7413e]
    ?? ??:0
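
For reference, each frame can also be resolved manually with the addr2line invocation suggested in the backtrace header, using the in-binary offset shown in parentheses; e.g. for frame 0:

    addr2line -Cpfie ./PeleC3d.gnu.CUDA.ex 0x27f0e0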

The CUDA version is 12.1. Could it be that my CUDA driver is installed incorrectly? However, I can run my own .cu code without errors.
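
For completeness, the build-and-run sequence was roughly the following (a sketch only: the parallel make count is illustrative, and USE_CUDA is the only GNUmakefile option I changed):

    cd PeleC/Exec/RegTests/PMF
    # in GNUmakefile: USE_CUDA = TRUE
    make -j 8
    ./PeleC3d.gnu.CUDA.ex pmf-lidryer-cvode.inp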

Thanks in advance.


SRkumar97 commented Jun 24, 2024

Hi, I am also facing a similar issue.
I tried to run the EB-C14 compression ramp case on a dedicated GPU cluster with the CUDA and MPI flags set to TRUE. I kept nprocs=16 and ngpu=1, so np=16.
The case fails to start, reporting an out-of-memory error from AMReX_Arena.cpp, plus an error from line 749 of the same AMReX_GpuLaunchFunctsG.H file:

Multiple GPUs are visible to each MPI rank, This may lead to incorrect or suboptimal rank-to-GPU mapping.!
There are more MPI processes than the number of GPUs.!
amrex::Abort::10::CUDA error 2 in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_Arena.cpp line 193: out of memory !!!
SIGABRT
amrex::Abort::9::CUDA error 2 in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_Arena.cpp line 193: out of memory !!!
SIGABRT
amrex::Abort::1::CUDA error 2 in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_Arena.cpp line 193: out of memory !!!
SIGABRT
amrex::Abort::4::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::12::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::7::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::14::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::8::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::6::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::15::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::0::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::13::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::5::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::2::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::3::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
amrex::Abort::11::GPU last error detected in file /nlsasfs/home/fdmod/srkumar/PeleC/Submodules/PelePhysics/Submodules/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 749: invalid device function !!!
SIGABRT
See Backtrace.4 file for details
See Backtrace.12 file for details
See Backtrace.7 file for details
See Backtrace.15 file for details
See Backtrace.6 file for details
See Backtrace.14 file for details
See Backtrace.5 file for details
See Backtrace.13 file for details
See Backtrace.0 file for details
See Backtrace.8 file for details
See Backtrace.9 file for details
See Backtrace.1 file for details
See Backtrace.2 file for details
See Backtrace.10 file for details
See Backtrace.3 file for details
See Backtrace.11 file for details

MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
with errorcode 6.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

[scn50-mn:1634243] 9 more processes have sent help message help-mpi-api.txt / mpi-abort
[scn50-mn:1634243] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

The first issue, i.e. the out-of-memory error caused by the number of MPI processes, goes away once I set the np count equal to the number of GPUs. However, the second error, reported from AMReX_GpuLaunchFunctsG.H, is still there.
Requesting help!
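
For reference, the adjusted launch looks roughly like this (a sketch only: the executable name follows the usual AMReX suffix convention for an MPI+CUDA build, and the inputs file is a placeholder):

    # one MPI rank per available GPU (here ngpu=1)
    mpirun -np 1 ./PeleC3d.gnu.MPI.CUDA.ex <inputs-file>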

himcraft (Author) commented

It is probably due to the CUDA version, in my opinion.

When I switched to CUDA 12.6 instead of 12.2, the error disappeared. The same error occurred when I recently tested on another HPC system using CUDA 12.2.
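
A quick way to confirm which toolkit a build actually picks up (the module command is an assumption; exact module names differ between clusters):

    nvcc --version          # toolkit used to compile, e.g. 12.2 vs 12.6
    nvidia-smi              # driver version and the highest CUDA version it supports
    module load cuda/12.6   # assumption: module name varies by system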
