Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: getrs serial internal implementations #2488

Merged
merged 3 commits into from
Feb 3, 2025

Conversation

yasahi-hpc
Copy link
Contributor

@yasahi-hpc yasahi-hpc commented Jan 27, 2025

Fixes #2485

  • unit-test passes with H100 and Cuda 12.0.0. There seems to be a compiler bug in Cuda 12.0.0 which applies an aggressive loop unroll that crashes the SerialLaswpVectorBackwardInternal. This can be avoided by disallowing the loop unrolling inside this function. I did not observe failures for other Cuda versions.
  • Remove using namespace KokkosBatched from getrs unit-test

@cwpearson
It seems fine on my env, but could you please test on your side?

Copy link
Contributor

@lucbv lucbv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's give it a try

@lucbv lucbv added the AT2-SPECIAL-APPROVAL Mark .github changes as approved. label Jan 27, 2025
@cwpearson
Copy link
Contributor

Any theories as to why removing the wrapping struct fixes this?

@lucbv lucbv removed the AT2-SPECIAL-APPROVAL Mark .github changes as approved. label Jan 27, 2025
@lucbv
Copy link
Contributor

lucbv commented Jan 27, 2025

I guess instead of overloading the whole struct you only get the call to invoke to be instantiated, I am not sure how that affects the other members of the struct?

@yasahi-hpc
Copy link
Contributor Author

Any theories as to why removing the wrapping struct fixes this?

Not very clear to me. Even if it there is a bug in previous implementation, the failure for Transpose only with BlkSize >= 4 is difficult to understand.
@cwpearson Does this change fix the issue on your env?

I guess instead of overloading the whole struct you only get the call to invoke to be instantiated, I am not sure how that affects the other members of the struct?

I agree. At least, the current implementation is straightforward and is typically used in batched functions.

@cwpearson
Copy link
Contributor

cwpearson commented Jan 28, 2025

Fortunately (for my sanity) but unfortunately for #2485, the issue is not quite resolved. Slight difference to how it manifested in 2485:

__remote_shared__ read rather than __global__

[ RUN      ] Cuda.test_batched_getrs_nt_double
[       OK ] Cuda.test_batched_getrs_nt_double (212 ms)
[ RUN      ] Cuda.test_batched_getrs_t_double
.../kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 0.472168 vs 2.22045e-13
.../kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 0.632412 vs 2.22045e-13
.../kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 0.484741 vs 2.22045e-13
.../kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1.11126 vs 2.22045e-13
========= Invalid __remote_shared__ read of size 4 bytes
=========     at 0x2610 in void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double ***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<int **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<double **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>>, Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double ***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<int **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<double **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, int>::Reducer, void>, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>>, Kokkos::Cuda>>(T1)
=========     by thread (0,1,0) in block (0,0,0)
=========     Address 0x16ec9c in CTA 253 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x32e950]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x1490c]
=========                in /projects/x86-64-icelake-rocky8/tpls/cuda/12.0.0/gcc/11.3.0/base/ksndyya/lib64/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x6bb4b]
=========                in /projects/x86-64-icelake-rocky8/tpls/cuda/12.0.0/gcc/11.3.0/base/ksndyya/lib64/libcudart.so.12
=========     Host Frame:cudaError cudaLaunchKernel<char>(char const*, dim3, dim3, void**, unsigned long, CUstream_st*) [0x2d3a1]
=========                in .../build-kernels/batched/dense/unit_test/KokkosKernels_batched_dla_cuda
=========     Host Frame:__device_stub__ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_14ParallelReduceINS0_22CombinedFunctorReducerIN4Test5Getrs26Functor_BatchedSerialGetrsINS_6DeviceINS_4CudaENS_9CudaSpaceEEENS_4ViewIPPPdJNS_10LayoutLeftESA_EEENSB_IPPiJSF_SA_EEENSB_ISD_JSF_SA_EEENS5_8ParamTagIN10KokkosBlas5Trans9TransposeEEENSM_4Algo6Level39UnblockedEEENS0_15FunctorAnalysisINS0_23FunctorPatternInterface6REDUCEENS_11RangePolicyIJS8_SP_EEEST_iE7ReducerEvEESY_S8_EEEEvT_(Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, int>::Reducer, void>, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Kokkos::Cuda> const&) [0x25acc]
=========                in .../build-kernels/batched/dense/unit_test/KokkosKernels_batched_dla_cuda
=========     Host Frame:void Kokkos::Impl::__wrapper__device_stub_cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, int>::Reducer, void>, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Kokkos::Cuda> >(Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, int>::Reducer, void>, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Kokkos::Cuda> const&) [0x25aea]
=========                in .../build-kernels/batched/dense/unit_test/KokkosKernels_batched_dla_cuda

Yuuichi Asahi added 3 commits January 29, 2025 06:54
@yasahi-hpc
Copy link
Contributor Author

Fortunately (for my sanity) but unfortunately for #2485, the issue is not quite resolved. Slight difference to how it manifested in 2485:

__remote_shared__ read rather than __global__

[ RUN      ] Cuda.test_batched_getrs_nt_double
[       OK ] Cuda.test_batched_getrs_nt_double (212 ms)
[ RUN      ] Cuda.test_batched_getrs_t_double
.../kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 0.472168 vs 2.22045e-13
.../kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 0.632412 vs 2.22045e-13
.../kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 0.484741 vs 2.22045e-13
.../kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1.11126 vs 2.22045e-13
========= Invalid __remote_shared__ read of size 4 bytes
=========     at 0x2610 in void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double ***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<int **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<double **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>>, Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double ***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<int **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<double **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, int>::Reducer, void>, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>>, Kokkos::Cuda>>(T1)
=========     by thread (0,1,0) in block (0,0,0)
=========     Address 0x16ec9c in CTA 253 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x32e950]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x1490c]
=========                in /projects/x86-64-icelake-rocky8/tpls/cuda/12.0.0/gcc/11.3.0/base/ksndyya/lib64/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x6bb4b]
=========                in /projects/x86-64-icelake-rocky8/tpls/cuda/12.0.0/gcc/11.3.0/base/ksndyya/lib64/libcudart.so.12
=========     Host Frame:cudaError cudaLaunchKernel<char>(char const*, dim3, dim3, void**, unsigned long, CUstream_st*) [0x2d3a1]
=========                in .../build-kernels/batched/dense/unit_test/KokkosKernels_batched_dla_cuda
=========     Host Frame:__device_stub__ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_14ParallelReduceINS0_22CombinedFunctorReducerIN4Test5Getrs26Functor_BatchedSerialGetrsINS_6DeviceINS_4CudaENS_9CudaSpaceEEENS_4ViewIPPPdJNS_10LayoutLeftESA_EEENSB_IPPiJSF_SA_EEENSB_ISD_JSF_SA_EEENS5_8ParamTagIN10KokkosBlas5Trans9TransposeEEENSM_4Algo6Level39UnblockedEEENS0_15FunctorAnalysisINS0_23FunctorPatternInterface6REDUCEENS_11RangePolicyIJS8_SP_EEEST_iE7ReducerEvEESY_S8_EEEEvT_(Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, int>::Reducer, void>, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Kokkos::Cuda> const&) [0x25acc]
=========                in .../build-kernels/batched/dense/unit_test/KokkosKernels_batched_dla_cuda
=========     Host Frame:void Kokkos::Impl::__wrapper__device_stub_cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, int>::Reducer, void>, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Kokkos::Cuda> >(Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, int>::Reducer, void>, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Kokkos::Cuda> const&) [0x25aea]
=========                in .../build-kernels/batched/dense/unit_test/KokkosKernels_batched_dla_cuda

After further investigation, I found an issue in Laswp.
Can you give it a try again with the new fix.

My conclusion is that there is a compiler bug in Cuda 12.0.0 which crashes Laswp with an aggressive loop unrolling. As far as I am concerned, the error happens specifically with Cuda 12.0.0 and H100, so I added a directive to disallow loop unrolling for Cuda 12.0.0 and H100.

@yasahi-hpc yasahi-hpc requested a review from lucbv January 29, 2025 15:18
Copy link
Contributor

@lucbv lucbv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, unfortunate if that's the only solution but let's test at least and see how it goes.

@cwpearson
Copy link
Contributor

Awesome work, this seems to have done the trick.

@cwpearson
Copy link
Contributor

We've got about 73 similar backwards loops in Kokkos Kernels by a quick count, I wonder if any of the others are causing problems for CUDA 12 / H100?

@yasahi-hpc
Copy link
Contributor Author

Thank you for the confirmation.

We've got about 73 similar backwards loops in Kokkos Kernels by a quick count, I wonder if any of the others are causing problems for CUDA 12 / H100?

Since all the unit-testing passes, I think it is fine.
It seems that the issue is nothing to do with the backwards loops. I have modified the very kernel with the forward loop to get the same error. It seems to happen specifically in getrs not even laswp alone.

@lucbv lucbv merged commit 5e3ea56 into kokkos:develop Feb 3, 2025
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Out of bounds memory access in Cuda.test_batched_getrs_t_double unit test (H100, Cuda 12.0.0)
3 participants