kokkos kernels: broken unit test w/ cuda 12.4 on h100 gpus with UVM enabled #2316

Open
vasylivy opened this issue Aug 27, 2024 · 11 comments

@vasylivy

Hi,

I've been testing Trilinos and came across a broken Kokkos Kernels unit test on H100s with CUDA 12.4. I have not tried to reproduce the failure standalone, but figured I'd report it. See configuration 1 reported in trilinos/Trilinos#13397. The following test fails:

KokkosKernels_blocksparse_cuda_MPI_1
Cuda.sparse_bsr_gauss_seidel_rank1_double_int_int_TestDevice
(ptr->cuda_stream_synchronize_wrapper(stream)) error( cudaErrorIllegalAddress): an illegal memory access was encountered 

Thanks,

Yaro

@lucbv
Contributor

lucbv commented Aug 27, 2024

Thanks for reporting this, @vasylivy. We will have a look!

@ndellingwood
Contributor

I tested on Blake with cuda/12.0+gcc/11.3.0 on H100 (cuda/12.4 is available there, but the driver only supports up to cuda/12.2).

The test passes in both cases: with no TPLs enabled and with CUSPARSE enabled.

Here are some reference notes on attempts to reproduce (no TPLs enabled in post below)

ssh blake
salloc -N 1 -p H100
module load cmake gcc/11.3.0 cuda/12.0.0

# kokkos configuration
cmake -DCMAKE_CXX_COMPILER=$KOKKOS_PATH/bin/nvcc_wrapper -DCMAKE_INSTALL_PREFIX=$KOKKOS_INSTALL -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_H100=ON -DKokkos_ENABLE_TESTS=OFF -DKokkos_ENABLE_EXAMPLES=OFF -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_CXX_EXTENSIONS=OFF -DBUILD_SHARED_LIBS=OFF -DKokkos_ENABLE_DEPRECATION_WARNINGS=OFF -DKokkos_ENABLE_DEPRECATED_CODE_4=OFF -DKokkos_ENABLE_DEPRECATION_WARNINGS=OFF $KOKKOS_PATH

# kokkos-kernels configuration
cmake -DCMAKE_CXX_COMPILER=$KOKKOS_PATH/bin/nvcc_wrapper -DKokkos_DIR=$KOKKOS_INSTALL/lib64/cmake/Kokkos -DKokkosKernels_ENABLE_TESTS_AND_PERFSUITE=OFF -DKokkosKernels_ENABLE_TESTS=ON -DKokkosKernels_ENABLE_PERFTESTS=ON -DKokkosKernels_ENABLE_EXAMPLES:BOOL=ON -DCMAKE_EXPORT_COMPILE_COMMANDS:BOOL=OFF -DKokkosKernels_ENABLE_TPL_ROCSPARSE=OFF -DKokkosKernels_ENABLE_TPL_ROCBLAS=OFF -DKokkosKernels_ENABLE_TPL_CUSOLVER=OFF -DKokkosKernels_ENABLE_TPL_CUSPARSE=OFF -DKokkosKernels_ENABLE_TPL_CUBLAS=OFF -DBUILD_SHARED_LIBS=OFF -DKokkosKernels_ENABLE_DOCS=OFF $KOKKOSKERNELS_PATH

I'm not sure at the moment where to test on H100 with cuda/12.4; I will need to find a machine.

@vasylivy
Author

The other configuration in that issue, a slight tweak of config 1, did pass all tests. The machine is down at the moment, so I can't test anything. Is UVM enabled by default with Kokkos?

Yaro

@ndellingwood
Contributor

@vasylivy ah, I didn't enable UVM in my testing. I'll do that now and retest.

@ndellingwood
Contributor

Yep, with UVM enabled I see the same test fail with cuda/12.0 on H100 in the build with TPLs enabled (note the different error code):

[ RUN      ] Cuda.sparse_bsr_gauss_seidel_rank1_double_int_int_TestDevice
(ptr->cuda_stream_synchronize_wrapper(stream)) error( cudaErrorInvalidAddressSpace): operation not supported on global/shared address space /home/ndellin/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:165
Backtrace:
[0x9e1813] 
[0x9c2af7] 
[0x9e7661] 
[0x9e7f9a] 
[0x8f6f7a] 
[0x8f0ee4] 
[0x8f09b2] 
[0x7372f9] 
[0x7c8c6f] 
[0x7c2849] 
[0x7bf2a6] 
[0x493b0e] 
[0x46d19a] 
[0x44e367] 
[0x42f12a] 
[0x423b25] 
[0x40eef9] 
[0x9b7dee] 
[0x9b3656] 
[0x99b704] 
[0x99be88] 
[0x99c44d] 
[0x9a2ab0] 
[0x9b8e6b] 
[0x9b4432] 
[0x9a1980] 
[0x40dc60] 
[0x40d739] 
[0x7fda40bfed85] __libc_start_main
[0x40d5fe] 
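
The backtrace above contains only raw addresses. A minimal sketch of one way to turn them into symbol names with binutils `addr2line` follows. It assumes the test executable is not position-independent and was built with debug info; the executable and log file names in the usage comment are placeholders, not names taken from this build.

```shell
# Extract the bracketed hex addresses from a Kokkos backtrace read on stdin,
# one address per line, in the original frame order.
backtrace_addrs() {
  grep -oE '\[0x[0-9a-f]+\]' | tr -d '[]'
}

# Hypothetical usage: backtrace.log and ./KokkosKernels_sparse_cuda are
# placeholder names for the saved test output and the test executable.
# backtrace_addrs < backtrace.log | addr2line -C -f -p -e ./KokkosKernels_sparse_cuda
```

Frames that live in shared libraries (for example the `__libc_start_main` frame) will not resolve against the test executable and can be ignored.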

@ndellingwood
Contributor

Reproducer configuration notes for Blake:

ssh blake
salloc -N 1 -p H100
module load cmake gcc/11.3.0 cuda/12.0.0

# kokkos configuration
cmake -DCMAKE_CXX_COMPILER=$KOKKOS_PATH/bin/nvcc_wrapper -DCMAKE_INSTALL_PREFIX=$KOKKOS_INSTALL -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_H100=ON -DKokkos_ENABLE_TESTS=OFF -DKokkos_ENABLE_EXAMPLES=OFF -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_CXX_EXTENSIONS=OFF -DBUILD_SHARED_LIBS=OFF -DKokkos_ENABLE_DEPRECATION_WARNINGS=OFF -DKokkos_ENABLE_DEPRECATED_CODE_4=ON -DKokkos_ENABLE_DEPRECATION_WARNINGS=OFF -DKokkos_ENABLE_CUDA_UVM=ON $KOKKOS_PATH

# kokkos-kernels configuration
cmake -DCMAKE_CXX_COMPILER=$KOKKOS_PATH/bin/nvcc_wrapper -DKokkos_DIR=$KOKKOS_INSTALL/lib64/cmake/Kokkos -DKokkosKernels_ENABLE_TESTS_AND_PERFSUITE=OFF -DKokkosKernels_ENABLE_TESTS=ON -DKokkosKernels_ENABLE_PERFTESTS=ON -DKokkosKernels_ENABLE_EXAMPLES:BOOL=ON -DCMAKE_EXPORT_COMPILE_COMMANDS:BOOL=OFF -DKokkosKernels_ENABLE_TPL_ROCSPARSE=OFF -DKokkosKernels_ENABLE_TPL_ROCBLAS=OFF -DKokkosKernels_ENABLE_TPL_CUSOLVER=OFF -DKokkosKernels_ENABLE_TPL_CUSPARSE=ON -DKokkosKernels_ENABLE_TPL_CUBLAS=ON -DBUILD_SHARED_LIBS=OFF -DKokkosKernels_ENABLE_DOCS=OFF  -DKokkosKernels_INST_MEMSPACE_CUDAUVMSPACE=ON $KOKKOSKERNELS_PATH
  • Enabling UVM also required setting -DKokkos_ENABLE_DEPRECATED_CODE_4=ON
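
To double-check whether an installed Kokkos was actually configured with UVM, one option is to look for the `KOKKOS_ENABLE_CUDA_UVM` define in the generated `KokkosCore_config.h`. The sketch below is only an illustration; the helper name is made up here, and the header location under `$KOKKOS_INSTALL/include` is an assumption about the install layout.

```shell
# Succeed (exit status 0) if the given KokkosCore_config.h contains an active
# "#define KOKKOS_ENABLE_CUDA_UVM" line. A commented-out "#undef" does not match.
kokkos_uvm_enabled() {
  grep -Eq '^[[:space:]]*#[[:space:]]*define[[:space:]]+KOKKOS_ENABLE_CUDA_UVM([[:space:]]|$)' "$1"
}

# Hypothetical usage (header path is an assumption):
# if kokkos_uvm_enabled "$KOKKOS_INSTALL/include/KokkosCore_config.h"; then
#   echo "UVM enabled"
# else
#   echo "UVM not enabled"
# fi
```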

@ndellingwood
Contributor

These graph tests also failed in that build:

[  FAILED  ] Cuda.graph_random_graph_coarsen_double_int_int_TestDevice
[  FAILED  ] Cuda.graph_grid_graph_multilevel_coarsen_double_int_int_TestDevice

@ndellingwood
Contributor

It looks like the issue also exists with other CUDA versions on Hopper when UVM is enabled.

cuda/11.8.0+gcc/11.3.0:

12:41:09 The following tests FAILED:
12:41:09 	 13 - graph_cuda (NUMERICAL)
12:41:09 	 15 - sparse_cuda (Subprocess aborted)
12:41:09 	 16 - blocksparse_cuda (Timeout)

More details:

12:10:08 [ RUN      ] Cuda.graph_random_graph_coarsen_double_int_int_TestDevice
12:10:08 /home/jenkins/blake-new/workspace/KokkosKernels_Nightly_Blake_Cuda_11_8_0_Gcc_11_3_0_Hopper90-cusparse-cublas-uvm/kokkos-kernels/graph/unit_test/Test_Graph_coarsen.hpp:353: Failure
12:10:08 Value of: correct_graph
12:10:08   Actual: false
12:10:08 Expected: true
12:10:08 Coarsening with dedupe method 1 produced invalid graph with aggregation heuristic 3.
...
12:11:25 [ RUN      ] Cuda.sparse_gauss_seidel_asymmetric_rank1_double_int64_t_int_TestDevice
12:11:25 (ptr->cuda_stream_synchronize_wrapper(stream)) error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/blake-new/workspace/KokkosKernels_Nightly_Blake_Cuda_11_8_0_Gcc_11_3_0_Hopper90-cusparse-cublas-uvm/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:165
12:11:25 Backtrace:
12:11:25 [0x1f03ee3] 
12:11:25 [0x1ef3b57] 
12:11:25 [0x1f09d31] 
12:11:25 [0x1f0a66a] 
12:11:25 [0x1a41347] 
12:11:25 [0x1a37d79] 
12:11:25 [0x1a2b04e] 
12:11:25 [0x1a22dd5] 
12:11:25 [0x1a1bbe1] 
12:11:25 [0x1a157b0] 
12:11:25 [0x1a13d1d] 
12:11:25 [0x83c813] 
12:11:25 [0x6b65e9] 
12:11:25 [0x55d549] 
12:11:25 [0x554c91] 
12:11:25 [0x487bd8] 
12:11:25 [0x4150d7] 
12:11:25 [0x1ee9e22] 
12:11:25 [0x1ee5cb6] 
12:11:25 [0x1ece47e] 
12:11:25 [0x1ecec02] 
12:11:25 [0x1ecf1c7] 
12:11:25 [0x1ed582a] 
12:11:25 [0x1eeae8f] 
12:11:25 [0x1ee691c] 
12:11:25 [0x1ed46fa] 
12:11:25 [0x40fd80] 
12:11:25 [0x40f859] 
12:11:25 [0x7f2fd3116d85] __libc_start_main
12:11:25 [0x40f71e] 
...
12:36:25 [ RUN      ] Cuda.sparse_bsr_gauss_seidel_rank2_double_int_size_t_TestDevice
# Timeout after 1500 sec

Results are similar with cuda/12.0, with or without TPLs.

@ndellingwood changed the title from "kokkos kernels: broken unit test w/ cuda 12.4 on h100 gpus" to "kokkos kernels: broken unit test w/ cuda 12.4 on h100 gpus with UVM enabled" on Aug 27, 2024
@ndellingwood
Contributor

ndellingwood commented Aug 27, 2024

Setting export CUDA_LAUNCH_BLOCKING=1 and export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 allowed blocksparse_cuda to pass; the others still failed.

Edit: this refers to a cuda/12.0 build with UVM enabled on Hopper, no TPLs.
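
For reference, a small sketch of how the two variables can be applied to a single test run without leaving them exported in the login shell. The wrapper name is made up for illustration, and the `ctest` line assumes it is run from the kokkos-kernels build directory.

```shell
# Run a command with CUDA_LAUNCH_BLOCKING and CUDA_MANAGED_FORCE_DEVICE_ALLOC
# set to 1. The subshell keeps the variables out of the calling shell.
run_cuda_serialized() {
  (
    export CUDA_LAUNCH_BLOCKING=1
    export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1
    "$@"
  )
}

# Hypothetical usage from the build directory:
# run_cuda_serialized ctest -R blocksparse_cuda --output-on-failure
```

`CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous, so a test that passes only with it set may point at a missing fence or a host/device race rather than at the kernel itself.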

@ndellingwood
Contributor

ndellingwood commented Aug 27, 2024

One more data point: I tested another configuration of the cuda/12.0 H100 no-TPL build, with UVM disabled in Kokkos but still enabled in KokkosKernels. That means these changes to the Kokkos config:

-DKokkos_ENABLE_DEPRECATED_CODE_4=OFF -DKokkos_ENABLE_CUDA_UVM=OFF

while leaving -DKokkosKernels_INST_MEMSPACE_CUDAUVMSPACE=ON for the kokkos-kernels config

In this case:

  • graph_cuda passed
  • sparse_cuda failed in a different subtest
[ RUN      ] Cuda.sparse_spiluk_double_int_int_TestDevice
(ptr->cuda_stream_synchronize_wrapper(stream)) error( cudaErrorMisalignedAddress): misaligned address /home/ndellin/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:165
Backtrace:
[0xf4dfe3] 
[0xf3da67] 
[0xf53e51] 
[0xf5476a] 
[0xa3f793] 
[0xa3ee2d] 
[0xa3e9ff] 
[0xa3e6d3] 
[0xa3e45f] 
[0xa3e2d5] 
[0x6bf4b0] 
[0x5ecc62] 
[0x6c37c3] 
[0x5f716d] 
[0x5eae74] 
[0x4f4d8f] 
[0x46d5bf] 
[0x4147c7] 
[0xf33d28] 
[0xf2fb84] 
[0xf1834c] 
[0xf18ad0] 
[0xf19095] 
[0xf1f6f8] 
[0xf34d95] 
[0xf307ea] 
[0xf1e5c8] 
[0x40ec90] 
[0x40e769] 
[0x7f84455c5d85] __libc_start_main
[0x40e62e] 
  • blocksparse_cuda passed

Edit: to clarify, the results here are the same with and without deprecated code (that is, with either -DKokkos_ENABLE_DEPRECATED_CODE_4=OFF or -DKokkos_ENABLE_DEPRECATED_CODE_4=ON).

@ndellingwood
Contributor

Summarizing the multiple comments I added above:

Testing was done on the Blake H100 queue (Hopper GPUs) with cuda/12.0 and no TPLs enabled.

This table summarizes the UVM combo triggering test failures:

| Failing test | Kokkos_ENABLE_CUDA_UVM | KokkosKernels_INST_MEMSPACE_CUDAUVMSPACE | Notes |
| --- | --- | --- | --- |
| Cuda.graph_random_graph_coarsen_double_int_int_TestDevice | on | on | |
| Cuda.sparse_gauss_seidel_asymmetric_rank1_double_int64_t_int_TestDevice | on | on | |
| Cuda.sparse_bsr_gauss_seidel_rank2_double_int_size_t_TestDevice | on | on | times out; passes with CUDA_LAUNCH_BLOCKING=1 |
| Cuda.sparse_spiluk_double_int_int_TestDevice | off | on | |

  • Setting -DKokkos_ENABLE_CUDA_UVM=ON requires -DKokkos_ENABLE_DEPRECATED_CODE_4=ON
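
To rerun only the failing subtests rather than a whole suite, a GoogleTest filter can be built from the names in the summary above. This is a sketch; the helper name and the executable path in the usage comment are hypothetical and depend on the build layout.

```shell
# Join the given test names into a single GoogleTest filter argument.
# GoogleTest separates multiple patterns with ':'.
gtest_filter() {
  filter=""
  for name in "$@"; do
    filter="${filter:+$filter:}$name"
  done
  printf '%s\n' "--gtest_filter=$filter"
}

# Hypothetical usage (executable path is a placeholder):
# ./KokkosKernels_sparse_cuda "$(gtest_filter \
#   Cuda.sparse_gauss_seidel_asymmetric_rank1_double_int64_t_int_TestDevice \
#   Cuda.sparse_spiluk_double_int_int_TestDevice)"
```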
