
[BUG] Intermittent result discrepancy for NDS SF3K query86 on L40S #11835

Open
gerashegalov opened this issue Dec 6, 2024 · 3 comments
Labels: ? - Needs Triage, bug

gerashegalov (Collaborator) commented Dec 6, 2024

Describe the bug
The NDS SF3K CI pipeline exhibits intermittent query result validation failures for various queries.

It is difficult to reproduce, but I was able to reduce the scenario to running q36 and q86 one after another, which fails in more than 90% of the runs. I dropped LIMIT 100 from q86 to reduce the chance of nondeterminism.

select
    sum(ws_net_paid) as total_sum
   ,i_category
   ,i_class
   ,grouping(i_category)+grouping(i_class) as lochierarchy
   ,rank() over (
        partition by grouping(i_category)+grouping(i_class),
        case when grouping(i_class) = 0 then i_category end
        order by sum(ws_net_paid) desc) as rank_within_parent
 from
    web_sales
   ,date_dim       d1
   ,item
 where
    d1.d_month_seq between 1205 and 1205+11
 and d1.d_date_sk = ws_sold_date_sk
 and i_item_sk  = ws_item_sk
 group by rollup(i_category,i_class)
 order by
   lochierarchy desc,
   case when lochierarchy = 0 then i_category end,
   rank_within_parent

The full diff is large, but there is a single row with lochierarchy=2 (the grand-total row, where both i_category and i_class are rolled up) in both the result and the diff, so it is the row to focus on for tests:

-Row(total_sum=Decimal('988350515593.49'), i_category=None, i_class=None, lochierarchy=2, rank_within_parent=1)
+Row(total_sum=Decimal('988314086750.87'), i_category=None, i_class=None, lochierarchy=2, rank_within_parent=1)
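
For a focused check, the discrepancy can be reduced to comparing just that grand-total row between a GPU run and a CPU run. Below is a minimal sketch, not the CI harness: it assumes the NDS SF3K tables are already registered as temp views, and spark, q36_sql and q86_sql are hypothetical names for the session and the query text.

# Minimal sketch, not the CI harness: run q36 then q86 (as in the reduced repro)
# with the GPU plugin enabled and again on the CPU, then compare only the
# lochierarchy=2 grand-total row of q86. Assumes the NDS SF3K tables are already
# registered as temp views; spark, q36_sql and q86_sql are hypothetical names.
def q86_grand_total(spark, q36_sql, q86_sql, use_gpu):
    spark.conf.set("spark.rapids.sql.enabled", "true" if use_gpu else "false")
    spark.sql(q36_sql).collect()              # run q36 first
    rows = spark.sql(q86_sql).collect()       # then q86 (without LIMIT 100)
    return [r for r in rows if r.lochierarchy == 2]

gpu_total = q86_grand_total(spark, q36_sql, q86_sql, use_gpu=True)
cpu_total = q86_grand_total(spark, q36_sql, q86_sql, use_gpu=False)
assert gpu_total == cpu_total, f"GPU {gpu_total} != CPU {cpu_total}"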

This issue appears to have been introduced between build 33

Archive:  /home/spark/gshegalov/dist/rapids/rapids-4-spark_2.12-24.12.0-20241101.120509-33.jar
  inflating: spark320/rapids4spark-private-version-info.properties  
version=24.12.0-SNAPSHOT
user=root
revision=16a524fd6e1cfdbd8997227682ac638373b698b4
branch=HEAD
date=2024-11-01T07:46:20Z
url=https://gitlab-master.nvidia.com/nvspark/spark-rapids-private.git

  inflating: rapids4spark-version-info.properties  
version=24.12.0-SNAPSHOT
cudf_version=24.12.0-SNAPSHOT
user=root
revision=2134f2eb665769c4f57a41104ce475d25a7224ee
branch=HEAD
date=2024-11-01T11:49:48Z
url=https://github.com/NVIDIA/spark-rapids.git

  inflating: cudf-java-version-info.properties  
version=24.12.0-SNAPSHOT
user=root
revision=9657c9a5dc4c4a1bf9fd7b55cfeb53c60dda3c66
branch=HEAD
date=2024-11-01T06:27:54Z
url=https://github.com/rapidsai/cudf.git
gpu_architectures=70;75;80;86;90

  inflating: spark-rapids-jni-version-info.properties  
version=24.12.0-SNAPSHOT
user=root
revision=86a9e16f555a8189d8bce45ed0bf9ba898585e09
branch=HEAD
date=2024-11-01T06:27:56Z
url=https://github.com/NVIDIA/spark-rapids-jni.git
gpu_architectures=70;75;80;86;90

and build 42

Archive:  /home/spark/gshegalov/dist/rapids/rapids-4-spark_2.12-24.12.0-20241113.144142-42.jar
  inflating: spark320/rapids4spark-private-version-info.properties  
version=24.12.0-SNAPSHOT
user=root
revision=2f08e20170b66621d1f14ee0fb351ef5630ea811
branch=HEAD
date=2024-11-13T06:34:20Z
url=https://gitlab-master.nvidia.com/nvspark/spark-rapids-private.git

  inflating: rapids4spark-version-info.properties  
version=24.12.0-SNAPSHOT
cudf_version=24.12.0-SNAPSHOT
user=root
revision=862dab0789abb95f860d1d42e06dad4690ca79e2
branch=HEAD
date=2024-11-13T14:24:52Z
url=https://github.com/NVIDIA/spark-rapids.git

  inflating: cudf-java-version-info.properties  
version=24.12.0-SNAPSHOT
user=root
revision=487f97c036ae7919e98ddc8bf5412a8002a493c5
branch=HEAD
date=2024-11-13T08:01:08Z
url=https://github.com/rapidsai/cudf.git
gpu_architectures=70;75;80;86;90

  inflating: spark-rapids-jni-version-info.properties  
version=24.12.0-SNAPSHOT
user=root
revision=ea47ecb58cab8228a81088b8ac44e3b336a9ba7e
branch=HEAD
date=2024-11-13T08:01:12Z
url=https://github.com/NVIDIA/spark-rapids-jni.git
gpu_architectures=70;75;80;86;90

Given that the runs are not 100% reproducible, there is a chance the range is wider. I ran build 33 four times without reproducing the issue; build 42 reproduces the failure quickly.
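
The build metadata above was taken from the version-info properties files packaged inside the jars. A small sketch for pulling them out programmatically follows; the jar path is just an example and should point at the build under test.

# Sketch: print the version-info properties embedded in a rapids-4-spark jar to
# confirm exactly which snapshot build is on the classpath. The jar path below
# is an example only.
import zipfile

jar = "/home/spark/gshegalov/dist/rapids/rapids-4-spark_2.12-24.12.0-20241113.144142-42.jar"
with zipfile.ZipFile(jar) as zf:
    for name in zf.namelist():
        if name.endswith("version-info.properties"):
            print(f"== {name} ==")
            print(zf.read(name).decode("utf-8"))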

Steps/Code to reproduce bug
Open the notebook on a single node with an L40S GPU: https://github.com/gerashegalov/rapids-shell/blob/25ca477172f8ac45b71d0eed3452369299748284/src/jupyter/nds2-parquet-3k-snappy.ipynb

Expected behavior
Results must continue to match. These tests pass consistently on the same node when it is configured to use an H100 GPU instead.

Environment details (please complete the following information)

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:01:00.0 Off |                    0 |
| N/A   47C    P0             88W /  350W |   44863MiB /  46068MiB |      0%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
  • Environment location: [Standalone, YARN, Kubernetes, Cloud(specify cloud provider)]
  • Spark configuration settings related to the issue


gerashegalov added the ? - Needs Triage and bug labels on Dec 6, 2024
gerashegalov (Collaborator, Author) commented:

Reran the repro with two more modifications (a sketch of the resulting session setup follows this comment):

Disabled the async pool memory allocator:

 .config('spark.rapids.memory.gpu.pool', 'NONE')

The diff still consistently reproduces.

Added compute-sanitizer with the default memcheck tool.

The default check does not catch errors and seems to change concurrency in a way that the issue stops reproducing.
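
For reference, a minimal sketch of a session setup with that setting applied; the master URL and plugin configuration here are assumptions, and the real notebook may set more options.

# Sketch only: a local SparkSession with the RAPIDS plugin and the async pool
# allocator disabled. Only the spark.rapids.memory.gpu.pool setting is the
# modification under test; the other settings are assumed boilerplate.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
         .config("spark.rapids.sql.enabled", "true")
         .config("spark.rapids.memory.gpu.pool", "NONE")  # disable async pool allocator
         .getOrCreate())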

gerashegalov (Collaborator, Author) commented:

Pursued a conjecture that the issue reproduces only because of forward compatibility, since the jar carries no cubin sections for compute capability 89 (see gpu_architectures=70;75;80;86;90 above).

However, a build compiled specifically for compute capability 89 reproduced the issue just the same.
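
For context, a quick check (assuming the pynvml package is installed) of whether the local GPU's compute capability has a matching cubin in the gpu_architectures list embedded in the build metadata above:

# Illustration only: compare the device compute capability against the
# gpu_architectures list from the version-info output (70;75;80;86;90).
# An L40S reports 89, which is not in the list, so its kernels would run via
# forward compatibility rather than from a native cubin (the conjecture above).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
cc = major * 10 + minor
compiled = {70, 75, 80, 86, 90}
print(f"compute capability {cc}, native cubin in build: {cc in compiled}")
pynvml.nvmlShutdown()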

gerashegalov (Collaborator, Author) commented Dec 16, 2024

Running the executors under

compute-sanitizer --tool=initcheck

reports the following issue classes:

Uninitialized global memory

One instance looks intentional, given the __uninitialized_copy name in the stack trace:

========= Uninitialized __global__ memory read of size 8 bytes
=========     at void thrust::cuda_cub::__uninitialized_copy::functor<thrust::tuple<bool, long> *, thrust::pointer<thrust::tuple<bool, long>, thrust::detail::execute_with_allocator<rmm::mr::thrust_allocator<char>, thrust::cuda_cub::execute_on_stream_base>, thrust::use_default, thrust::use_default>>::operator ()<long>(T1)+0x1d0 in /home/gshegalov/gits/NVIDIA/spark-rapids-jni/target/libcudf/cmake-build/_deps/cccl-src/thrust/thrust/system/cuda/detail/uninitialized_copy.h:80
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x1940226400
=========     Device Frame:void cub::CUB_200500_890_NS::detail::for_each::agent_block_striped_t<cub::CUB_200500_890_NS::detail::for_each::policy_t<(int)256, (int)2>, long, thrust::cuda_cub::__uninitialized_copy::functor<thrust::tuple<bool, long> *, thrust::pointer<thrust::tuple<bool, long>, thrust::detail::execute_with_allocator<rmm::mr::thrust_allocator<char>, thrust::cuda_cub::execute_on_stream_base>, thrust::use_default, thrust::use_default>>>::consume_tile<(bool)0>(int, int)+0x120 in /home/gshegalov/gits/NVIDIA/spark-rapids-jni/target/libcudf/cmake-build/_deps/cccl-src/cub/cub/agent/agent_for.cuh:75
=========     Device Frame:void cub::CUB_200500_890_NS::detail::for_each::static_kernel<cub::CUB_200500_890_NS::detail::for_each::policy_hub_t::policy_350_t, long, thrust::cuda_cub::__uninitialized_copy::functor<thrust::tuple<bool, long> *, thrust::pointer<thrust::tuple<bool, long>, thrust::detail::execute_with_allocator<rmm::mr::thrust_allocator<char>, thrust::cuda_cub::execute_on_stream_base>, thrust::use_default, thrust::use_default>>>(T2, T3)+0x120 in /home/gshegalov/gits/NVIDIA/spark-rapids-jni/target/libcudf/cmake-build/_deps/cccl-src/cub/cub/device/dispatch/dispatch_for.cuh:155
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x29d79f]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x39b458a]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame: [0x39f41d7]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame: [0x1a58a65]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame: [0x1a5a408]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:cudf::io::parquet::detail::reader::impl::setup_next_pass(cudf::io::parquet::detail::reader::impl::read_mode) [0x1a5deee]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:cudf::io::parquet::detail::reader::impl::handle_chunking(cudf::io::parquet::detail::reader::impl::read_mode) [0x1a628ec]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:cudf::io::parquet::detail::reader::impl::read_chunk() [0x1a2f4a7]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:cudf::io::parquet::detail::chunked_reader::read_chunk() const [0x1a270bd]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:cudf::io::chunked_parquet_reader::read_chunk() const [0x18612df]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:Java_ai_rapids_cudf_ParquetChunkedReader_readChunk [0xb838c7]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame: [0xffffffffe7b7a77e]

But the other does not look intentional:

========= Uninitialized __global__ memory read of size 4 bytes
=========     at T1 cub::CUB_200500_890_NS::ThreadLoad<int>(T1 *, cub::CUB_200500_890_NS::Int2Type<(int)0>, cub::CUB_200500_890_NS::Int2Type<(int)1>)+0x3b0 in /home/gshegalov/gits/NVIDIA/spark-rapids-jni/target/libcudf/cmake-build/_deps/cccl-src/cub/cub/thread/thread_load.cuh:289
=========     by thread (2,0,0) in block (0,0,0)
=========     Address 0x1940224208
=========     Device Frame:std::iterator_traits<T2>::value_type cub::CUB_200500_890_NS::ThreadLoad<(cub::CUB_200500_890_NS::CacheLoadModifier)0, int *>(T2)+0x3b0 in /home/gshegalov/gits/NVIDIA/spark-rapids-jni/target/libcudf/cmake-build/_deps/cccl-src/cub/cub/thread/thread_load.cuh:354
=========     Device Frame:int cub::CUB_200500_890_NS::CacheModifiedInputIterator<(cub::CUB_200500_890_NS::CacheLoadModifier)0, int, int>::operator []<int>(T1) const+0x3b0 in /home/gshegalov/gits/NVIDIA/spark-rapids-jni/target/libcudf/cmake-build/_deps/cccl-src/cub/cub/iterator/cache_modified_input_iterator.cuh:217
=========     Device Frame:void cub::CUB_200500_890_NS::LoadDirectWarpStriped<int, (int)15, cub::CUB_200500_890_NS::CacheModifiedInputIterator<(cub::CUB_200500_890_NS::CacheLoadModifier)0, int, int>>(int, T3, T1 (&)[T2], int)+0x1e0 in /home/gshegalov/gits/NVIDIA/spark-rapids-jni/target/libcudf/cmake-build/_deps/cccl-src/cub/cub/block/block_load.cuh:521
=========     Device Frame:void cub::CUB_200500_890_NS::LoadDirectWarpStriped<int, int, (int)15, cub::CUB_200500_890_NS::CacheModifiedInputIterator<(cub::CUB_200500_890_NS::CacheLoadModifier)0, int, int>>(int, T4, T1 (&)[T3], int, T2)+0x1e0 in /home/gshegalov/gits/NVIDIA/spark-rapids-jni/target/libcudf/cmake-build/_deps/cccl-src/cub/cub/block/block_load.cuh:574
=========     Device Frame:void cub::CUB_200500_890_NS::BlockLoad<int, (int)128, (int)15, (cub::CUB_200500_890_NS::BlockLoadAlgorithm)4, (int)1, (int)1, (int)0>::LoadInternal<(cub::CUB_200500_890_NS::BlockLoadAlgorithm)4, (int)0>::Load<cub::CUB_200500_890_NS::CacheModifiedInputIterator<(cub::CUB_200500_890_NS::CacheLoadModifier)0, int, int>, int>(T1, int (&)[15], int, T2)+0x1e0 in /home/gshegalov/gits/NVIDIA/spark-rapids-jni/target/libcudf/cmake-build/_deps/cccl-src/cub/cub/block/block_load.cuh:1284
=========     Device Frame:void cub::CUB_200500_890_NS::BlockLoad<int, (int)128, (int)15, (cub::CUB_200500_890_NS::BlockLoadAlgorithm)4, (int)1, (int)1, (int)0>::Load<cub::CUB_200500_890_NS::CacheModifiedInputIterator<(cub::CUB_200500_890_NS::CacheLoadModifier)0, int, int>, int>(T1, int (&)[15], int, T2)+0x1e0 in /home/gshegalov/gits/NVIDIA/spark-rapids-jni/target/libcudf/cmake-build/_deps/cccl-src/cub/cub/block/block_load.cuh:1595
=========     Device Frame:void cub::CUB_200500_890_NS::AgentScan<cub::CUB_200500_890_NS::AgentScanPolicy<(int)128, (int)15, int, (cub::CUB_200500_890_NS::BlockLoadAlgorithm)4, (cub::CUB_200500_890_NS::CacheLoadModifier)0, (cub::CUB_200500_890_NS::BlockStoreAlgorithm)4, (cub::CUB_200500_890_NS::BlockScanAlgorithm)2, cub::CUB_200500_890_NS::MemBoundScaling<(int)128, (int)15, int>, cub::CUB_200500_890_NS::detail::fixed_delay_constructor_t<(unsigned int)350, (unsigned int)450>>, int *, int *, thrust::plus<void>, int, int, int>::ConsumeTile<(bool)1>(int, int, int, cub::CUB_200500_890_NS::ScanTileState<int, (bool)1> &)+0x1e0 in /home/gshegalov/gits/NVIDIA/spark-rapids-jni/target/libcudf/cmake-build/_deps/cccl-src/cub/cub/agent/agent_scan.cuh:344
=========     Device Frame:cub::CUB_200500_890_NS::AgentScan<cub::CUB_200500_890_NS::AgentScanPolicy<(int)128, (int)15, int, (cub::CUB_200500_890_NS::BlockLoadAlgorithm)4, (cub::CUB_200500_890_NS::CacheLoadModifier)0, (cub::CUB_200500_890_NS::BlockStoreAlgorithm)4, (cub::CUB_200500_890_NS::BlockScanAlgorithm)2, cub::CUB_200500_890_NS::MemBoundScaling<(int)128, (int)15, int>, cub::CUB_200500_890_NS::detail::fixed_delay_constructor_t<(unsigned int)350, (unsigned int)450>>, int *, int *, thrust::plus<void>, int, int, int>::ConsumeRange(int, cub::CUB_200500_890_NS::ScanTileState<int, (bool)1> &, int)+0x90 in /home/gshegalov/gits/NVIDIA/spark-rapids-jni/target/libcudf/cmake-build/_deps/cccl-src/cub/cub/agent/agent_scan.cuh:419
=========     Device Frame:void cub::CUB_200500_890_NS::DeviceScanKernel<cub::CUB_200500_890_NS::DeviceScanPolicy<int, thrust::plus<void>>::Policy900, int *, int *, cub::CUB_200500_890_NS::ScanTileState<int, (bool)1>, thrust::plus<void>, cub::CUB_200500_890_NS::detail::InputValue<int, int *>, int, int>(T2, T3, T4, int, T5, T6, T7)+0x70 in /home/gshegalov/gits/NVIDIA/spark-rapids-jni/target/libcudf/cmake-build/_deps/cccl-src/cub/cub/device/dispatch/dispatch_scan.cuh:197
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x29d79f]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x39b458a]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame: [0x39f41d7]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame: [0xab3f1a]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame: [0x1a7f87b]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame: [0x1a925fb]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame: [0x1aa6b8b]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:cudf::io::parquet::detail::reader::impl::read_compressed_data() [0x1aa9b60]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:cudf::io::parquet::detail::reader::impl::setup_next_pass(cudf::io::parquet::detail::reader::impl::read_mode) [0x1a5d5bd]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:cudf::io::parquet::detail::reader::impl::handle_chunking(cudf::io::parquet::detail::reader::impl::read_mode) [0x1a628ec]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:cudf::io::parquet::detail::reader::impl::read_chunk() [0x1a2f4a7]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:cudf::io::parquet::detail::chunked_reader::read_chunk() const [0x1a270bd]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:cudf::io::chunked_parquet_reader::read_chunk() const [0x18612df]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:Java_ai_rapids_cudf_ParquetChunkedReader_readChunk [0xb838c7]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame: [0xffffffffe7b7a77e]

Unused memory warnings

=========  Unused memory in allocation 0x1940225800 of size 1,536 bytes
=========     Not written 1,004 bytes between offsets 0x4 (0x1940225804) and 0x5ff (0x1940225dff) (inclusive)
=========     65.3646% of allocation were unused.
=========     Saved host backtrace up to driver entry point at allocation time
=========     Host Frame: [0x28d166]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x39b62ac]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame: [0x39fe8c9]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:rmm::mr::cuda_async_view_memory_resource::do_allocate(unsigned long, rmm::cuda_stream_view) [0xbe10a3]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:rmm::mr::limiting_resource_adaptor<rmm::mr::device_memory_resource>::do_allocate(unsigned long, rmm::cuda_stream_view) [0xbdf94d]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame: [0xbd5874]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame: [0xbd6b11]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame: [0xa13450]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame: [0x1a7fdc4]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame: [0x1ab3589]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame: [0x1aa695f]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:cudf::io::parquet::detail::reader::impl::read_compressed_data() [0x1aa9b60]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:cudf::io::parquet::detail::reader::impl::setup_next_pass(cudf::io::parquet::detail::reader::impl::read_mode) [0x1a5d5bd]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:cudf::io::parquet::detail::reader::impl::handle_chunking(cudf::io::parquet::detail::reader::impl::read_mode) [0x1a628ec]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:cudf::io::parquet::detail::reader::impl::read_chunk() [0x1a2f4a7]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:cudf::io::parquet::detail::chunked_reader::read_chunk() const [0x1a270bd]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:cudf::io::chunked_parquet_reader::read_chunk() const [0x18612df]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame:Java_ai_rapids_cudf_ParquetChunkedReader_readChunk [0xb838c7]
=========                in /tmp/cudf8253500082090784182.so
=========     Host Frame: [0x716fd1017de6]
=========                in 
