@lpawela lpawela commented Oct 10, 2025

This changes the size checks in gemmStridedBatchedEx! to match the ones in gemm_strided_batched!. With this change, the following works:

using CUDA

N = 3
nbatch = 4

A = CUDA.rand(N, N, nbatch)
B = CUDA.rand(N, N)
C = CUDA.zeros(N, N, nbatch)
CUDA.CUBLAS.gemmStridedBatchedEx!('N', 'N', 1, A, reshape(B, size(B)..., 1), 0, C)
all(A[:, :, i] * B ≈ C[:, :, i] for i=1:nbatch)
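
For readers without a GPU, the behavior exercised above can be sketched on the CPU. This is only an illustration under assumptions, not CUDA.jl code: the helper name `batched_mul_reference` is hypothetical, and the rule it encodes (an input whose third dimension has size 1 is reused for every batch) is my reading of what the relaxed size check permits.

```julia
using LinearAlgebra

# Hypothetical CPU reference, not part of CUDA.jl: multiplies batches of
# matrices, reusing a single matrix when an input's batch dimension is 1.
function batched_mul_reference(A::AbstractArray{T,3}, B::AbstractArray{T,3}) where {T}
    nbatch = max(size(A, 3), size(B, 3))
    # Assumed rule: each input has either nbatch slices or exactly one slice.
    size(A, 3) in (1, nbatch) || throw(DimensionMismatch("batch size mismatch: A"))
    size(B, 3) in (1, nbatch) || throw(DimensionMismatch("batch size mismatch: B"))
    size(A, 2) == size(B, 1) || throw(DimensionMismatch("inner dimensions differ"))

    C = zeros(T, size(A, 1), size(B, 2), nbatch)
    for i in 1:nbatch
        # Pick slice 1 for a size-1 batch dimension, slice i otherwise.
        Ai = view(A, :, :, size(A, 3) == 1 ? 1 : i)
        Bi = view(B, :, :, size(B, 3) == 1 ? 1 : i)
        mul!(view(C, :, :, i), Ai, Bi)
    end
    return C
end

N, nbatch = 3, 4
A = rand(Float32, N, N, nbatch)
B = rand(Float32, N, N)
C = batched_mul_reference(A, reshape(B, size(B)..., 1))
ok = all(A[:, :, i] * B ≈ C[:, :, i] for i in 1:nbatch)
```

Here `ok` plays the same role as the final `all(...)` check in the GPU snippet: every slice of `C` should match the product of the corresponding slice of `A` with the single matrix `B`.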

codecov bot commented Oct 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.45%. Comparing base (f7deec6) to head (0591ebd).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2935      +/-   ##
==========================================
+ Coverage   88.98%   89.45%   +0.47%     
==========================================
  Files         150      150              
  Lines       13078    13078              
==========================================
+ Hits        11637    11699      +62     
+ Misses       1441     1379      -62     

@github-actions github-actions bot left a comment

CUDA.jl Benchmarks

Benchmark suite Current: 0591ebd Previous: f7deec6 Ratio
latency/precompile 56884307796 ns 56924777734.5 ns 1.00
latency/ttfp 8391774083.5 ns 8417873332.5 ns 1.00
latency/import 4506835308 ns 4531361015 ns 0.99
integration/volumerhs 9610638 ns 9625377 ns 1.00
integration/byval/slices=1 146968 ns 146827 ns 1.00
integration/byval/slices=3 426103 ns 425931 ns 1.00
integration/byval/reference 145020 ns 144949 ns 1.00
integration/byval/slices=2 286452 ns 286317 ns 1.00
integration/cudadevrt 103608 ns 103600 ns 1.00
kernel/indexing 14410 ns 14225 ns 1.01
kernel/indexing_checked 15051 ns 15087 ns 1.00
kernel/occupancy 716.9236111111111 ns 679.8954248366013 ns 1.05
kernel/launch 2149.4444444444443 ns 2150.3333333333335 ns 1.00
kernel/rand 15291 ns 14810 ns 1.03
array/reverse/1d 20117.5 ns 20182 ns 1.00
array/reverse/2dL_inplace 66853 ns 66832.5 ns 1.00
array/reverse/1dL 70355 ns 70358 ns 1.00
array/reverse/2d 22980 ns 21865 ns 1.05
array/reverse/1d_inplace 11496 ns 11480 ns 1.00
array/reverse/2d_inplace 13376 ns 13272 ns 1.01
array/reverse/2dL 74966 ns 73906 ns 1.01
array/reverse/1dL_inplace 66815 ns 66817 ns 1.00
array/copy 21276 ns 20949 ns 1.02
array/iteration/findall/int 158488.5 ns 157295 ns 1.01
array/iteration/findall/bool 140199 ns 139923.5 ns 1.00
array/iteration/findfirst/int 162221 ns 161193 ns 1.01
array/iteration/findfirst/bool 162961 ns 162272 ns 1.00
array/iteration/scalar 73293 ns 73738 ns 0.99
array/iteration/logical 218788 ns 214452.5 ns 1.02
array/iteration/findmin/1d 53667 ns 50889.5 ns 1.05
array/iteration/findmin/2d 97313 ns 96643 ns 1.01
array/reductions/reduce/Int64/1d 44330.5 ns 43989 ns 1.01
array/reductions/reduce/Int64/dims=1 45198 ns 44879 ns 1.01
array/reductions/reduce/Int64/dims=2 62058.5 ns 61825 ns 1.00
array/reductions/reduce/Int64/dims=1L 89454 ns 89232 ns 1.00
array/reductions/reduce/Int64/dims=2L 88685 ns 88384 ns 1.00
array/reductions/reduce/Float32/1d 38994 ns 37163 ns 1.05
array/reductions/reduce/Float32/dims=1 43425.5 ns 47666 ns 0.91
array/reductions/reduce/Float32/dims=2 60596 ns 59848 ns 1.01
array/reductions/reduce/Float32/dims=1L 53007 ns 52408 ns 1.01
array/reductions/reduce/Float32/dims=2L 73475 ns 72122.5 ns 1.02
array/reductions/mapreduce/Int64/1d 44846 ns 43666 ns 1.03
array/reductions/mapreduce/Int64/dims=1 48412.5 ns 47028 ns 1.03
array/reductions/mapreduce/Int64/dims=2 62230 ns 61661 ns 1.01
array/reductions/mapreduce/Int64/dims=1L 89379 ns 88863 ns 1.01
array/reductions/mapreduce/Int64/dims=2L 89365 ns 88192 ns 1.01
array/reductions/mapreduce/Float32/1d 38872.5 ns 37065 ns 1.05
array/reductions/mapreduce/Float32/dims=1 46520 ns 42446.5 ns 1.10
array/reductions/mapreduce/Float32/dims=2 60533 ns 60229 ns 1.01
array/reductions/mapreduce/Float32/dims=1L 53350 ns 52761 ns 1.01
array/reductions/mapreduce/Float32/dims=2L 73581 ns 72561 ns 1.01
array/broadcast 20586 ns 20011 ns 1.03
array/copyto!/gpu_to_gpu 13319 ns 11355.5 ns 1.17
array/copyto!/cpu_to_gpu 216698 ns 216192 ns 1.00
array/copyto!/gpu_to_cpu 284398 ns 283975.5 ns 1.00
array/accumulate/Int64/1d 125218 ns 125034.5 ns 1.00
array/accumulate/Int64/dims=1 84241 ns 83398 ns 1.01
array/accumulate/Int64/dims=2 158674 ns 157817 ns 1.01
array/accumulate/Int64/dims=1L 1709855.5 ns 1708490 ns 1.00
array/accumulate/Int64/dims=2L 966521 ns 966251 ns 1.00
array/accumulate/Float32/1d 109885 ns 109114 ns 1.01
array/accumulate/Float32/dims=1 80855 ns 80351 ns 1.01
array/accumulate/Float32/dims=2 148692 ns 147295.5 ns 1.01
array/accumulate/Float32/dims=1L 1618915 ns 1618020.5 ns 1.00
array/accumulate/Float32/dims=2L 698676.5 ns 698067 ns 1.00
array/construct 1313.4 ns 1296.7 ns 1.01
array/random/randn/Float32 45969.5 ns 48838.5 ns 0.94
array/random/randn!/Float32 25242 ns 24912 ns 1.01
array/random/rand!/Int64 27490 ns 27275 ns 1.01
array/random/rand!/Float32 8972.333333333334 ns 8805.333333333334 ns 1.02
array/random/rand/Int64 30129 ns 30044 ns 1.00
array/random/rand/Float32 13306 ns 13354 ns 1.00
array/permutedims/4d 60499 ns 60446 ns 1.00
array/permutedims/2d 54612 ns 54105.5 ns 1.01
array/permutedims/3d 55720 ns 54893 ns 1.02
array/sorting/1d 2759021 ns 2756483.5 ns 1.00
array/sorting/by 3345765 ns 3368977 ns 0.99
array/sorting/2d 1082960 ns 1088064.5 ns 1.00
cuda/synchronization/stream/auto 1015.8 ns 1030.1 ns 0.99
cuda/synchronization/stream/nonblocking 7615 ns 7504.6 ns 1.01
cuda/synchronization/stream/blocking 810.054347826087 ns 801.2842105263157 ns 1.01
cuda/synchronization/context/auto 1178.7 ns 1179.3 ns 1.00
cuda/synchronization/context/nonblocking 7358.2 ns 7293.5 ns 1.01
cuda/synchronization/context/blocking 909.88 ns 909.9636363636364 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.