Update gemmStridedBatchedEx! size checks #2935
88f01a4 to 0591ebd
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage Diff

Metric | master | #2935 | +/- |
---|---|---|---|
Coverage | 88.98% | 89.45% | +0.47% |
Files | 150 | 150 | |
Lines | 13078 | 13078 | |
Hits | 11637 | 11699 | +62 |
Misses | 1441 | 1379 | -62 |
CUDA.jl Benchmarks
Benchmark suite | Current: 0591ebd | Previous: f7deec6 | Ratio |
---|---|---|---|
latency/precompile | 56884307796 ns | 56924777734.5 ns | 1.00 |
latency/ttfp | 8391774083.5 ns | 8417873332.5 ns | 1.00 |
latency/import | 4506835308 ns | 4531361015 ns | 0.99 |
integration/volumerhs | 9610638 ns | 9625377 ns | 1.00 |
integration/byval/slices=1 | 146968 ns | 146827 ns | 1.00 |
integration/byval/slices=3 | 426103 ns | 425931 ns | 1.00 |
integration/byval/reference | 145020 ns | 144949 ns | 1.00 |
integration/byval/slices=2 | 286452 ns | 286317 ns | 1.00 |
integration/cudadevrt | 103608 ns | 103600 ns | 1.00 |
kernel/indexing | 14410 ns | 14225 ns | 1.01 |
kernel/indexing_checked | 15051 ns | 15087 ns | 1.00 |
kernel/occupancy | 716.9236111111111 ns | 679.8954248366013 ns | 1.05 |
kernel/launch | 2149.4444444444443 ns | 2150.3333333333335 ns | 1.00 |
kernel/rand | 15291 ns | 14810 ns | 1.03 |
array/reverse/1d | 20117.5 ns | 20182 ns | 1.00 |
array/reverse/2dL_inplace | 66853 ns | 66832.5 ns | 1.00 |
array/reverse/1dL | 70355 ns | 70358 ns | 1.00 |
array/reverse/2d | 22980 ns | 21865 ns | 1.05 |
array/reverse/1d_inplace | 11496 ns | 11480 ns | 1.00 |
array/reverse/2d_inplace | 13376 ns | 13272 ns | 1.01 |
array/reverse/2dL | 74966 ns | 73906 ns | 1.01 |
array/reverse/1dL_inplace | 66815 ns | 66817 ns | 1.00 |
array/copy | 21276 ns | 20949 ns | 1.02 |
array/iteration/findall/int | 158488.5 ns | 157295 ns | 1.01 |
array/iteration/findall/bool | 140199 ns | 139923.5 ns | 1.00 |
array/iteration/findfirst/int | 162221 ns | 161193 ns | 1.01 |
array/iteration/findfirst/bool | 162961 ns | 162272 ns | 1.00 |
array/iteration/scalar | 73293 ns | 73738 ns | 0.99 |
array/iteration/logical | 218788 ns | 214452.5 ns | 1.02 |
array/iteration/findmin/1d | 53667 ns | 50889.5 ns | 1.05 |
array/iteration/findmin/2d | 97313 ns | 96643 ns | 1.01 |
array/reductions/reduce/Int64/1d | 44330.5 ns | 43989 ns | 1.01 |
array/reductions/reduce/Int64/dims=1 | 45198 ns | 44879 ns | 1.01 |
array/reductions/reduce/Int64/dims=2 | 62058.5 ns | 61825 ns | 1.00 |
array/reductions/reduce/Int64/dims=1L | 89454 ns | 89232 ns | 1.00 |
array/reductions/reduce/Int64/dims=2L | 88685 ns | 88384 ns | 1.00 |
array/reductions/reduce/Float32/1d | 38994 ns | 37163 ns | 1.05 |
array/reductions/reduce/Float32/dims=1 | 43425.5 ns | 47666 ns | 0.91 |
array/reductions/reduce/Float32/dims=2 | 60596 ns | 59848 ns | 1.01 |
array/reductions/reduce/Float32/dims=1L | 53007 ns | 52408 ns | 1.01 |
array/reductions/reduce/Float32/dims=2L | 73475 ns | 72122.5 ns | 1.02 |
array/reductions/mapreduce/Int64/1d | 44846 ns | 43666 ns | 1.03 |
array/reductions/mapreduce/Int64/dims=1 | 48412.5 ns | 47028 ns | 1.03 |
array/reductions/mapreduce/Int64/dims=2 | 62230 ns | 61661 ns | 1.01 |
array/reductions/mapreduce/Int64/dims=1L | 89379 ns | 88863 ns | 1.01 |
array/reductions/mapreduce/Int64/dims=2L | 89365 ns | 88192 ns | 1.01 |
array/reductions/mapreduce/Float32/1d | 38872.5 ns | 37065 ns | 1.05 |
array/reductions/mapreduce/Float32/dims=1 | 46520 ns | 42446.5 ns | 1.10 |
array/reductions/mapreduce/Float32/dims=2 | 60533 ns | 60229 ns | 1.01 |
array/reductions/mapreduce/Float32/dims=1L | 53350 ns | 52761 ns | 1.01 |
array/reductions/mapreduce/Float32/dims=2L | 73581 ns | 72561 ns | 1.01 |
array/broadcast | 20586 ns | 20011 ns | 1.03 |
array/copyto!/gpu_to_gpu | 13319 ns | 11355.5 ns | 1.17 |
array/copyto!/cpu_to_gpu | 216698 ns | 216192 ns | 1.00 |
array/copyto!/gpu_to_cpu | 284398 ns | 283975.5 ns | 1.00 |
array/accumulate/Int64/1d | 125218 ns | 125034.5 ns | 1.00 |
array/accumulate/Int64/dims=1 | 84241 ns | 83398 ns | 1.01 |
array/accumulate/Int64/dims=2 | 158674 ns | 157817 ns | 1.01 |
array/accumulate/Int64/dims=1L | 1709855.5 ns | 1708490 ns | 1.00 |
array/accumulate/Int64/dims=2L | 966521 ns | 966251 ns | 1.00 |
array/accumulate/Float32/1d | 109885 ns | 109114 ns | 1.01 |
array/accumulate/Float32/dims=1 | 80855 ns | 80351 ns | 1.01 |
array/accumulate/Float32/dims=2 | 148692 ns | 147295.5 ns | 1.01 |
array/accumulate/Float32/dims=1L | 1618915 ns | 1618020.5 ns | 1.00 |
array/accumulate/Float32/dims=2L | 698676.5 ns | 698067 ns | 1.00 |
array/construct | 1313.4 ns | 1296.7 ns | 1.01 |
array/random/randn/Float32 | 45969.5 ns | 48838.5 ns | 0.94 |
array/random/randn!/Float32 | 25242 ns | 24912 ns | 1.01 |
array/random/rand!/Int64 | 27490 ns | 27275 ns | 1.01 |
array/random/rand!/Float32 | 8972.333333333334 ns | 8805.333333333334 ns | 1.02 |
array/random/rand/Int64 | 30129 ns | 30044 ns | 1.00 |
array/random/rand/Float32 | 13306 ns | 13354 ns | 1.00 |
array/permutedims/4d | 60499 ns | 60446 ns | 1.00 |
array/permutedims/2d | 54612 ns | 54105.5 ns | 1.01 |
array/permutedims/3d | 55720 ns | 54893 ns | 1.02 |
array/sorting/1d | 2759021 ns | 2756483.5 ns | 1.00 |
array/sorting/by | 3345765 ns | 3368977 ns | 0.99 |
array/sorting/2d | 1082960 ns | 1088064.5 ns | 1.00 |
cuda/synchronization/stream/auto | 1015.8 ns | 1030.1 ns | 0.99 |
cuda/synchronization/stream/nonblocking | 7615 ns | 7504.6 ns | 1.01 |
cuda/synchronization/stream/blocking | 810.054347826087 ns | 801.2842105263157 ns | 1.01 |
cuda/synchronization/context/auto | 1178.7 ns | 1179.3 ns | 1.00 |
cuda/synchronization/context/nonblocking | 7358.2 ns | 7293.5 ns | 1.01 |
cuda/synchronization/context/blocking | 909.88 ns | 909.9636363636364 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
This changes the size checks in `gemmStridedBatchedEx!` to match the ones in `gemm_strided_batched!`. Now this works