Currently a reduction across columns (as opposed to rows) is slow, since it naively takes in a transpose operator and indexes it with a random-access iterator. This causes adjacent threads to make strided accesses with a stride equal to the second dimension.

We experimented with an `einsum` transpose followed by a reduction, and measured the results on an A30 (benchmark table not captured here; the two variants compared were `permute` + `sum` and `einsum` + `sum`).

The `einsum` version is quite a bit faster since it hits SoL (speed of light, i.e. peak memory bandwidth) on the transpose. However, there is still room for improvement: a kernel aware of both the transpose and the reduction can be faster by tiling. This issue is to implement that kernel and compare performance.