AoS / SoA Copy Benchmarks, main branch (2024.09.27.) #297

krasznaa · 2024-09-27T09:06:14Z

While working on acts-project/traccc#712 yesterday, I was surprised to see how expensive it apparently is to copy a few megabytes of cell information from one host location to another. In Nsight Systems, the vecmem::edm::host -> vecmem::edm::buffer host-to-host copy took about as long as the subsequent vecmem::edm::buffer -> vecmem::edm::buffer host-to-device copy.

As a reminder, the host-to-host step is meant to make it possible to copy the entire payload of an SoA container in one step, instead of copying it column by column.

But as it turns out, the overhead of copying a cell collection in 5 steps instead of one (a traccc cell has only 5 variables) is negligible compared to how long it takes to copy a few megabytes from one place to another in host memory. 😕
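To make the "one step versus five steps" comparison concrete, here is a minimal sketch using plain std::vector and std::memcpy. It is not the vecmem API and not the benchmark code from this PR; the `cell_aos` / `cells_soa` types, their field names and the `copy_*` helpers are all hypothetical stand-ins for a 5-variable element.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical 5-variable element (AoS layout). Field names are made up.
struct cell_aos {
    std::uint32_t v0;
    std::uint32_t v1;
    float v2;
    float v3;
    std::uint32_t v4;
};

// The same 5 variables, stored as one column per variable (SoA layout).
struct cells_soa {
    std::vector<std::uint32_t> v0;
    std::vector<std::uint32_t> v1;
    std::vector<float> v2;
    std::vector<float> v3;
    std::vector<std::uint32_t> v4;
};

// Copy one contiguous block with a single memcpy; returns the bytes copied.
template <typename T>
std::size_t copy_block(const std::vector<T>& src, std::vector<T>& dst) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy requires a trivially copyable element type");
    dst.resize(src.size());
    const std::size_t bytes = src.size() * sizeof(T);
    if (bytes > 0) {
        std::memcpy(dst.data(), src.data(), bytes);
    }
    return bytes;
}

// AoS: the whole collection is one block, so one copy call is enough.
inline std::size_t copy_aos(const std::vector<cell_aos>& src,
                            std::vector<cell_aos>& dst) {
    return copy_block(src, dst);
}

// SoA, column by column: five copy calls, one per variable.
inline std::size_t copy_soa(const cells_soa& src, cells_soa& dst) {
    assert(src.v1.size() == src.v0.size() && src.v2.size() == src.v0.size() &&
           src.v3.size() == src.v0.size() && src.v4.size() == src.v0.size());
    return copy_block(src.v0, dst.v0) + copy_block(src.v1, dst.v1) +
           copy_block(src.v2, dst.v2) + copy_block(src.v3, dst.v3) +
           copy_block(src.v4, dst.v4);
}
```

Both functions move the same number of payload bytes; the only difference is the number of copy calls, which is the overhead referred to above.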

So in this PR I want to see exactly how copying the same sort of EDM compares in AoS and in SoA form. Right now, with only the host copy benchmarks in place, I get:

[bash][pcadp04]:vecmem > ./build/bin/vecmem_benchmark_core --benchmark_filter="AoS|SoA"
2024-09-27T11:02:20+02:00
Running ./build/bin/vecmem_benchmark_core
Run on (48 X 1796.81 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x24)
  L1 Instruction 32 KiB (x24)
  L2 Unified 512 KiB (x24)
  L3 Unified 32768 KiB (x4)
Load Average: 0.06, 0.07, 0.21
---------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
simpleSoADirectHostToFixedBufferCopy/1               47.0 ns         46.9 ns     14915255 Bytes=16 Rate=325.124M/s
simpleSoADirectHostToFixedBufferCopy/8               44.8 ns         44.7 ns     15652590 Bytes=128 Rate=2.66556G/s
simpleSoADirectHostToFixedBufferCopy/64              48.8 ns         48.7 ns     14353059 Bytes=1024 Rate=19.5998G/s
simpleSoADirectHostToFixedBufferCopy/512              113 ns          113 ns      6180930 Bytes=8k Rate=67.4712G/s
simpleSoADirectHostToFixedBufferCopy/4096            1237 ns         1235 ns       566747 Bytes=64k Rate=49.4234G/s
simpleSoADirectHostToFixedBufferCopy/32768          15540 ns        15496 ns        45228 Bytes=512k Rate=31.5094G/s
simpleSoADirectHostToFixedBufferCopy/262144        124628 ns       124264 ns         5585 Bytes=4M Rate=31.435G/s
simpleSoADirectHostToFixedBufferCopy/2097152      1440522 ns      1434445 ns          487 Bytes=32M Rate=21.7854G/s
simpleSoADirectHostToFixedBufferCopy/16777216     9900081 ns      9868170 ns           62 Bytes=256M Rate=25.334G/s
simpleSoADirectHostToFixedBufferCopy/67108864    44172606 ns     44038170 ns           14 Bytes=1024M Rate=22.7076G/s
simpleSoAOptimalHostToFixedBufferCopy/1              25.4 ns         25.4 ns     27560671 Bytes=16 Rate=600.753M/s
simpleSoAOptimalHostToFixedBufferCopy/8              25.5 ns         25.4 ns     27556719 Bytes=128 Rate=4.69248G/s
simpleSoAOptimalHostToFixedBufferCopy/64             31.3 ns         31.3 ns     22272491 Bytes=1024 Rate=30.4965G/s
simpleSoAOptimalHostToFixedBufferCopy/512            90.7 ns         90.5 ns      7724358 Bytes=8k Rate=84.263G/s
simpleSoAOptimalHostToFixedBufferCopy/4096           1212 ns         1210 ns       579650 Bytes=64k Rate=50.4524G/s
simpleSoAOptimalHostToFixedBufferCopy/32768         15656 ns        15610 ns        45481 Bytes=512k Rate=31.2805G/s
simpleSoAOptimalHostToFixedBufferCopy/262144       124211 ns       123845 ns         5603 Bytes=4M Rate=31.5414G/s
simpleSoAOptimalHostToFixedBufferCopy/2097152     1148184 ns      1144388 ns          601 Bytes=32M Rate=27.3072G/s
simpleSoAOptimalHostToFixedBufferCopy/16777216    9687522 ns      9655885 ns           63 Bytes=256M Rate=25.8909G/s
simpleSoAOptimalHostToFixedBufferCopy/67108864   47272263 ns     47123369 ns           13 Bytes=1024M Rate=21.2209G/s
simpleAoSHostToFixedBufferCopy/1                     19.4 ns         19.3 ns     36222530 Bytes=16 Rate=789.511M/s
simpleAoSHostToFixedBufferCopy/8                     19.4 ns         19.3 ns     36219083 Bytes=128 Rate=6.16873G/s
simpleAoSHostToFixedBufferCopy/64                    23.5 ns         23.5 ns     29830658 Bytes=1024 Rate=40.6423G/s
simpleAoSHostToFixedBufferCopy/512                   79.7 ns         79.5 ns      8805052 Bytes=8k Rate=95.9561G/s
simpleAoSHostToFixedBufferCopy/4096                  1192 ns         1190 ns       587952 Bytes=64k Rate=51.2917G/s
simpleAoSHostToFixedBufferCopy/32768                15308 ns        15264 ns        45633 Bytes=512k Rate=31.9891G/s
simpleAoSHostToFixedBufferCopy/262144              123389 ns       123034 ns         5707 Bytes=4M Rate=31.7494G/s
simpleAoSHostToFixedBufferCopy/2097152            1143912 ns      1140204 ns          599 Bytes=32M Rate=27.4074G/s
simpleAoSHostToFixedBufferCopy/16777216           9814677 ns      9783654 ns           62 Bytes=256M Rate=25.5528G/s
simpleAoSHostToFixedBufferCopy/67108864          47313324 ns     47164521 ns           13 Bytes=1024M Rate=21.2024G/s
[bash][pcadp04]:vecmem >

I believe I understand many aspects of these results, but I'm really not sure why the copy speed drops the way it does for large sizes. 😕
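For reading the Rate column: the printed values are consistent with the Bytes counter divided by the CPU time, expressed with 1024-based prefixes. This is inferred from the numbers in the table, not taken from the benchmark source, so treat it as an assumption. A small sketch of that arithmetic, with a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: copy rate in GiB/s (1024-based), computed from the
// number of bytes copied and the CPU time of one iteration in nanoseconds.
inline double rate_gib_per_s(std::uint64_t bytes, double cpu_ns) {
    assert(cpu_ns > 0.0);
    // Convert nanoseconds to seconds, then bytes/s to GiB/s.
    const double bytes_per_s = static_cast<double>(bytes) / (cpu_ns * 1e-9);
    return bytes_per_s / (1024.0 * 1024.0 * 1024.0);
}
```

For example, the simpleSoADirectHostToFixedBufferCopy/262144 row copies 4 MiB in 124264 ns of CPU time, which this formula turns into about 31.4 GiB/s, in line with the rate printed for that row.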

In any case, I plan to continue the investigation...

krasznaa added the tests label Sep 27, 2024

krasznaa requested review from paulgessinger, stephenswat and beomki-yeo September 27, 2024 09:06

krasznaa force-pushed the SoACopyBenchmarking-main-20240927 branch from b22d293 to f828c4b Compare September 27, 2024 15:51

krasznaa added 4 commits October 1, 2024 15:31

Introduced some simple AoS / SoA copy benchmarks.

fdf74a6

Refactored the AoS and SoA copy tests.

fa5b4f1

So that it would be easier to set up the CUDA, HIP and SYCL tests as a next step.

Introduced EDM copy benchmarks for CUDA and SYCL as well.

02f8a25

Introduced benchmarks for HIP.

cbc49a0

Simply copying the current CUDA benchmark code, with all its imperfections.

krasznaa force-pushed the SoACopyBenchmarking-main-20240927 branch from f828c4b to cbc49a0 Compare October 1, 2024 13:52

krasznaa marked this pull request as ready for review October 1, 2024 13:54
