AoS / SoA Copy Benchmarks, main branch (2024.09.27.) #297

Open
wants to merge 4 commits into
base: main

krasznaa
Member

While working on acts-project/traccc#712 yesterday, I was surprised to see how expensive it apparently is to copy a few megabytes of cell information from one host location to another. What I saw in Nsight Systems was that the vecmem::edm::host -> vecmem::edm::buffer host-to-host copy took about as long as the subsequent vecmem::edm::buffer -> vecmem::edm::buffer host-to-device copy.

As a reminder, the host-to-host step is there so that the entire payload of an SoA container can be copied in one step, instead of column by column.

But as it turns out, the overhead of copying a cell collection in 5 steps instead of one (a traccc cell has only 5 variables) is negligible compared to how long it takes to copy a few megabytes from one place to another in host memory. 😕
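
To make the two strategies concrete, here is a minimal sketch in plain C++. It does not use the vecmem API: `soa_host`, `pack` and `copy_blob` are hypothetical names, and the five generic columns only stand in for a 5-variable container (they are not the traccc cell definition). `pack` plays the role of the host-to-host staging step (one `memcpy` per column), and `copy_blob` the single-step copy of the staged payload.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical SoA container with 5 columns of equal length.
// NOT the traccc cell definition, just a stand-in with 5 columns.
struct soa_host {
    std::vector<std::uint32_t> col0;
    std::vector<std::uint32_t> col1;
    std::vector<float> col2;
    std::vector<float> col3;
    std::vector<std::uint32_t> col4;
};

// Copy one column to dst and return the number of bytes written.
template <typename T>
std::size_t append_column(const std::vector<T>& col, unsigned char* dst) {
    const std::size_t bytes = col.size() * sizeof(T);
    if (bytes != 0) {
        std::memcpy(dst, col.data(), bytes);
    }
    return bytes;
}

// Total payload size of all 5 columns, in bytes.
std::size_t payload_bytes(const soa_host& in) {
    return in.col0.size() * sizeof(std::uint32_t) +
           in.col1.size() * sizeof(std::uint32_t) +
           in.col2.size() * sizeof(float) +
           in.col3.size() * sizeof(float) +
           in.col4.size() * sizeof(std::uint32_t);
}

// Host-to-host staging step: 5 memcpy calls into one contiguous blob.
std::vector<unsigned char> pack(const soa_host& in) {
    std::vector<unsigned char> blob(payload_bytes(in));
    unsigned char* p = blob.data();
    p += append_column(in.col0, p);
    p += append_column(in.col1, p);
    p += append_column(in.col2, p);
    p += append_column(in.col3, p);
    p += append_column(in.col4, p);
    return blob;
}

// Single-step copy of the staged payload: one memcpy call.
std::vector<unsigned char> copy_blob(const std::vector<unsigned char>& src) {
    std::vector<unsigned char> dst(src.size());
    if (!src.empty()) {
        std::memcpy(dst.data(), src.data(), src.size());
    }
    return dst;
}
```

In this sketch the staging step moves exactly as many bytes as the single-step copy that follows it, so for payloads of a few megabytes the number of `memcpy` calls matters much less than the number of bytes moved.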

So in this PR I want to see exactly how copying the same sort of EDM, once in AoS and once in SoA form, compares. Right now, with only the host copies in place, I get:

[bash][pcadp04]:vecmem > ./build/bin/vecmem_benchmark_core --benchmark_filter="AoS|SoA"
2024-09-27T11:02:20+02:00
Running ./build/bin/vecmem_benchmark_core
Run on (48 X 1796.81 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x24)
  L1 Instruction 32 KiB (x24)
  L2 Unified 512 KiB (x24)
  L3 Unified 32768 KiB (x4)
Load Average: 0.06, 0.07, 0.21
---------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
simpleSoADirectHostToFixedBufferCopy/1               47.0 ns         46.9 ns     14915255 Bytes=16 Rate=325.124M/s
simpleSoADirectHostToFixedBufferCopy/8               44.8 ns         44.7 ns     15652590 Bytes=128 Rate=2.66556G/s
simpleSoADirectHostToFixedBufferCopy/64              48.8 ns         48.7 ns     14353059 Bytes=1024 Rate=19.5998G/s
simpleSoADirectHostToFixedBufferCopy/512              113 ns          113 ns      6180930 Bytes=8k Rate=67.4712G/s
simpleSoADirectHostToFixedBufferCopy/4096            1237 ns         1235 ns       566747 Bytes=64k Rate=49.4234G/s
simpleSoADirectHostToFixedBufferCopy/32768          15540 ns        15496 ns        45228 Bytes=512k Rate=31.5094G/s
simpleSoADirectHostToFixedBufferCopy/262144        124628 ns       124264 ns         5585 Bytes=4M Rate=31.435G/s
simpleSoADirectHostToFixedBufferCopy/2097152      1440522 ns      1434445 ns          487 Bytes=32M Rate=21.7854G/s
simpleSoADirectHostToFixedBufferCopy/16777216     9900081 ns      9868170 ns           62 Bytes=256M Rate=25.334G/s
simpleSoADirectHostToFixedBufferCopy/67108864    44172606 ns     44038170 ns           14 Bytes=1024M Rate=22.7076G/s
simpleSoAOptimalHostToFixedBufferCopy/1              25.4 ns         25.4 ns     27560671 Bytes=16 Rate=600.753M/s
simpleSoAOptimalHostToFixedBufferCopy/8              25.5 ns         25.4 ns     27556719 Bytes=128 Rate=4.69248G/s
simpleSoAOptimalHostToFixedBufferCopy/64             31.3 ns         31.3 ns     22272491 Bytes=1024 Rate=30.4965G/s
simpleSoAOptimalHostToFixedBufferCopy/512            90.7 ns         90.5 ns      7724358 Bytes=8k Rate=84.263G/s
simpleSoAOptimalHostToFixedBufferCopy/4096           1212 ns         1210 ns       579650 Bytes=64k Rate=50.4524G/s
simpleSoAOptimalHostToFixedBufferCopy/32768         15656 ns        15610 ns        45481 Bytes=512k Rate=31.2805G/s
simpleSoAOptimalHostToFixedBufferCopy/262144       124211 ns       123845 ns         5603 Bytes=4M Rate=31.5414G/s
simpleSoAOptimalHostToFixedBufferCopy/2097152     1148184 ns      1144388 ns          601 Bytes=32M Rate=27.3072G/s
simpleSoAOptimalHostToFixedBufferCopy/16777216    9687522 ns      9655885 ns           63 Bytes=256M Rate=25.8909G/s
simpleSoAOptimalHostToFixedBufferCopy/67108864   47272263 ns     47123369 ns           13 Bytes=1024M Rate=21.2209G/s
simpleAoSHostToFixedBufferCopy/1                     19.4 ns         19.3 ns     36222530 Bytes=16 Rate=789.511M/s
simpleAoSHostToFixedBufferCopy/8                     19.4 ns         19.3 ns     36219083 Bytes=128 Rate=6.16873G/s
simpleAoSHostToFixedBufferCopy/64                    23.5 ns         23.5 ns     29830658 Bytes=1024 Rate=40.6423G/s
simpleAoSHostToFixedBufferCopy/512                   79.7 ns         79.5 ns      8805052 Bytes=8k Rate=95.9561G/s
simpleAoSHostToFixedBufferCopy/4096                  1192 ns         1190 ns       587952 Bytes=64k Rate=51.2917G/s
simpleAoSHostToFixedBufferCopy/32768                15308 ns        15264 ns        45633 Bytes=512k Rate=31.9891G/s
simpleAoSHostToFixedBufferCopy/262144              123389 ns       123034 ns         5707 Bytes=4M Rate=31.7494G/s
simpleAoSHostToFixedBufferCopy/2097152            1143912 ns      1140204 ns          599 Bytes=32M Rate=27.4074G/s
simpleAoSHostToFixedBufferCopy/16777216           9814677 ns      9783654 ns           62 Bytes=256M Rate=25.5528G/s
simpleAoSHostToFixedBufferCopy/67108864          47313324 ns     47164521 ns           13 Bytes=1024M Rate=21.2024G/s
[bash][pcadp04]:vecmem >
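
For reading the `Bytes` and `Rate` columns: the numbers above are consistent with 16 bytes per element and with binary prefixes (M = 2^20, G = 2^30), with the rate computed from the CPU time column. The sketch below reproduces that arithmetic; `payload_bytes` and `rate_gib_per_s` are hypothetical helper names, and the binary-prefix interpretation is an inference from the printed values, not something taken from the benchmark source.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Payload of one benchmark row: number of elements times element size.
// The 16-byte element size is read off the "/1" rows (Bytes=16).
std::size_t payload_bytes(std::size_t n_elements, std::size_t element_bytes) {
    return n_elements * element_bytes;
}

// Copy rate in GiB/s (assumed binary prefix) from a payload size and
// the CPU time of one iteration in nanoseconds.
double rate_gib_per_s(std::size_t bytes, double cpu_time_ns) {
    const double bytes_per_s = static_cast<double>(bytes) / (cpu_time_ns * 1e-9);
    return bytes_per_s / (1024.0 * 1024.0 * 1024.0);
}
```

For example, the `simpleSoADirectHostToFixedBufferCopy/67108864` row has 67108864 * 16 = 1073741824 bytes (1024M) and a CPU time of 44038170 ns, which gives about 22.7 G/s, matching the printed `Rate`.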

Many aspects of these results I believe I understand. But I'm really not sure why the copy speed drops as it does for large sizes, from about 31 G/s at 4M to roughly 21-27 G/s from 32M upwards. 😕
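
One hypothesis that may be worth checking (it is not confirmed by anything measured here) is that the steps in the rate follow the cache hierarchy printed in the benchmark header. The sketch below assumes that a copy's working set is the source plus the destination, so twice the payload, and that a single thread only benefits from one 32768 KiB L3 slice. Both are assumptions, as are the names `level` and `smallest_fitting_level`.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t KiB = 1024;
constexpr std::size_t MiB = 1024 * KiB;

// Cache sizes as printed in the benchmark header above.
constexpr std::size_t l1_data = 32 * KiB;
constexpr std::size_t l2_unified = 512 * KiB;
constexpr std::size_t l3_unified = 32768 * KiB;

enum class level { L1, L2, L3, DRAM };

// Smallest memory level that holds source + destination of a copy.
// Assumption: working set = 2 x payload; prefetching, TLB effects and
// page faults are ignored entirely.
inline level smallest_fitting_level(std::size_t payload) {
    const std::size_t working_set = 2 * payload;
    if (working_set <= l1_data) return level::L1;
    if (working_set <= l2_unified) return level::L2;
    if (working_set <= l3_unified) return level::L3;
    return level::DRAM;
}
```

Under these assumptions the 8k rows fit in L1, the 64k rows in L2, the 512k and 4M rows in L3, and everything from 32M upwards spills to main memory, which lines up with where the printed rates change. Other effects could just as well contribute, so this is only a starting point for the investigation.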

In any case, I plan to continue the investigation...

@krasznaa krasznaa added the tests This issue or pull request is related to the test suite label Sep 27, 2024
@krasznaa krasznaa force-pushed the SoACopyBenchmarking-main-20240927 branch from b22d293 to f828c4b Compare September 27, 2024 15:51
So that it would be easier to set up the CUDA, HIP and SYCL
tests as a next step.
Simply copying the current CUDA benchmark code, with
all its imperfections.
@krasznaa krasznaa force-pushed the SoACopyBenchmarking-main-20240927 branch from f828c4b to cbc49a0 Compare October 1, 2024 13:52
@krasznaa krasznaa marked this pull request as ready for review October 1, 2024 13:54