Conversation
run benchmark take_kernels

run benchmark interleave_kernels

🤖: Benchmark completed (Details)
Hmm, this actually looks worse on this machine (probably x86 vs. ARM)...

🤖: Benchmark completed (Details)
run benchmark take_kernels

🤖: Benchmark completed (Details)
run benchmark interleave_kernels

🤖: Benchmark completed (Details)
@comphead @andygrove FYI, as you seem to be interested in sort performance (which is one of the consumers of these kernels).

@mbutrovich as well :)
I'm curious how SIMD gather intrinsics (AVX2, SVE) perform vs. manual unrolling on modern architectures; the former likely ends up as similar μops anyway. I don't love specializing the code like this in either direction (SIMD intrinsics or unrolling, especially with ARM increasingly prevalent in the cloud), but if it's faster today I'm okay with it. It's also a pretty well-understood technique at this point.
```rust
let chunks = indices.chunks_exact(8);
let remainder = chunks.remainder();
for chunk in chunks {
    let v0 = arrays[chunk[0].0].value(chunk[0].1);
```
Thinking aloud: is 8 a constant? Does it depend on the number of CPUs/threads, etc.?
It's just a loop-unrolling constant: the higher it is, the more code is generated inside the loop, so it can hide more latency by running/pipelining the instructions that don't depend on the loads.
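For illustration, here is a self-contained sketch of that unrolling pattern. The `gather_unrolled` helper and the plain `Vec<i32>` inputs are hypothetical stand-ins for the real Arrow arrays; the point is the eight independent loads per chunk:

```rust
/// Hypothetical stand-in for the interleave kernel: `indices` is a list of
/// (array index, row index) pairs. Unroll factor 8 gives the CPU eight
/// independent loads per iteration to overlap/pipeline.
fn gather_unrolled(arrays: &[Vec<i32>], indices: &[(usize, usize)]) -> Vec<i32> {
    let mut out = Vec::with_capacity(indices.len());
    let chunks = indices.chunks_exact(8);
    let remainder = chunks.remainder();
    for chunk in chunks {
        // Eight independent loads: none depends on another, so the CPU can
        // issue them concurrently instead of waiting on each one in turn.
        let v0 = arrays[chunk[0].0][chunk[0].1];
        let v1 = arrays[chunk[1].0][chunk[1].1];
        let v2 = arrays[chunk[2].0][chunk[2].1];
        let v3 = arrays[chunk[3].0][chunk[3].1];
        let v4 = arrays[chunk[4].0][chunk[4].1];
        let v5 = arrays[chunk[5].0][chunk[5].1];
        let v6 = arrays[chunk[6].0][chunk[6].1];
        let v7 = arrays[chunk[7].0][chunk[7].1];
        out.extend_from_slice(&[v0, v1, v2, v3, v4, v5, v6, v7]);
    }
    // Tail: fewer than 8 indices left, handled one at a time.
    for idx in remainder {
        out.push(arrays[idx.0][idx.1]);
    }
    out
}

fn main() {
    let arrays = vec![vec![10, 11, 12], vec![20, 21, 22]];
    let indices: Vec<(usize, usize)> = (0..10).map(|i| (i % 2, i % 3)).collect();
    let out = gather_unrolled(&arrays, &indices);
    assert_eq!(out, vec![10, 21, 12, 20, 11, 22, 10, 21, 12, 20]);
    println!("{:?}", out);
}
```

Note that `chunks_exact` (rather than `chunks`) is what lets the compiler see a fixed chunk length and drop per-chunk length checks; the leftover elements come back via `remainder()`.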
Yeah, that's what I was thinking: might a user want to modify this value depending on their hardware/OS setup?
Or, if someone would like to benchmark take or interleave with different chunk sizes, they would have to rewrite the method accordingly. Maybe a macro could help? It can be done as a follow-up if needed.
Ah, like that. Yeah, that should be possible (it would require some hardware setup), although I don't think it would really amount to a user-facing setting, since these are compile-time constants.
On the benchmarks it looks to be beneficial for both aarch64 and x86 CPUs: roughly a 23% improvement on M1 (my laptop) and roughly 30% on Intel (the @alamb-ghbot runner).
```rust
    .collect::<Vec<_>>();
for idx in remainder {
    // SAFETY: base < len == output capacity
    unsafe { dst.add(base).write(arrays[idx.0].value(idx.1)) };
```
For the unsafe calls, would it make sense to `debug_assert!` the safety requirements, to catch unexpected things in CI debug builds?
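A sketch of what that could look like, using a hypothetical `write_at` helper rather than the actual PR code: the condition from the `SAFETY` comment is restated as a `debug_assert!`, which compiles away in release builds but would fire in CI debug runs:

```rust
/// Hypothetical helper mirroring the remainder loop: writes `value` at
/// offset `base` into a buffer of capacity `len`.
///
/// # Safety
/// `dst` must point to a buffer of at least `len` initialized-writable
/// elements, and `base < len` must hold.
unsafe fn write_at(dst: *mut i32, base: usize, len: usize, value: i32) {
    // Restates the SAFETY requirement. Zero cost in release builds, but a
    // violated invariant aborts loudly in debug/CI instead of causing UB.
    debug_assert!(base < len, "write out of bounds: base={base}, len={len}");
    unsafe { dst.add(base).write(value) };
}

fn main() {
    let mut buf = vec![0i32; 4];
    let len = buf.len();
    let dst = buf.as_mut_ptr();
    for (i, v) in [10, 20, 30, 40].into_iter().enumerate() {
        // SAFETY: i < len, and `buf` owns `len` elements.
        unsafe { write_at(dst, i, len, v) };
    }
    assert_eq!(buf, vec![10, 20, 30, 40]);
}
```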
Agreed. Auto-vectorization is preferred: in an ideal scenario we wouldn't need to specialize the code at all, and we'd write our APIs so that LLVM can generate good code automatically. I am thinking perhaps we can at least remove the bounds checks.

run benchmark interleave_kernels
This reverts commit fc1adb9.
Oh wait, that is not such a useful idea, as we access all of the rows anyway (so it will be more expensive to check upfront than it is to do it "on the go").
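To illustrate that trade-off with hypothetical helpers (not the PR code): upfront validation is a full extra pass over the indices, while checking "on the go" pays one check per access, which is work that has to happen anyway when every index is read:

```rust
/// Upfront validation: one full extra pass over `indices` before gathering.
fn gather_validate_upfront(values: &[i32], indices: &[usize]) -> Option<Vec<i32>> {
    // Extra pass: touches every index once before any value is read.
    if indices.iter().any(|&i| i >= values.len()) {
        return None;
    }
    // All indices verified; real code could use unchecked access here.
    // This sketch keeps safe indexing for simplicity.
    Some(indices.iter().map(|&i| values[i]).collect())
}

/// "On the go": each access carries its own bounds check via `get`.
fn gather_check_on_the_go(values: &[i32], indices: &[usize]) -> Option<Vec<i32>> {
    indices.iter().map(|&i| values.get(i).copied()).collect()
}

fn main() {
    let values = vec![1, 2, 3];
    assert_eq!(gather_validate_upfront(&values, &[2, 0, 1]), Some(vec![3, 1, 2]));
    assert_eq!(gather_check_on_the_go(&values, &[2, 0, 1]), Some(vec![3, 1, 2]));
    assert_eq!(gather_check_on_the_go(&values, &[9]), None);
}
```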
🤖: Benchmark completed (Details)
comphead left a comment:
Thanks @Dandandan, the numbers look good to me. Looking forward to running the DataFusion TPC* benchmarks.