Improve casting of complex tuples to complex vectors. Perhaps consider an alternative approach? #8
Maybe consider a different algorithm, while still keeping the first approach: use the first approach to load the values efficiently into an `LVec`, and then use two `shufflevector` instructions to get the values into a `ComplexVec` (this can of course be generalized). So this approach costs two shuffles per cast. @oscardssmith, I'm curious what you think about this approach? Going with the second approach instead would require a rewrite, though, so I'm less inclined to push through with that.

```julia
function ComplexVec3(z::NTuple{4, Complex{T}}) where {T <: FloatTypes}
    x = LVec{8, Float64}((reim(z[1])..., reim(z[2])..., reim(z[3])..., reim(z[4])...))
    b = shufflevector(x, Val(0:2:6))
    c = shufflevector(x, Val(1:2:7))
    return ComplexVec{4, Float64}(b, c)
end
```

```julia
julia> @code_llvm debuginfo=:none SIMDMath.ComplexVec3((a, b, c, d))
define void @julia_ComplexVec3_2553([2 x <4 x double>]* noalias nocapture sret([2 x <4 x double>]) %0, [4 x [2 x double]]* nocapture nonnull readonly align 8 dereferenceable(64) %1) #0 {
top:
%2 = bitcast [4 x [2 x double]]* %1 to <8 x double>*
%3 = load <8 x double>, <8 x double>* %2, align 8
%res.i = shufflevector <8 x double> %3, <8 x double> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%res.i2 = shufflevector <8 x double> %3, <8 x double> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
%.sroa.0.0..sroa_idx = getelementptr inbounds [2 x <4 x double>], [2 x <4 x double>]* %0, i64 0, i64 0
store <4 x double> %res.i, <4 x double>* %.sroa.0.0..sroa_idx, align 32
%.sroa.2.0..sroa_idx1 = getelementptr inbounds [2 x <4 x double>], [2 x <4 x double>]* %0, i64 0, i64 1
store <4 x double> %res.i2, <4 x double>* %.sroa.2.0..sroa_idx1, align 32
ret void
}
```

```julia
julia> @code_native debuginfo=:none SIMDMath.ComplexVec3((a, b, c, d))
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 11, 0
.globl _julia_ComplexVec3_2577 ; -- Begin function julia_ComplexVec3_2577
.p2align 2
_julia_ComplexVec3_2577: ; @julia_ComplexVec3_2577
.cfi_startproc
; %bb.0: ; %top
ld2 { v0.2d, v1.2d }, [x0], #32
ld2 { v2.2d, v3.2d }, [x0]
stp q0, q2, [x8]
stp q1, q3, [x8, #32]
ret
.cfi_endproc
; -- End function
.subsections_via_symbols
```
I don't know much about optimal vectorization strategies here. @CELROD, do you have thoughts on this?
Ya, I think there has to be some compromise (I think you meant to tag @chriselrod here?). Also, I'm on ARM here (Apple M1). I do quite enjoy how simple the complex multiply (and therefore muladd) instructions are with the current approach (see lines 13 to 24 in abf7ec4).

The second approach would require more shuffles. Looking more into the optimization guide: compared to standard loads, an extra cycle is needed for the structured `ld2` loads shown above.
Generic implementation at #9, which should work for any length of tuple (needs tests etc.).
The second approach would be good on x64 CPUs with the FMA instruction set, since they have `fmaddsub` instructions.

EDIT: I'd have to read this document (and this issue) more carefully to say whether FCMLA actually does something useful here. You can see here for an idea of how to wrap special instructions without resorting to ASM: https://github.com/JuliaSIMD/VectorizationBase.jl/blob/e2480f9490f91a2b1e47decab87f07feb7170146/src/llvm_intrin/vfmaddsub.jl
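For the record, a minimal sketch of the idea (an illustration using SIMD.jl's `Vec` and `shufflevector`; the linked file builds the LLVM IR directly, so this is not VectorizationBase's actual definition): blend an fma with an fmsub using an alternating lane mask, a pattern the x86 backend can select down to a single `vfmaddsubpd` when FMA is available.

```julia
using SIMD  # assumed dependency for this sketch

# fmaddsub(x, y, z): x*y - z in even lanes (0-based), x*y + z in odd lanes.
# The alternating blend of an fma and an fmsub is the pattern LLVM's x86
# backend recognizes and lowers to a single vfmaddsubpd instruction.
@inline function fmaddsub(x::Vec{4,Float64}, y::Vec{4,Float64}, z::Vec{4,Float64})
    plus  = muladd(x, y, z)    # x*y + z
    minus = muladd(x, y, -z)   # x*y - z
    # Indices count across the concatenation (minus..., plus...): lanes 0 and 2
    # come from `minus`, lanes 5 and 7 (i.e. lanes 1 and 3 of `plus`) from `plus`.
    return shufflevector(minus, plus, Val((0, 5, 2, 7)))
end
```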
Thank you for the links! I'll try to play around with the FCMLA instruction to see if it is helpful. I've been a little hesitant to develop separate routines for x64 and ARM, simply because I want to put as much time as I can into the special function implementations; the downside is that I'm relying pretty heavily on LLVM to do the appropriate optimizations.

The `fmaddsub` instruction will be very helpful, but I think we'll also need a shuffle between these instructions within a complex multiply, whereas the current approach does not. This was one of the reasons we were able to compute four complex evalpolys faster than two complex evalpolys in Base (which were able to autovectorize): when computing a string of multiplies (say, 15 muladds in a row), it seems it was faster not to have to perform any shuffles within the hot loop. Again, I'd have to benchmark both approaches more carefully, but avoiding shuffles does seem helpful if you are performing a lot of work with the vectors. I'm also not sure exactly how well the approach would interplay with everything else.

But ya, I guess if the architectures are going to go with a (re, im, re, im) approach, then it might be hard to beat native instructions... Thank you for the tips!
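To make the shuffle-free chain concrete, here is a hedged sketch (a hypothetical `horner_step`, not the package's code) of one Horner step `p = z*p + c` in the split layout: four fmas per coefficient and no lane shuffles anywhere in the chain.

```julia
# One Horner step in split (re, im) form; works for scalars or for
# SIMD.jl-style vectors, since muladd is elementwise either way.
# z*p = (zre*pre - zim*pim) + (zre*pim + zim*pre)i, then add c.
horner_step(zre, zim, pre, pim, cre, cim) =
    (muladd(zre, pre, muladd(-zim, pim, cre)),   # real part: re(z*p) + cre
     muladd(zre, pim, muladd(zim, pre, cim)))    # imaginary part: im(z*p) + cim
```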
Ah, yes, this may be better on ARM/with the FCMLA instruction; I think it can avoid the shuffles. Depending on what you're doing, fmaddsub + fmsubadd reduces the amount of shuffling you need to do versus needing to shuffle on load when you have arrays of `Complex` numbers.
I'll try to test this out over the next few weeks. There isn't too much information on what FCMLA is actually doing. Here is a quote from https://arxiv.org/pdf/2103.03013.pdf.

So perhaps it is doing shuffling? From another paper (https://arxiv.org/pdf/2103.03013.pdf)...

It seems that paper also recommended the split layout. Finally, this paper (https://csrc.nist.gov/csrc/media/Events/2022/fourth-pqc-standardization-conference/documents/papers/fast-falcon-signature-generation-and-verification-pqc2022.pdf).

I'll have to really dive into these papers more, and I'm not sure there is a clear winner, but it seemed like they all preferred the split format in some instances... Again, I just very quickly skimmed these.
There are really two choices for complex arithmetic, mainly in how we want to represent the numbers:

1. Split layout: all real parts in one vector and all imaginary parts in another, e.g. `(re1, re2, ...)` and `(im1, im2, ...)` (the current approach).
2. Interleaved layout: pairs stored contiguously, e.g. `(re1, im1, re2, im2, ...)`.
Now, initially I liked the first approach better, as it's more natural (especially coming from physics) to consider the real and imaginary pieces separately. It makes multiplication much simpler, which is great, and add/subtract are, I think, a wash between the two.
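For concreteness, here is a minimal sketch of why multiplication is simple in the split layout (a hypothetical `CVec` built on SIMD.jl's `Vec`; not SIMDMath's actual `ComplexVec` definition): every lane is independent, so a complex multiply is just elementwise multiplies and fmas with zero shuffles.

```julia
using SIMD  # assumed dependency for this sketch

struct CVec{N,T}   # hypothetical split-layout complex vector
    re::Vec{N,T}   # real parts of N complex numbers
    im::Vec{N,T}   # imaginary parts
end

# (a + bi)(c + di) = (ac - bd) + (ad + bc)i: purely lane-wise, no shuffles.
Base.:*(x::CVec{N,T}, y::CVec{N,T}) where {N,T} =
    CVec{N,T}(muladd(x.re, y.re, -(x.im * y.im)),
              muladd(x.re, y.im, x.im * y.re))
```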
On the other hand, there is something I didn't initially consider: if you only have two complex numbers, then with the first approach you are only operating on `<2 x double>` vectors, whereas with the second approach you can operate on a `<4 x double>` vector. So if you only have two numbers, the second approach will fill the registers better and be much faster. Of course, for four complex numbers this advantage immediately falls off, because now you can operate on two `<4 x double>` vectors instead of an `<8 x double>` vector (unless using AVX-512). So for algorithms where we can only vectorize over two complex numbers, the second method is superior.
The final piece comes from how we cast complex numbers in Julia to complex vectors. Let's say we want to compute vector cross products, dot products, or really any BLAS-type routine where we need to load a small subset of a larger array and perform operations on it. The cost of casting becomes quite significant, especially if it has to be done at runtime.
For example, the current approach isn't great at this...
The first question to tackle in this issue is: can we improve this? That would help alleviate some of the concerns with the current approach.
Consider the second approach more seriously?
The second question is perhaps more fundamental: should we more strongly consider the second approach? For example, the casting could look something like...
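For instance, a minimal sketch (a hypothetical `interleave` constructor reusing SIMDMath's `LVec`; not code from the package): with a (re, im, re, im, ...) layout the cast is a single contiguous load of the tuple's memory, with no `shufflevector` at all.

```julia
# Hypothetical interleaved cast: one wide load, no shuffles; effectively a
# reinterpretation of the tuple's memory rather than a rearrangement.
interleave(z::NTuple{4, Complex{Float64}}) =
    LVec{8, Float64}((reim(z[1])..., reim(z[2])..., reim(z[3])..., reim(z[4])...))
```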
This is significantly cleaner if we have to do it many times within a hot loop. I'm wondering if it would be possible to better reinterpret the first approach. Again, multiplication is, I think, more efficient with the first approach, but that comes with significant downsides in casting.