How to deal with lane difference between SSE/AVX #2404
Comments
Use hn::Lanes(d) to get the number of lanes for the target you're compiling for.
Generally I'd recommend "vector length agnostic" algorithms, where each vector element has the same (semantic) type, and you don't care what the vector length is. In this code, it looks like the numbers 4 and 8 are hardcoded/special, rather than VL-agnostic. In such cases, you can cap the vector length to, say, 8 via hn::CappedTag<T, 8>, and then you process up to 8 items at a time. HTH?
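As a rough illustration of this capped, vector-length-agnostic style (the function and variable names here are placeholders, not code from this issue), a loop over double coefficients could look like:

```cpp
#include "hwy/highway.h"
namespace hn = hwy::HWY_NAMESPACE;

// Process up to 8 doubles per iteration, whatever the actual target width is.
HWY_ATTR void ScaleCoeffs(const double* coeffs, double* out, size_t count,
                          double gain) {
  const hn::CappedTag<double, 8> d;  // never more than 8 lanes
  const size_t N = hn::Lanes(d);     // 2 on SSE2/SSE4, 4 on AVX2, 8 on AVX-512
  size_t i = 0;
  for (; i + N <= count; i += N) {
    const auto v = hn::LoadU(d, coeffs + i);
    hn::StoreU(hn::Mul(v, hn::Set(d, gain)), d, out + i);
  }
  for (; i < count; ++i) out[i] = coeffs[i] * gain;  // scalar remainder
}
```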
@eugeneo Thanks for that info. In my main calculation, in the case for SSE I would have to have two accumulators rather than the one required for AVX. For example with intrinsics for SSE I might have two accumulators like this:
I'm just trying to get my head round the changes to my methodology...
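Two SSE accumulators of the kind described above might look roughly like this (a hypothetical sketch with made-up names, not the original snippet):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Four doubles' worth of accumulation needs two __m128d registers on SSE,
// where a single __m256d would do on AVX.
void AccumulateSse(const double* coeffs, const float* input, size_t count,
                   double out[4]) {
  __m128d acc0 = _mm_setzero_pd();  // result lanes 0..1
  __m128d acc1 = _mm_setzero_pd();  // result lanes 2..3
  for (size_t i = 0; i < count; ++i) {
    const __m128d x = _mm_set1_pd(static_cast<double>(input[i]));  // broadcast
    acc0 = _mm_add_pd(acc0, _mm_mul_pd(x, _mm_loadu_pd(coeffs + 4 * i)));
    acc1 = _mm_add_pd(acc1, _mm_mul_pd(x, _mm_loadu_pd(coeffs + 4 * i + 2)));
  }
  _mm_storeu_pd(out, acc0);
  _mm_storeu_pd(out + 2, acc1);
}
```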
@jan-wassenberg The problem I have with the pre-calculations, and indeed the calculations later on, is that my algorithm is decoupled to 4. So my main calculation loop loads 4 vectors in and processes them by the coefficients. With SSE there are 2 accumulators but with AVX there is only 1 due to the double lane length. I mean, I could also expand this to AVX-512 and decouple to 8 inputs, but I would have to update the code accordingly. Decoupling to 4 for SSE works as the input is 4 floats which are then processed as doubles for accuracy.
Here is the corrected version of the snippet above (using hn::CappedTag and hn::StoreInterleaved4):
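That snippet is not reproduced in this thread, but a minimal sketch of combining hn::CappedTag with hn::StoreInterleaved4 could look like the following (the 4-row coefficient layout and names are assumptions):

```cpp
#include "hwy/highway.h"
namespace hn = hwy::HWY_NAMESPACE;

// Store four coefficient rows interleaved, capped at 4 double lanes so the
// same code path is taken on SSE (2 lanes) and AVX2/AVX-512 (4 lanes).
// Assumes count is a multiple of Lanes(d).
HWY_ATTR void InterleaveCoeffRows(const double* row0, const double* row1,
                                  const double* row2, const double* row3,
                                  size_t count, double* interleaved) {
  const hn::CappedTag<double, 4> d;
  const size_t N = hn::Lanes(d);
  for (size_t i = 0; i < count; i += N) {
    const auto v0 = hn::LoadU(d, row0 + i);
    const auto v1 = hn::LoadU(d, row1 + i);
    const auto v2 = hn::LoadU(d, row2 + i);
    const auto v3 = hn::LoadU(d, row3 + i);
    // Writes v0[0], v1[0], v2[0], v3[0], v0[1], v1[1], ... to memory.
    hn::StoreInterleaved4(v0, v1, v2, v3, d, interleaved + 4 * i);
  }
}
```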
Thanks @johnplatts I'm reading through this now, trying to understand it fully. In the for loop: … And in the store part I was trying to do something a bit hacky, but this doesn't work properly as the d type was AVX even in the else block, so I got a GPF.
When running on SSE2/SSSE3/SSE4, the loop does read 2 doubles from …
I'm using dynamic dispatch, so how would I use the above? Thanks for all your help @johnplatts
@7sharp9 An updated version of CoefficientState that uses dynamic dispatch can be found over in Compiler Explorer at https://godbolt.org/z/GTz9Mh7oe |
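For readers without the link handy, the general shape of Highway's dynamic-dispatch pattern (from the library's quickstart; the function here is a placeholder, not CoefficientState itself) is:

```cpp
// coeff.cc
#undef HWY_TARGET_INCLUDE
#define HWY_TARGET_INCLUDE "coeff.cc"  // this file is re-included per target
#include "hwy/foreach_target.h"        // must come before highway.h
#include "hwy/highway.h"

HWY_BEFORE_NAMESPACE();
namespace project {
namespace HWY_NAMESPACE {
namespace hn = hwy::HWY_NAMESPACE;

// Per-target implementation (placeholder body).
void MulByGain(const double* in, double* out, size_t count, double gain) {
  const hn::ScalableTag<double> d;
  const size_t N = hn::Lanes(d);
  size_t i = 0;
  for (; i + N <= count; i += N) {
    hn::StoreU(hn::Mul(hn::LoadU(d, in + i), hn::Set(d, gain)), d, out + i);
  }
  for (; i < count; ++i) out[i] = in[i] * gain;
}

}  // namespace HWY_NAMESPACE
}  // namespace project
HWY_AFTER_NAMESPACE();

#if HWY_ONCE
namespace project {
HWY_EXPORT(MulByGain);  // table of per-target function pointers

void CallMulByGain(const double* in, double* out, size_t count, double gain) {
  HWY_DYNAMIC_DISPATCH(MulByGain)(in, out, count, gain);  // best target at runtime
}
}  // namespace project
#endif  // HWY_ONCE
```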
@johnplatts Many thanks for that, it works really well. This is only the pre-computations for the main calc, but it's 14x faster than the auto-vectorised code! I'm wondering about AVX-512: the reason for the 4 rows on the … That would mean that …
I was wondering what your thoughts were on supporting higher numbers so that more lanes are filled when doing the calculations? I think …
I realise now, looking through the code on Compiler Explorer, that the …
There is a difference between … Also, there is a difference between …
Ah ok, I didn't think of those situations; I don't have any hardware to try those on just yet. I'll have an M4 when it arrives next week, but I don't think that has SVE. It probably makes sense to cap to the length of the …
Apple M4 is an Armv9.2-A CPU, but SVE/SVE2 are optional on Armv9-A CPUs, and Apple M4 does not support SVE or SVE2 (according to the AArch64 processor feature description that can be found at https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64Processors.td).
As part of this, the coefficients are used in a calculation part. I'm wondering if there's a simpler way of doing this, which is the first part. This is the code with intrinsics:
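That intrinsics snippet isn't shown here; going by the description at the end of this comment (load 4 floats, widen them to doubles), it plausibly resembles the following sketch with guessed names:

```cpp
#include <emmintrin.h>  // SSE2

// Read 4 floats and widen them into two __m128d registers (2 doubles each).
void LoadFourAsDoubles(const float* input, __m128d* lo, __m128d* hi) {
  const __m128 f = _mm_loadu_ps(input);       // 4 floats
  *lo = _mm_cvtps_pd(f);                      // floats 0..1 -> doubles
  *hi = _mm_cvtps_pd(_mm_movehl_ps(f, f));    // floats 2..3 -> doubles
}

// Broadcasting one converted sample to both lanes would then be:
//   const __m128d x = _mm_set1_pd(static_cast<double>(input[i]));
```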
With Highway, would this be achieved with:
and then:
Finally
All the snippet at the top is doing is loading 4 inputs, which will be floats, converting them to doubles and then dropping them into …
Here is the version of the above loop with Google Highway:
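The linked answer isn't quoted above; one way to express the float-to-double widening in Highway is sketched below (hn::Rebind and hn::PromoteTo are standard ops, the surrounding names are assumptions):

```cpp
#include "hwy/highway.h"
namespace hn = hwy::HWY_NAMESPACE;

// Widen float inputs to double lanes, independent of the vector length.
HWY_ATTR void WidenInputs(const float* input, double* output, size_t count) {
  const hn::ScalableTag<double> dd;          // N double lanes
  const hn::Rebind<float, decltype(dd)> df;  // N float lanes (half-width vector)
  const size_t N = hn::Lanes(dd);
  size_t i = 0;
  for (; i + N <= count; i += N) {
    const auto f = hn::LoadU(df, input + i);           // N floats
    hn::StoreU(hn::PromoteTo(dd, f), dd, output + i);  // N doubles
  }
  for (; i < count; ++i) output[i] = static_cast<double>(input[i]);
}
```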
Thanks @johnplatts Is that the best way of doing this? In the next phase I do the calc, and I think I'll be able to avoid using multiple accumulators by using the StoreInterleaved you showed with the coefficients.
e.g. rather than the two accumulators, I should be able to make that agnostic, or at least better, and use …
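A rough sketch of what a single vector-length-agnostic accumulator can look like with hn::MulAdd (the data layout and names are assumptions about the algorithm, not code from this thread):

```cpp
#include "hwy/highway.h"
namespace hn = hwy::HWY_NAMESPACE;

// One accumulator vector regardless of target width; the horizontal sum at
// the end replaces the per-register accumulators of the intrinsics version.
HWY_ATTR double DotProduct(const double* coeffs, const double* samples,
                           size_t count) {
  const hn::ScalableTag<double> d;
  const size_t N = hn::Lanes(d);
  auto acc = hn::Zero(d);
  size_t i = 0;
  for (; i + N <= count; i += N) {
    acc = hn::MulAdd(hn::LoadU(d, samples + i), hn::LoadU(d, coeffs + i), acc);
  }
  double sum = hn::ReduceSum(d, acc);                    // add all lanes
  for (; i < count; ++i) sum += samples[i] * coeffs[i];  // scalar remainder
  return sum;
}
```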
Is …
It is an F64X2. It's essentially the result of … In the code you posted:
That seems to be broadcasting the first read to half the lanes and then reading 4 to the rest, whereas I was broadcasting to all the lanes. I'm sure you did that intentionally so that it was lane-length agnostic. I'll happily try to adapt the rest of the algorithm if I can. The original code reads 4 samples, converts them to doubles, broadcasts each value to all lanes, then multiplies and adds with each of the coefficients. It does that using an accumulator per row of coefficients before writing the accumulated results to the output. The bit that's hard to work out is the accumulator part. I think if I get my head around it, it might be similar to the StoreInterleaved part of … For reference, as a scalar calculation it would be:
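That scalar reference isn't included above; based on the description (4 input samples, one accumulator per coefficient row), its shape is presumably something like this sketch with guessed names and indexing:

```cpp
// Hypothetical scalar shape: each of 4 input floats is widened to double and
// multiplied into every coefficient row, with one running accumulator per row.
constexpr int kRows = 4;

void CalcScalar(const float input[4], const double coeffs[kRows][4],
                double acc[kRows], double output[kRows]) {
  for (int s = 0; s < 4; ++s) {                        // 4 decoupled input samples
    const double x = static_cast<double>(input[s]);    // widen for accuracy
    for (int row = 0; row < kRows; ++row) {
      acc[row] += x * coeffs[row][s];                  // multiply-add per row
    }
  }
  for (int row = 0; row < kRows; ++row) {
    output[row] = acc[row];                            // write accumulated results
  }
}
```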
Here is the updated version of the above code with Google Highway:
Note that the above code requires that …
@johnplatts I really appreciate you looking at this; I'll try and understand what you have done. The limitation of … I tried to think how this could work with only one accumulator, but I could not get my head round doing it efficiently.
This is my last question for now. In the final scalar processing for the remaining samples, is it better to have the calculations as just scalar math without using Highway, or to use just one lane like this:
It might have been better to do a masked or padded version, but the manipulation of the state variable complicates that.
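A small sketch of the one-lane idea, using hn::CappedTag<double, 1> (the loop body is a placeholder rather than the real calculation):

```cpp
#include "hwy/highway.h"
namespace hn = hwy::HWY_NAMESPACE;

// Handle leftover samples one lane at a time with the same ops as the main
// loop, so the code path (and rounding behaviour) stays identical.
HWY_ATTR void ProcessTail(const double* coeffs, const double* samples,
                          size_t begin, size_t end, double* out) {
  const hn::CappedTag<double, 1> d1;  // exactly one lane on every target
  for (size_t i = begin; i < end; ++i) {
    const auto x = hn::LoadU(d1, samples + i);
    const auto c = hn::LoadU(d1, coeffs + i);
    hn::StoreU(hn::Mul(x, c), d1, out + i);  // placeholder computation
  }
}
```

Another option for the remainder is hn::LoadN / hn::StoreN, which only touch the first `num` lanes, though as noted the state handling may make a masked or padded variant awkward here.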
Hi, I'm enjoying trying out this library but I've come to a point where I'm a little confused about how to deal with lane differences between SSE/AVX when dealing with doubles. As you know, for SSE there are two double lanes per vector and for AVX there would be 4.
Take this piece of code:
In this example the coeffs are processed and put into Highway vectors. On AVX this is fine as we have 4 lanes. On SSE we only have 2 lanes, so how do you handle this sort of thing with Highway? Would you process via the stride length and have the SSE version return twice as many elements? Is this something where I would have to write a separate version for SSE? Any help would be greatly appreciated!
p.s. This is the pre-calc part of the code I'm working on. The main SIMD part that does the calculation consumes the SIMD registers from this part.
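As a concrete (hypothetical) illustration of the stride idea from the question, the pre-calc can loop over the 4 coefficients in steps of hn::Lanes(d), so SSE simply runs twice as many iterations as AVX:

```cpp
#include "hwy/highway.h"
namespace hn = hwy::HWY_NAMESPACE;

// 4 double coefficients processed Lanes(d) at a time: two iterations on SSE
// (2 lanes), one on AVX2 (4 lanes). Names are placeholders.
HWY_ATTR void PreCalc(const double coeffs[4], double gain, double out[4]) {
  const hn::CappedTag<double, 4> d;  // never wider than the 4 values we have
  const size_t N = hn::Lanes(d);
  for (size_t i = 0; i < 4; i += N) {
    const auto c = hn::LoadU(d, coeffs + i);
    hn::StoreU(hn::Mul(c, hn::Set(d, gain)), d, out + i);
  }
}
```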