
How to deal with lane difference between SSE/AVX #2404

Open · 7sharp9 opened this issue Dec 11, 2024 · 22 comments

@7sharp9 commented Dec 11, 2024

Hi, I'm enjoying trying out this library, but I've come to a point where I'm a little confused about how to deal with lane differences between SSE/AVX when working with doubles. As you know, for SSE there are two doubles per vector and for AVX there would be 4.

Take this piece of code:

        using T = double;
        HWY_ATTR void CoefficientState (T b0, T b1, T b2, T a1, T a2, T* HWY_RESTRICT coefficients) {
            // Use ScalableTag compatible with this target
            hn::ScalableTag<double> d;

            // Prepare coefficient matrix
            double coeffs[4][8] = { { 0, 0, 0, b0, b1, b2, a1, a2 },
                                    { 0, 0, b0, b1, b2, 0, a2, 0 },
                                    { 0, b0, b1, b2, 0, 0, 0, 0 },
                                    { b0, b1, b2, 0, 0, 0, 0, 0 } };

            for (int i = 0; i < 8; i++) {
                // Update rows with a1 and a2 dependencies
                coeffs[1][i] += a1 * coeffs[0][i];
                coeffs[2][i] += a1 * coeffs[1][i] + a2 * coeffs[0][i];
                coeffs[3][i] += a1 * coeffs[2][i] + a2 * coeffs[1][i];
            }

            // Load coefficients into Highway vectors
            for (int i = 0; i < 8; ++i) {
                alignas (alignof (T) * 4) T values[4] = { coeffs[3][i], coeffs[2][i], coeffs[1][i], coeffs[0][i] };
                auto vec = hn::Load (d, values); // Load the values into a vector
                hn::Store (vec, d, &coefficients[i]); // Store the vector into the coefficients
            }
        }

In this example the coeffs are processed and put into Highway vectors. On AVX this is fine as we have 4 lanes. On SSE we only have 2 lanes, so how do you handle this sort of thing with Highway? Would you process via the stride length and have the SSE version return twice as many elements? Is this something where I would have to write a separate version for SSE?

Any help would be greatly appreciated!

P.S. This is the pre-calc part of the code I'm working on. The main SIMD part that does the calculation consumes the SIMD registers from this part.

@eugeneo commented Dec 11, 2024

Use hn::Lanes(d) to get the number of lanes for the target you're compiling for. I would declare values as HWY_ALIGN std::array<T, d.MaxLanes()> and use loops instead of hand-unrolling.
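
For reference, a minimal sketch of the lane-count-agnostic loop shape being described here (the function, its arguments, and the scaling operation are illustrative placeholders, not code from this project):

namespace hn = hwy::HWY_NAMESPACE;

// Sketch: process `count` doubles in chunks of hn::Lanes(d), whatever the target.
HWY_ATTR void Scale(const double* HWY_RESTRICT src, double* HWY_RESTRICT dst,
                    size_t count, double factor) {
  const hn::ScalableTag<double> d;
  const size_t N = hn::Lanes(d);  // 2 on SSE, 4 on AVX2, 8 on AVX-512, ...
  const auto vfactor = hn::Set(d, factor);
  size_t i = 0;
  for (; i + N <= count; i += N) {
    hn::StoreU(hn::Mul(hn::LoadU(d, src + i), vfactor), d, dst + i);
  }
  for (; i < count; ++i) dst[i] = src[i] * factor;  // scalar tail
}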

@jan-wassenberg (Member)

Generally I'd recommend "vector length agnostic" algorithms, where each vector element has the same (semantic) type, and you don't care what the vector length is.
This works well if you have a loop over all items, and can then increment the loop counter by hn::Lanes as @eugeneo mentions.

In this code, it looks like the numbers 4 and 8 are hardcoded/special, rather than VL-agnostic. In such cases, you can cap the vector length to say 8 via hn::CappedTag<T, 8>, and then you process up to 8 items at a time. HTH?
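
As a rough sketch of the capped-tag idea (illustrative names only, not code from this issue):

namespace hn = hwy::HWY_NAMESPACE;

// Sketch: iterate over a fixed group of 8 doubles in chunks of at most 8 lanes.
HWY_ATTR void CopyEight(const double* HWY_RESTRICT in, double* HWY_RESTRICT out) {
  const hn::CappedTag<double, 8> d;  // never reports more than 8 lanes
  const size_t N = hn::Lanes(d);     // 2 on SSE, 4 on AVX2, 8 on AVX-512
  for (size_t i = 0; i < 8; i += N) {
    hn::StoreU(hn::LoadU(d, in + i), d, out + i);
  }
}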

@7sharp9 commented Dec 11, 2024

@eugeneo Thanks for that info. In my main calculation, in the case for SSE I would have to have two accumulators rather than the one required for AVX. For example with intrinsics for SSE I might have two accumulators like this:

        auto y0_y1 = coefficients[0] * xp3
                      + coefficients[2] * xp2
                      + coefficients[4] * xp1
                      + coefficients[6] * x0
                      + coefficients[8] * xm1_vec
                      + coefficients[10] * xm2_vec
                      + coefficients[12] * ym1_vec
                      + coefficients[14] * ym2_vec;

        // Compute y[2] and y[3] using rows 2 and 3
        auto y2_y3 = coefficients[1] * xp3
                      + coefficients[3] * xp2
                      + coefficients[5] * xp1
                      + coefficients[7]  * x0
                      + coefficients[9] * xm1_vec
                      + coefficients[11] * xm2_vec
                      + coefficients[13] * ym1_vec
                      + coefficients[15] * ym2_vec;

        // Store results of the two accumulators back to output
        _mm_storeu_ps(&output[index], _mm_movelh_ps(_mm_cvtpd_ps(y0_y1), _mm_cvtpd_ps(y2_y3)));

I'm just trying to get my head around the changes to my methodology...

@7sharp9 commented Dec 11, 2024

@jan-wassenberg The problem I have with the pre-calculations, and indeed the calculations later on, is that my algorithm is decoupled to 4. So my main calculation loop loads 4 vectors in and processes them with the coefficients. With SSE there are 2 accumulators, but with AVX there is only 1 due to the doubled lane count. I could also expand this to AVX512 and decouple to 8 inputs, but I would have to update coeffs[4][8] to 8x12, and that adds a bit more complexity again.

Decoupling to 4 for SSE works as the input is 4 floats which are then processed as doubles for accuracy.

@johnplatts (Contributor)

Here is the corrected version of the snippet above (using hn::CappedTag and hn::StoreInterleaved4):

#include "hwy/highway.h"

HWY_BEFORE_NAMESPACE();
namespace test {
namespace HWY_NAMESPACE {

namespace hn = hwy::HWY_NAMESPACE;

template <class T>
HWY_ATTR void CoefficientState(T b0, T b1, T b2, T a1, T a2,
                               T* HWY_RESTRICT coefficients) {
  hn::CappedTag<T, 8> d;

  constexpr size_t kCoeffsAlign =
      HWY_MIN(sizeof(T) * 32, HWY_MIN(HWY_MAX_BYTES, 64));

  // Prepare coefficient matrix
  alignas(kCoeffsAlign) T coeffs[4][8] = {{0, 0, 0, b0, b1, b2, a1, a2},
                                          {0, 0, b0, b1, b2, 0, a2, 0},
                                          {0, b0, b1, b2, 0, 0, 0, 0},
                                          {b0, b1, b2, 0, 0, 0, 0, 0}};

  const auto v_a1 = hn::Set(d, a1);
  const auto v_a2 = hn::Set(d, a2);

  const size_t N = hn::Lanes(d);
  for (size_t i = 0; i < 8; i += N) {
    // Update rows with a1 and a2 dependencies
#if HWY_MAX_LANES <= 64
    auto v_coeffs0 = hn::Load(d, &coeffs[0][i]);
    auto v_coeffs1 = hn::Load(d, &coeffs[1][i]);
    auto v_coeffs2 = hn::Load(d, &coeffs[2][i]);
    auto v_coeffs3 = hn::Load(d, &coeffs[3][i]);
#else
    auto v_coeffs0 = hn::LoadU(d, &coeffs[0][i]);
    auto v_coeffs1 = hn::LoadU(d, &coeffs[1][i]);
    auto v_coeffs2 = hn::LoadU(d, &coeffs[2][i]);
    auto v_coeffs3 = hn::LoadU(d, &coeffs[3][i]);
#endif

    v_coeffs1 = hn::MulAdd(v_a1, v_coeffs0, v_coeffs1);
    v_coeffs2 =
        hn::MulAdd(v_a2, v_coeffs0, hn::MulAdd(v_a1, v_coeffs1, v_coeffs2));
    v_coeffs3 =
        hn::MulAdd(v_a2, v_coeffs1, hn::MulAdd(v_a1, v_coeffs2, v_coeffs3));

    // Store coefficients
    hn::StoreInterleaved4(v_coeffs3, v_coeffs2, v_coeffs1, v_coeffs0, d,
                          coefficients + i * 4);
  }
}

}  // namespace HWY_NAMESPACE
}  // namespace test
HWY_AFTER_NAMESPACE();

@7sharp9 commented Dec 12, 2024

Thanks @johnplatts I'm reading through this now trying to understand fully.

In the for loop: for (size_t i = 0; i < 8; i += N) { When running SSE, does that read 2 doubles from &coeffs[0][i] ?

And in the store part, hn::StoreInterleaved4 is what gives me the ability to treat 2, 4, or more lanes agnostically?

I was trying to do something a bit hacky, but this doesn't work properly, as the d type was AVX even in the else block, so I got a GPF.

            // Load coefficients into Highway vectors
            for (int i = 0; i < 8; ++i) {
                // Store the results back into the coefficients

                #if HWY_TARGET <= HWY_AVX2
                    // For AVX (4 lanes), use one store
                    HWY_ALIGN std::array<T, d.MaxLanes()> values = {  coeffs[3][i], coeffs[2][i], coeffs[1][i], coeffs[0][i] };
                    auto vec = hn::Load(d, values.data());
                    hn::Store(vec, d, &coefficients[i]);
                #else
                    HWY_ALIGN std::array<T, d.MaxLanes()> values = { coeffs[1][i], coeffs[0][i] };
                    HWY_ALIGN std::array<T, d.MaxLanes()> values2 = { coeffs[3][i], coeffs[2][i] };
                    auto vec = hn::Load(d, values.data());
                    auto vec2 = hn::Load(d, values2.data());

                    // For SSE (2 lanes), store twice
                    hn::Store(vec, d, &coefficients[i]);
                    hn::Store(vec2, d, &coefficients[i + 8]);
                #endif
            }

@johnplatts (Contributor)

In the for loop: for (size_t i = 0; i < 8; i += N) { When running SSE, does that read 2 doubles from &coeffs[0][i] ?

When running on SSE2/SSSE3/SSE4, the loop does read 2 doubles from &coeffs[0][i].

And in the store part, hn::StoreInterleaved4 is what gives me the ability to treat 2, 4, or more lanes agnostically?

hn::StoreInterleaved4(v_coeffs3, v_coeffs2, v_coeffs1, v_coeffs0, d, coefficients + i * 4) actually stores v_coeffs3[0] in coefficients[i*4], v_coeffs2[0] in coefficients[i*4+1], v_coeffs1[0] in coefficients[i*4+2], v_coeffs0[0] in coefficients[i*4+3], v_coeffs3[1] in coefficients[i*4+4], v_coeffs2[1] in coefficients[i*4+5], v_coeffs1[1] in coefficients[i*4+6], v_coeffs0[1] in coefficients[i*4+7], and so on.
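
Put another way, a scalar sketch of what that store does (same variable names as the snippet above, with N = hn::Lanes(d)):

// Equivalent scalar stores for the StoreInterleaved4 call (sketch only).
for (size_t lane = 0; lane < N; ++lane) {
  coefficients[i * 4 + lane * 4 + 0] = hn::ExtractLane(v_coeffs3, lane);
  coefficients[i * 4 + lane * 4 + 1] = hn::ExtractLane(v_coeffs2, lane);
  coefficients[i * 4 + lane * 4 + 2] = hn::ExtractLane(v_coeffs1, lane);
  coefficients[i * 4 + lane * 4 + 3] = hn::ExtractLane(v_coeffs0, lane);
}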

@7sharp9 commented Dec 12, 2024

I'm using dynamic dispatch, so how would I use CoefficientState? I got stuck on how to export CoefficientState, as it's not a plain function.

Thanks for all your help @johnplatts

@johnplatts commented Dec 12, 2024

@7sharp9 An updated version of CoefficientState that uses dynamic dispatch can be found over in Compiler Explorer at https://godbolt.org/z/GTz9Mh7oe
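
For readers who don't follow the link, the usual shape of Highway's dynamic-dispatch boilerplate looks roughly like the sketch below; the wrapper name and file name are illustrative, and the Compiler Explorer link above has the actual working version.

// Sketch: export a non-template wrapper so HWY_EXPORT/HWY_DYNAMIC_DISPATCH can
// pick the best compiled target at runtime.
#undef HWY_TARGET_INCLUDE
#define HWY_TARGET_INCLUDE "coefficient_state.cc"  // this translation unit
#include "hwy/foreach_target.h"  // must be included before highway.h
#include "hwy/highway.h"

HWY_BEFORE_NAMESPACE();
namespace test {
namespace HWY_NAMESPACE {
namespace hn = hwy::HWY_NAMESPACE;

// ... the CoefficientState template from the earlier comment goes here ...

// Non-template wrapper for a concrete element type.
HWY_ATTR void CoefficientStateF64(double b0, double b1, double b2, double a1,
                                  double a2, double* HWY_RESTRICT coefficients) {
  CoefficientState<double>(b0, b1, b2, a1, a2, coefficients);
}

}  // namespace HWY_NAMESPACE
}  // namespace test
HWY_AFTER_NAMESPACE();

#if HWY_ONCE
namespace test {

HWY_EXPORT(CoefficientStateF64);

void CallCoefficientState(double b0, double b1, double b2, double a1, double a2,
                          double* HWY_RESTRICT coefficients) {
  HWY_DYNAMIC_DISPATCH(CoefficientStateF64)(b0, b1, b2, a1, a2, coefficients);
}

}  // namespace test
#endif  // HWY_ONCE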

@7sharp9 commented Dec 13, 2024

@johnplatts Many thanks for that, it works really well. This is only the pre-computation for the main calc, but it's 14x faster than the auto-vectorised code!

I'm wondering about AVX512. The reason for the 4 rows in the coeffs was that on SSE we would have 4 floats read at once, so it made sense to decouple the algorithm to the minimum number of lanes for floats. With AVX/AVX512 the lane count would jump up dramatically, so I could read 8 or 16 inputs at once. As the calcs are done as doubles, though, the actual calculations would still be 4 on AVX but could be 8 on AVX512.

That would mean that coeffs would have to be larger to accommodate this. E.g. for 8 floats it would be a cap of 12 rather than 8, but you can see there would be some sparsity in the matrix:

        x[i+7] x[i+6] x[i+5] x[i+4] x[i+3] x[i+2] x[i+1]  x[i]  x[i-1] x[i-2] y[i-1] y[i-2]
y[i]   =   0      0     0      0       0      0      0     b0     b1     b2     a1     a2
y[i+1] =   0      0     0      0       0      0     b0     b1     b2     0      a2      0   + a1*y[i]
y[i+2] =   0      0     0      0       0     b0     b1     b2     0      0       0      0   + a1*y[i+1] + a2*y[i]
y[i+3] =   0      0     0      0      b0     b1     b2     0      0      0       0      0   + a1*y[i+2] + a2*y[i+1]
y[i+4] =   0      0     0     b0      b1     b2      0     0      0      0       0      0   + a1*y[i+3] + a2*y[i+2]
y[i+5] =   0      0    b0     b1      b2      0      0     0      0      0       0      0   + a1*y[i+4] + a2*y[i+3]
y[i+6] =   0     b0    b1     b2       0      0      0     0      0      0       0      0   + a1*y[i+5] + a2*y[i+4]
y[i+7] =  b0     b1    b2      0       0      0      0     0      0      0       0      0   + a1*y[i+6] + a2*y[i+5]

I was wondering what your thoughts were on supporting higher counts so that more lanes are filled when doing the calculations? I think building coeffs could be automated: the bN coefficients just move one column to the left each row, and the aN coefficients do the same, becoming 0 after the y[i-1] column. I'm wondering if automating this for AVX512 would be worth the added complexity.
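
A rough sketch of how that generation could be automated for an arbitrary row count (purely illustrative; kRows, the column layout, and the function name are assumptions based on the table above, not project code):

#include <cstddef>
#include <vector>

// Sketch: seed the banded matrix, then fold in the a1/a2 dependencies the same
// way the loop in CoefficientState does.
std::vector<std::vector<double>> MakeCoeffMatrix(size_t kRows, double b0,
                                                 double b1, double b2,
                                                 double a1, double a2) {
  // Columns: x[i+kRows-1] .. x[i], x[i-1], x[i-2], y[i-1], y[i-2]
  const size_t kCols = kRows + 4;
  std::vector<std::vector<double>> c(kRows, std::vector<double>(kCols, 0.0));
  for (size_t r = 0; r < kRows; ++r) {
    // The b coefficients shift one column to the left on each row.
    c[r][kRows - 1 - r] = b0;
    c[r][kRows - r] = b1;
    c[r][kRows + 1 - r] = b2;
  }
  // Direct y[i-1]/y[i-2] terms only appear in the first two rows.
  c[0][kRows + 2] = a1;
  c[0][kRows + 3] = a2;
  c[1][kRows + 2] = a2;
  // Fold in the recursive dependencies row by row.
  for (size_t r = 1; r < kRows; ++r) {
    for (size_t col = 0; col < kCols; ++col) {
      c[r][col] += a1 * c[r - 1][col];
      if (r >= 2) c[r][col] += a2 * c[r - 2][col];
    }
  }
  return c;
}

With kRows = 4 this reproduces the 4x8 matrix used in CoefficientState; kRows = 8 gives the 8x12 matrix above.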

@7sharp9 commented Dec 13, 2024

I realise now, looking through the code on Compiler Explorer, that the hn::CappedTag<T, 8> d; doesn't really serve any purpose, as the code will work just fine with AVX2+ processing all 8 columns at once. This also made me realise that expanding coeffs to process n+7 would mean 12 columns per row, so with AVX512 there would be a half row on the second read, etc. Expanding to n+15 or higher would have similar issues, so maybe 8x4 is the sweet spot?

@johnplatts (Contributor)

I realise now, looking through the code on Compiler Explorer, that the hn::CappedTag<T, 8> d; doesn't really serve any purpose, as the code will work just fine with AVX2+ processing all 8 columns at once.

There is a difference between hn::CappedTag<double, 8> and hn::ScalableTag<double> as it is possible for Lanes(hn::ScalableTag<double>()) to be greater than 8 on RVV/SVE targets.

Also, there is a difference between hn::CappedTag<uint8_t, 8> and hn::ScalableTag<uint8_t> on targets other than HWY_SCALAR as Lanes(hn::ScalableTag<uint8_t>()) is at least 16 on targets other than HWY_SCALAR.

Load(hn::ScalableTag<double>(), ptr) will actually access Lanes(hn::ScalableTag<double>()) * 8 bytes, even if Lanes(hn::ScalableTag<double>()) > 8 is true.
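
A tiny illustration of that difference (sketch; names are illustrative):

namespace hn = hwy::HWY_NAMESPACE;

// Sketch: the capped tag bounds the lane count, the scalable tag does not.
HWY_ATTR void ShowLaneCounts() {
  const hn::CappedTag<double, 8> d_capped;
  const hn::ScalableTag<double> d_full;
  const size_t n_capped = hn::Lanes(d_capped);  // always <= 8
  const size_t n_full = hn::Lanes(d_full);      // may exceed 8 on SVE/RVV
  // Load(d_full, ptr) reads n_full * 8 bytes, so it can read past an
  // 8-element row; Load(d_capped, ptr) cannot.
  (void)n_capped;
  (void)n_full;
}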

@7sharp9 commented Dec 13, 2024

Ah ok, I didn't think of those situations; I don't have any hardware to try those on just yet. I'll have an M4 when it arrives next week, but I don't think that has SVE. It probably makes sense to cap to the length of the coeffs?

@johnplatts (Contributor)

Ah ok, I didn't think of those situations; I don't have any hardware to try those on just yet. I'll have an M4 when it arrives next week, but I don't think that has SVE. It probably makes sense to cap to the length of the coeffs?

Apple M4 is an Armv9.2-A CPU, but SVE/SVE2 are optional on Armv9-A CPUs, and Apple M4 does not support SVE or SVE2 (according to the AArch64 processor feature description at https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64Processors.td).

@7sharp9 commented Dec 16, 2024

As part of this, the coefficients are used in a calculation stage. I'm wondering if there's a simpler way of doing the first part of it.

This is the code with intrinsics:

template <>
void Filter::process(const float* input, float* output, const size_t numSamples) {
    __m128d xm1_vec = _mm_loaddup_pd (&state.xm1);
    __m128d xm2_vec = _mm_loaddup_pd (&state.xm2);
    __m128d ym1_vec = _mm_loaddup_pd (&state.ym1);
    __m128d ym2_vec = _mm_loaddup_pd (&state.ym2);

    const size_t mainLoopEnd = numSamples / 4 * 4; // Largest multiple of 4
    size_t index = 0;
    for (; index < mainLoopEnd; index += 4) {

        // Load 4 inputs
        __m128 inputs = _mm_loadu_ps(&input[index]);

        // Convert the lower two elements (x0, x1) to double precision
        __m128d x0_x1 = _mm_cvtps_pd(inputs);

        // Convert the lower two (now x2, x3) to double precision
        __m128d x2_x3 = _mm_cvtps_pd(_mm_movehl_ps(inputs, inputs));

        // Unpack input samples
        __m128d x0  = _mm_unpacklo_pd(x0_x1, x0_x1);
        __m128d xp1 = _mm_unpackhi_pd(x0_x1, x0_x1);
        __m128d xp2 = _mm_unpacklo_pd(x2_x3, x2_x3);
        __m128d xp3 = _mm_unpackhi_pd(x2_x3, x2_x3);
...

With highway would this be achieved with:

auto xm1_vec = hn::Set(d, state[0]);

and then:

        auto inputs = LoadU (f, &input[index]);
        auto x0_x1 = hn::ConvertTo(d, inputs);
        auto x2_x3 = hn::ConvertTo(d, hn::UpperHalf(f, inputs));

Finally

auto x0 = hn::InterleaveLower(x0_x1, x0_x1);

All this snippet at the top is doing is loading 4 inputs (which will be floats), converting them to doubles, then dropping them into x0/1/2/3. It feels like there's an operation in Highway to simplify this, but the API is pretty big and I'm a little slow finding my way around compared to the intrinsics. Admittedly the intrinsics might be a sloppy way of doing this part!

@johnplatts (Contributor)

As part of this, the coefficients are used in a calculation stage. I'm wondering if there's a simpler way of doing the first part of it.

This is the code with intrinsics:

template <>
void Filter::process(const float* input, float* output, const size_t numSamples) {
    __m128d xm1_vec = _mm_loaddup_pd (&state.xm1);
    __m128d xm2_vec = _mm_loaddup_pd (&state.xm2);
    __m128d ym1_vec = _mm_loaddup_pd (&state.ym1);
    __m128d ym2_vec = _mm_loaddup_pd (&state.ym2);

    const size_t mainLoopEnd = numSamples / 4 * 4; // Largest multiple of 4
    size_t index = 0;
    for (; index < mainLoopEnd; index += 4) {

        // Load 4 inputs
        __m128 inputs = _mm_loadu_ps(&input[index]);

        // Convert the lower two elements (x0, x1) to double precision
        __m128d x0_x1 = _mm_cvtps_pd(inputs);

        // Convert the lower two (now x2, x3) to double precision
        __m128d x2_x3 = _mm_cvtps_pd(_mm_movehl_ps(inputs, inputs));

        // Unpack input samples
        __m128d x0  = _mm_unpacklo_pd(x0_x1, x0_x1);
        __m128d xp1 = _mm_unpackhi_pd(x0_x1, x0_x1);
        __m128d xp2 = _mm_unpacklo_pd(x2_x3, x2_x3);
        __m128d xp3 = _mm_unpackhi_pd(x2_x3, x2_x3);
...

Here is the version of the above loop with Google Highway:

template <>
void Filter::process(const float* input, float* output, const size_t numSamples) {
    const hn::ScalableTag<float> df32;
    const hn::RepartitionToWide<decltype(df32)> df64;
    auto xm1_vec = hn::Set(df64, state.xm1);
    auto xm2_vec = hn::Set(df64, state.xm2);
    auto ym1_vec = hn::Set(df64, state.ym1);
    auto ym2_vec = hn::Set(df64, state.ym2);

    const size_t N = Lanes(df32);

    const size_t mainLoopEnd = numSamples & (size_t{0} - N); // Largest multiple of N
    size_t index = 0;
    for (; index < mainLoopEnd; index += N) {
        // Load N inputs
        auto inputs = hn::LoadU(df32, &input[index]);

        // Convert the even lanes of inputs to double precision
        // f64_inputs_even = {inputs[0], inputs[2], inputs[4], ... }
        auto f64_inputs_even = hn::PromoteEvenTo(df64, inputs);

        // Convert the odd lanes of inputs to double precision
        // f64_inputs_odd = {inputs[1], inputs[3], inputs[5], ... }
        auto f64_inputs_odd = hn::PromoteOddTo(df64, inputs);

        // Unpack input samples

        // x0 = { inputs[0], inputs[0], inputs[4], inputs[4], ... }
        auto x0 = hn::InterleaveLower(df64, f64_inputs_even, f64_inputs_even);
        // xp1 = { inputs[1], inputs[1], inputs[5], inputs[5], ... }
        auto xp1 = hn::InterleaveLower(df64, f64_inputs_odd, f64_inputs_odd);
        // xp2 = { inputs[2], inputs[2], inputs[6], inputs[6], ... }
        auto xp2 = hn::InterleaveUpper(df64, f64_inputs_even, f64_inputs_even);
        // xp3 = { inputs[3], inputs[3], inputs[7], inputs[7], ... }
        auto xp3 = hn::InterleaveUpper(df64, f64_inputs_odd, f64_inputs_odd);
...

@7sharp9 commented Dec 16, 2024

Thanks @johnplatts Is that the best way of doing this?

In the next phase I do the calc, and I think I'll be able to avoid using multiple accumulators by using the StoreInterleaved4 you showed with the coefficients.

       int accumulatorNo = 2;
        __m128d accumulators[accumulatorNo];

        accumulators[0] = coefficients->coefficients[0] * xp3 + coefficients->coefficients[1] * xp2
                          + coefficients->coefficients[2] * xp1 + coefficients->coefficients[3] * x0
                          + coefficients->coefficients[4] * xm1_vec + coefficients->coefficients[5] * xm2_vec
                          + coefficients->coefficients[6] * ym1_vec + coefficients->coefficients[7] * ym2_vec;

        accumulators[1] = coefficients->coefficients[8] * xp3 + coefficients->coefficients[9] * xp2
                          + coefficients->coefficients[10] * xp1 + coefficients->coefficients[11] * x0
                          + coefficients->coefficients[12] * xm1_vec + coefficients->coefficients[13] * xm2_vec
                          + coefficients->coefficients[14] * ym1_vec + coefficients->coefficients[15] * ym2_vec;


        // Store results back to output
        _mm_storeu_ps(&output[index], _mm_movelh_ps(_mm_cvtpd_ps(accumulators[0]), _mm_cvtpd_ps(accumulators[1])));

        // Update state variables for next iteration
        xm2_vec = xp2;
        xm1_vec = xp3;
        ym2_vec = _mm_unpacklo_pd (accumulators[1], accumulators[1]);
        ym1_vec = _mm_unpackhi_pd (accumulators[1], accumulators[1]);
    }

    // Update the FilterState struct with the new state variables
    state.xm1 = _mm_cvtsd_f64 (xm1_vec);
    state.xm2 = _mm_cvtsd_f64 (xm2_vec);
    state.ym1 = _mm_cvtsd_f64 (ym1_vec);
    state.ym2 = _mm_cvtsd_f64 (ym2_vec);

    for (; index < numSamples; ++index) {
        output[index] = processSample(input[index]);
    }

e.g. rather than the two accumulators, I should be able to make that agnostic or at least better and use StoreInterleaved2 or something like that...

@johnplatts (Contributor)

Thanks @johnplatts Is that the best way of doing this?
accumulators[0] = coefficients->coefficients[0] * xp3 + coefficients->coefficients[1] * xp2
+ coefficients->coefficients[2] * xp1 + coefficients->coefficients[3] * x0
+ coefficients->coefficients[4] * xm1_vec + coefficients->coefficients[5] * xm2_vec
+ coefficients->coefficients[6] * ym1_vec + coefficients->coefficients[7] * ym2_vec;

Is coefficients->coefficients[i] a F64x2 vector (equivalent to SSE2 __m128d, AArch64 NEON float64x2_t, or VSX __vector double), or is coefficients->coefficients[i] a scalar value?

@7sharp9 commented Dec 17, 2024

It is an F64x2; it's essentially the result of CoefficientState.

In the code you posted:

// x0 = { inputs[0], inputs[0], inputs[4], inputs[4], ... }
        auto x0 = hn::InterleaveLower(df64, f64_inputs_even, f64_inputs_even);

That seems to broadcast the first sample to half the lanes and inputs[4] to the rest, whereas I was broadcasting one value to all the lanes. I'm sure you did that intentionally so that it is lane-length agnostic. I'll happily try to adapt the rest of the algorithm if I can.

The original code reads 4 samples, converts them to doubles, broadcasts each value to all lanes, then multiplies and adds with each of the coefficients. It does that using an accumulator per row of coefficients before writing the accumulated results to the output. The bit that's hard to work out is the accumulator part. I think if I get my head around it, it might be similar to the StoreInterleaved part of CoefficientState.

For reference as a scalar calculation it would be:

    //calculate output
    const double y0 = b0 * x0 + b1 * xm1 + b2 * xm2 + a1 * ym1 + a2 * ym2;
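
For completeness, a sketch of that scalar recurrence together with the state update implied by the vector code above (the FilterState member names follow the thread; this is illustrative, not the project's actual processSample):

struct FilterState { double xm1 = 0, xm2 = 0, ym1 = 0, ym2 = 0; };

// Sketch: one output sample plus the delay-line update.
inline double ProcessSampleScalar(FilterState& s, double b0, double b1, double b2,
                                  double a1, double a2, double x0) {
  const double y0 = b0 * x0 + b1 * s.xm1 + b2 * s.xm2 + a1 * s.ym1 + a2 * s.ym2;
  s.xm2 = s.xm1;  // shift input history
  s.xm1 = x0;
  s.ym2 = s.ym1;  // shift output history
  s.ym1 = y0;
  return y0;
}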

@johnplatts commented Dec 17, 2024

Thanks @johnplatts Is that the best way of doing this?

In the next phase I do the calc, and I think I'll be able to avoid using multiple accumulators by using the StoreInterleaved4 you showed with the coefficients.

       int accumulatorNo = 2;
        __m128d accumulators[accumulatorNo];

        accumulators[0] = coefficients->coefficients[0] * xp3 + coefficients->coefficients[1] * xp2
                          + coefficients->coefficients[2] * xp1 + coefficients->coefficients[3] * x0
                          + coefficients->coefficients[4] * xm1_vec + coefficients->coefficients[5] * xm2_vec
                          + coefficients->coefficients[6] * ym1_vec + coefficients->coefficients[7] * ym2_vec;

        accumulators[1] = coefficients->coefficients[8] * xp3 + coefficients->coefficients[9] * xp2
                          + coefficients->coefficients[10] * xp1 + coefficients->coefficients[11] * x0
                          + coefficients->coefficients[12] * xm1_vec + coefficients->coefficients[13] * xm2_vec
                          + coefficients->coefficients[14] * ym1_vec + coefficients->coefficients[15] * ym2_vec;


        // Store results back to output
        _mm_storeu_ps(&output[index], _mm_movelh_ps(_mm_cvtpd_ps(accumulators[0]), _mm_cvtpd_ps(accumulators[1])));

        // Update state variables for next iteration
        xm2_vec = xp2;
        xm1_vec = xp3;
        ym2_vec = _mm_unpacklo_pd (accumulators[1], accumulators[1]);
        ym1_vec = _mm_unpackhi_pd (accumulators[1], accumulators[1]);
    }

    // Update the FilterState struct with the new state variables
    state.xm1 = _mm_cvtsd_f64 (xm1_vec);
    state.xm2 = _mm_cvtsd_f64 (xm2_vec);
    state.ym1 = _mm_cvtsd_f64 (ym1_vec);
    state.ym2 = _mm_cvtsd_f64 (ym2_vec);

    for (; index < numSamples; ++index) {
        output[index] = processSample(input[index]);
    }

e.g. rather than the two accumulators, I should be able to make that agnostic or at least better and use StoreInterleaved2 or something like that...

Here is the updated version of the above code with Google Highway:

template <>
void Filter::process(const float* input, float* output,
                     const size_t numSamples) {
  // Ensure that Lanes(df32) is equal to 4 by using hn::FixedTag<float, 4>
  const hn::FixedTag<float, 4> df32;
  const hn::Half<decltype(df32)> dh_f32;
  const hn::RepartitionToWide<decltype(df32)> df64;
  auto xm1_vec = hn::Set(df64, state.xm1);
  auto xm2_vec = hn::Set(df64, state.xm2);
  auto ym1_vec = hn::Set(df64, state.ym1);
  auto ym2_vec = hn::Set(df64, state.ym2);

  const size_t N = Lanes(df32);

  const size_t mainLoopEnd =
      numSamples & (size_t{0} - N);  // Largest multiple of 4
  size_t index = 0;
  for (; index < mainLoopEnd; index += N) {
    // Load 4 inputs
    auto inputs = hn::LoadU(df32, &input[index]);

    // Convert the lower two elements (x0, x1) to double precision
    auto x0_x1 = hn::PromoteLowerTo(df64, inputs);

    // Convert the upper two elements (x2, x3) to double precision
    auto x2_x3 = hn::PromoteUpperTo(df64, inputs);

    // Unpack input samples
    auto x0 = hn::InterleaveLower(df64, x0_x1, x0_x1);
    auto xp1 = hn::InterleaveUpper(df64, x0_x1, x0_x1);
    auto xp2 = hn::InterleaveLower(df64, x2_x3, x2_x3);
    auto xp3 = hn::InterleaveUpper(df64, x2_x3, x2_x3);

    auto accumulator0 = hn::MulAdd(
        hn::LoadU(df64, &coefficients->coefficients[7]), ym2_vec,
        hn::MulAdd(
            hn::LoadU(df64, &coefficients->coefficients[6]), ym1_vec,
            hn::MulAdd(
                hn::LoadU(df64, &coefficients->coefficients[5]), xm2_vec,
                hn::MulAdd(
                    hn::LoadU(df64, &coefficients->coefficients[4]), xm1_vec,
                    hn::MulAdd(
                        hn::LoadU(df64, &coefficients->coefficients[3]), x0,
                        hn::MulAdd(
                            hn::LoadU(df64, &coefficients->coefficients[2]),
                            xp1,
                            hn::MulAdd(
                                hn::LoadU(df64, &coefficients->coefficients[1]),
                                xp2,
                                hn::Mul(hn::LoadU(df64, &coefficients
                                                             ->coefficients[0]),
                                        xp3))))))));

    auto accumulator1 = hn::MulAdd(
        hn::LoadU(df64, &coefficients->coefficients[15]), ym2_vec,
        hn::MulAdd(
            hn::LoadU(df64, &coefficients->coefficients[14]), ym1_vec,
            hn::MulAdd(
                hn::LoadU(df64, &coefficients->coefficients[13]), xm2_vec,
                hn::MulAdd(
                    hn::LoadU(df64, &coefficients->coefficients[12]), xm1_vec,
                    hn::MulAdd(
                        hn::LoadU(df64, &coefficients->coefficients[11]), x0,
                        hn::MulAdd(
                            hn::LoadU(df64, &coefficients->coefficients[10]),
                            xp1,
                            hn::MulAdd(
                                hn::LoadU(df64, &coefficients->coefficients[9]),
                                xp2,
                                hn::Mul(hn::LoadU(df64,
                                                  &coefficients->coefficients[8]),
                                        xp3))))))));

    // Store results back to output
    hn::StoreU(hn::Combine(df32, hn::DemoteTo(dh_f32, accumulator1),
                           hn::DemoteTo(dh_f32, accumulator0)),
               df32, &output[index]);

    // Update state variables for next iteration
    xm2_vec = xp2;
    xm1_vec = xp3;
    ym2_vec = hn::InterleaveLower(df64, accumulator1, accumulator1);
    ym1_vec = hn::InterleaveUpper(df64, accumulator1, accumulator1);
  }

  // Update the FilterState struct with the new state variables
  state.xm1 = hn::GetLane(xm1_vec);
  state.xm2 = hn::GetLane(xm2_vec);
  state.ym1 = hn::GetLane(ym1_vec);
  state.ym2 = hn::GetLane(ym2_vec);

  for (; index < numSamples; ++index) {
    output[index] = processSample(input[index]);
  }

Note that the above code requires that HWY_LANES(float) >= 4 && Lanes(hn::ScalableTag<float>()) >= 4 is true, and HWY_LANES(float) >= 4 && Lanes(hn::ScalableTag<float>()) >= 4 is guaranteed to be true on all targets other than HWY_SCALAR.
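
If the HWY_SCALAR target ever needs to be supported as well, one option is to guard the vector path and fall back to the scalar loop, roughly like this (sketch only):

#if HWY_TARGET == HWY_SCALAR
  // Single-lane target: just run the scalar fallback for all samples.
  for (size_t i = 0; i < numSamples; ++i) {
    output[i] = processSample(input[i]);
  }
#else
  // ... the vector implementation shown above ...
#endif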

@7sharp9 commented Dec 17, 2024

@johnplatts I really appreciate you looking at this; I'll try to understand what you have done. The limitation of >= 4 is based on the lanes used by the coefficients and the fact that I only calculate 4 rows?

I tried to think how this could work with only one accumulator but I could not get my head round doing it efficiently.

@7sharp9 commented Dec 18, 2024

This is my last question for now. In the final scalar processing for the remaining samples, is it better to do the calculations as plain scalar math without using Highway, or to use just one lane like this:

for (; index < numSamples; ++index) {
    // Load the input sample
    auto x0 = hn::Set(df64, input[index]);

    // Compute the output sample
    auto y0 =
              hn::MulAdd(hn::Set(df64, coefficients[7]), ym2_vec,
              hn::MulAdd(hn::Set(df64, coefficients[6]), ym1_vec,
              hn::MulAdd(hn::Set(df64, coefficients[5]), xm2_vec,
              hn::MulAdd(hn::Set(df64, coefficients[4]), xm1_vec,
              hn::Mul(hn::Set(df64, coefficients[3]), x0)))));

    // Store the output sample back as a scalar
    output[index] = static_cast<float>(hn::GetLane(y0));

    // Update state variables for the next iteration
    xm2_vec = xm1_vec;
    xm1_vec = x0;
    ym2_vec = ym1_vec;
    ym1_vec = y0;
}

It might have been better to do a masked or padded version, but the manipulation of the state variables complicates that.
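
For comparison, the plain-scalar alternative for the tail might look roughly like this (sketch, mirroring the coefficient indices used in the one-lane version above):

// Pull the state out of the vectors once, then run the scalar recurrence.
double xm1 = hn::GetLane(xm1_vec);
double xm2 = hn::GetLane(xm2_vec);
double ym1 = hn::GetLane(ym1_vec);
double ym2 = hn::GetLane(ym2_vec);
for (; index < numSamples; ++index) {
  const double x0 = input[index];
  const double y0 = coefficients[3] * x0 + coefficients[4] * xm1 +
                    coefficients[5] * xm2 + coefficients[6] * ym1 +
                    coefficients[7] * ym2;
  output[index] = static_cast<float>(y0);
  xm2 = xm1;  // shift input history
  xm1 = x0;
  ym2 = ym1;  // shift output history
  ym1 = y0;
}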
