Missing documentation about how to use the library. #15

Open
sxleixer opened this issue Apr 26, 2020 · 1 comment
@sxleixer (Contributor)

Hi there,
I am an experienced C++ programmer, but I'm completely lost when it comes to SIMD operations. I have been trying your library for over a week now, and I still cannot figure out how to make it outperform the straightforward approach.

In my particular case, I am trying to implement the SAXPY operation from the BLAS standard using SIMD operations. My vectors are huge, and yet the straightforward way is much faster. I have appended both implementations, together with my performance measurements, below.

The copy operations into the buffer array are the most time-consuming part. My suspicion is that these copies are unnecessary and that filling the native_simd<float> can happen in a much more implicit way. However, I haven't figured out how to do that yet, and searching through the source code confuses me even more.

By the way, I also tried writing the values directly into the native_simd<float> object, with very similar results.

Could you please provide an example of how to properly feed data into a native_simd<float> vector? I would really appreciate it, and I will gladly contribute some usage examples in return; I think documentation is the key to getting this neat library into the C++ standard library.

Obvious way

// Scalar reference implementation: y[i_y] = a * x[i_x] + y[i_y],
// walking x and y with strides inc_x and inc_y.
template<class numeric_t>
void normal_axpy(const int n, const numeric_t a,
        const numeric_t* x, const int inc_x, numeric_t* y, const int inc_y) {
    for (auto i = 0, i_x = 0, i_y = 0; i < n; ++i, i_x += inc_x, i_y += inc_y) {
        y[i_y] = a * x[i_x] + y[i_y];
    }
}

Results

Results for 'Conventional'; Vector Dimension: 1000; Iterations: 10000
 -- Min: 3 (3.00 µs)
 -- Max: 45 (45.00 µs)
 -- Average: 5.0503 (5.05 µs)
 -- Median: 5 (5.00 µs)
 -- Lower Quartile: 4 (4.00 µs)
 -- Upper Quartile: 5 (5.00 µs)

Most probably wrong SIMD way

// Assumes #include <algorithm> and <experimental/simd>,
// plus using namespace std::experimental;
template<class numeric_t>
void axpy(const int n, const numeric_t a,
        const numeric_t* x, const int inc_x, numeric_t* y, const int inc_y) {
    constexpr size_t _chunk_size = native_simd<numeric_t>::size();

    native_simd<numeric_t> aa, xx, yy;

    size_t i_x = 0, i_y = 0, i_yr = 0;
    // Broadcast the scalar a into every SIMD lane.
    for (size_t j = 0; j < _chunk_size; ++j)
        aa[j] = a;

    numeric_t buffer[_chunk_size];
    for (size_t i = 0; i < size_t(n); i += _chunk_size) {
        const size_t num_elements = std::min(_chunk_size, size_t(n) - i);

        // Gather the strided x values into a contiguous buffer, then load.
        for (size_t j = 0; j < num_elements; ++j, i_x += inc_x)
            buffer[j] = x[i_x];
        xx.copy_from(buffer, element_aligned);

        // Same for y.
        for (size_t j = 0; j < num_elements; ++j, i_y += inc_y)
            buffer[j] = y[i_y];
        yy.copy_from(buffer, element_aligned);

        yy = aa * xx + yy;

        // Store to the buffer, then scatter back to the strided y.
        yy.copy_to(buffer, element_aligned);
        for (size_t j = 0; j < num_elements; ++j, i_yr += inc_y)
            y[i_yr] = buffer[j];
    }
}

Results

Results for 'SIMD'; Vector Dimension: 1000; Iterations: 10000
 -- Min: 18 (18.00 µs)
 -- Max: 118 (118.00 µs)
 -- Average: 26.0297 (26.03 µs)
 -- Median: 27 (27.00 µs)
 -- Lower Quartile: 21 (21.00 µs)
 -- Upper Quartile: 28 (28.00 µs)
@mattkretz (Member)

The problem seems to be inc_x and inc_y. If these were known to be 1, then you could simply write:

using V = native_simd<numeric_t>;
// Note: this assumes n is a multiple of V::size(); see the tail handling below.
for (int i = 0; i < n; i += V::size()) {
  V yv = a * V(x + i, element_aligned) + V(y + i, element_aligned);
  yv.copy_to(y + i, element_aligned);
}

This gives you two loads, one FMA, and one store instruction per iteration (the same as in the scalar case, except that the SIMD loads, stores, and FMAs do far more work in one go). This perfectly fills up the execution ports on x86. But once the strides are larger than one, the CPU has to do 2*V::size() scalar load instructions and V::size() scalar store instructions per single SIMD FMA instruction. Consequently, the whole time is spent in loads and stores (and in shuffles, to assemble SIMD registers out of the scalar loads), while the FMA unit sits mostly idle. Worse, AVX or even AVX-512 FMAs will reduce the CPU clock and thus make the loads and stores even slower.
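For completeness, here is one way to handle the tail when n is not a multiple of V::size(): run the vectorized loop over the full chunks and finish with a scalar epilogue. This is an untested sketch; the name axpy_unit_stride and the stdx alias are mine, the rest is the same std::experimental::simd API as above.

#include <experimental/simd>
namespace stdx = std::experimental;

template <class numeric_t>
void axpy_unit_stride(const int n, const numeric_t a,
                      const numeric_t* x, numeric_t* y) {
  using V = stdx::native_simd<numeric_t>;
  const int simd_end = n - n % int(V::size());
  // Vectorized main loop: two contiguous loads, one FMA, one contiguous store.
  for (int i = 0; i < simd_end; i += int(V::size())) {
    V yv = a * V(x + i, stdx::element_aligned) + V(y + i, stdx::element_aligned);
    yv.copy_to(y + i, stdx::element_aligned);
  }
  // Scalar epilogue for the remaining n % V::size() elements.
  for (int i = simd_end; i < n; ++i)
    y[i] = a * x[i] + y[i];
}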

You could add a runtime condition on the strides to use vector loads and stores. If you often have strides of 1, this should improve your results. Otherwise, reorganize your matrices so that the stride is statically 1.
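In code, the dispatch could look like this (illustrative only; it reuses your normal_axpy from above and the axpy_unit_stride sketch):

template <class numeric_t>
void axpy_dispatch(const int n, const numeric_t a,
                   const numeric_t* x, const int inc_x,
                   numeric_t* y, const int inc_y) {
  if (inc_x == 1 && inc_y == 1) {
    axpy_unit_stride(n, a, x, y);           // contiguous: vector loads/stores pay off
  } else {
    normal_axpy(n, a, x, inc_x, y, inc_y);  // strided: scalar loads dominate anyway
  }
}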
