Hi there,
I am an experienced C++ programmer, but I'm completely lost when it comes to SIMD operations. I have been trying your library for over a week now and still cannot figure out how to make it outperform the straightforward approach.
In my particular case, I am trying to implement a SAXPY operation according to the BLAS standard using SIMD operations. My vectors are huge, and yet the straightforward version is much faster. I have appended the two examples with my performance measurements at the bottom.
The copy operations into the buffer array are the most time-consuming part. My suspicion is that these copies are not needed and that filling the native_simd<float> can happen in a much more implicit way. However, I haven't figured out how to do it yet, and searching the source code only confuses me further.
By the way, I have already tried copying the values directly into the native_simd<float> object, with very similar results.
Could you please provide an example of how to properly feed data into a native_simd<float> vector? I would really appreciate it, and I will gladly contribute some usage examples, since I think good documentation is the key to getting this neat library into the C++ standard library.
Obvious way
template <class numeric_t>
void normal_axpy(const int n, const numeric_t a,
                 const numeric_t* x, const int inc_x,
                 numeric_t* y, const int inc_y) {
    for (auto i = 0, i_x = 0, i_y = 0; i < n; ++i, i_x += inc_x, i_y += inc_y) {
        y[i_y] = a * x[i_x] + y[i_y];
    }
}
The problem seems to be inc_x and inc_y. If these were known to be 1, then you could simply write:
using V = native_simd<numeric_t>;
// Note: this only handles full SIMD-width chunks; a real implementation
// additionally needs a scalar loop for the remaining n % V::size() elements.
for (int i = 0; i + int(V::size()) <= n; i += V::size()) {
    V yv = a * V(x + i, element_aligned) + V(y + i, element_aligned);
    yv.copy_to(y + i, element_aligned);
}
This gives you two load, one FMA, and one store instruction (which is the same as in the scalar case, except that in the SIMD case the loads, stores, and FMAs do way more work in one go). This perfectly fills up the execution ports on x86. But once the strides are larger than one, the CPU has to do 2*V::size() scalar load instructions and V::size() scalar store instructions per single SIMD FMA instruction. Consequently the whole time is spent in loads and stores (and shuffles, to create SIMD registers out of the scalar loads) and the FMA unit is mostly idle. Worse, AVX or even AVX-512 FMAs will reduce the CPU clock and thus make loads and stores even slower.
You could have a runtime condition on the strides to use vector loads and stores. If you often have strides of 1, this should improve your results. Else, reorganize your matrices so that the stride is statically 1.