note on EntrywiseMap and vectorization #237

Open
jeffhammond opened this issue Aug 9, 2017 · 2 comments
@jeffhammond
Member

This is just a comment that @timmoon10 and others may find useful.

I see the following output when compiling Elemental with Intel 18 beta:

/home/jrhammon/Work/Elemental/git/include/El/blas_like/level1/EntrywiseMap.hpp(93): warning #15552: loop was not vectorized with "simd"

(the same warning is emitted four times)

For reference, the relevant code is below, where EL_SIMD expands to _Pragma("omp simd").

template<typename S,typename T>
void EntrywiseMap
( const Matrix<S>& A, Matrix<T>& B, function<T(const S&)> func )
{
    EL_DEBUG_CSE
    const Int m = A.Height();
    const Int n = A.Width();
    B.Resize( m, n );
    const S* ABuf = A.LockedBuffer();
    T* BBuf = B.Buffer();
    const Int ALDim = A.LDim();
    const Int BLDim = B.LDim();
    EL_PARALLEL_FOR
    for( Int j=0; j<n; ++j )
    {
        EL_SIMD
        for( Int i=0; i<m; ++i )
        {
            BBuf[i+j*BLDim] = func(ABuf[i+j*ALDim]);
        }
    }
}

The problem is that vectorizing a call through std::function is hard: the type-erased indirection hides the callee from the compiler. If one wants these loops to vectorize, one likely has to declare the mapped functions as SIMD functions (see e.g. https://software.intel.com/en-us/node/524514 for details).
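For illustration, a minimal sketch of what declaring a SIMD function looks like with OpenMP (the function and names here are mine, not Elemental's): the declare simd directive asks the compiler to emit a vector variant of the function that a vectorized loop can call.

```cpp
// Hypothetical example: marking a function as a SIMD function so the
// compiler can generate a vector variant callable from a "simd" loop.
#pragma omp declare simd
inline double scale_and_shift(double x) { return 2.0 * x + 1.0; }

// Without the declaration above, a call through an opaque function in this
// loop body would typically block vectorization, as in warning #15552.
void apply(const double* in, double* out, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        out[i] = scale_and_shift(in[i]);
}
```

Without OpenMP enabled the pragmas are ignored and the code still compiles and runs sequentially.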

Interestingly enough, the Intel compiler will auto-vectorize lambdas, so if you implement and use EntrywiseMap with lambdas instead of std::function, you are likely to get SIMD code.
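A sketch of what that could look like (this is not Elemental's actual EntrywiseMap signature): templating on the callable type F, rather than accepting std::function<T(const S&)>, lets the compiler inline a lambda's body into the inner loop, which is the precondition for SIMD code generation.

```cpp
// Illustrative only: a lambda-friendly variant of the entrywise map,
// operating directly on column-major buffers with leading dimensions.
template <typename S, typename T, typename F>
void EntrywiseMapSketch(const S* ABuf, T* BBuf,
                        int m, int n, int ALDim, int BLDim, F func)
{
    for (int j = 0; j < n; ++j)
    {
        // With func a lambda (a concrete, inlinable type), this loop has a
        // much better chance of vectorizing than with std::function.
        #pragma omp simd
        for (int i = 0; i < m; ++i)
            BBuf[i + j*BLDim] = func(ABuf[i + j*ALDim]);
    }
}
```

Usage: EntrywiseMapSketch(ABuf, BBuf, m, n, ALDim, BLDim, [](double x){ return x*x; });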

Another way to realize threaded+vectorized code in Elemental would be to use the C++17 Parallel STL, which Intel has implemented in the Intel 18 beta (although this is currently somewhat irrelevant due to #215 and similar). std::for_each( pstl::execution::unseq, ...) generates SIMD code for lambdas. Unfortunately, unseq isn't standard (yet), but it's trivial to abstract that away.
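One way the policy could be abstracted away, sketched under the assumption that the build system knows whether a Parallel STL is available (EL_HAVE_PSTL is a hypothetical macro, not part of Elemental): when it is, pass an execution policy; otherwise fall back to the plain sequential overload.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical configuration macro: define EL_HAVE_PSTL (and link the
// vendor's Parallel STL backend, e.g. TBB for libstdc++) to get the
// threaded+vectorized policy; otherwise the call below is the ordinary
// sequential std::for_each.
#ifdef EL_HAVE_PSTL
  #include <execution>
  #define EL_EXEC_POLICY std::execution::par_unseq,
#else
  #define EL_EXEC_POLICY
#endif

void entrywise_double(std::vector<double>& v)
{
    std::for_each(EL_EXEC_POLICY v.begin(), v.end(),
                  [](double& x) { x *= 2.0; });
}
```

Note that standard C++17 offers par_unseq rather than a bare unseq policy (unseq was standardized later), which is exactly why a thin wrapper like this is useful.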

@timmoon10
Copy link
Contributor

It might be better to remove the SIMD functionality here if we assume that the mapped function is expensive and hard to vectorize. In that regime, the cost of the division and modulus operations introduced by 'omp parallel for collapse(2)' would be relatively minor, and load balancing may matter more. We have already implemented vectorized code for the common memory-bound operations.
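A sketch of the suggested alternative (illustrative names, std::exp standing in for an expensive mapped function): collapsing both loops gives the OpenMP runtime a single iteration space of m*n entries to balance across threads, at the cost of the index divisions/moduli the collapse transformation introduces.

```cpp
#include <cmath>

// Illustrative only: entrywise map parallelized over the collapsed 2-D
// iteration space, with no SIMD pragma on the inner loop.
void entrywise_exp(const double* A, double* B, int m, int n,
                   int ALDim, int BLDim)
{
    #pragma omp parallel for collapse(2)
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            B[i + j*BLDim] = std::exp(A[i + j*ALDim]);
}
```

Without OpenMP enabled the pragma is ignored and the loops run sequentially.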

@jeffhammond
Member Author

@timmoon10 Yeah, but I wonder how far we should go down this path. Do we not have a way to give the user a pointer to this data so they can implement an element-wise map in their own code?
