note on EntrywiseMap and vectorization #237

Open
jeffhammond opened this issue Aug 9, 2017 · 2 comments
@jeffhammond
Member

This is just a comment that @timmoon10 and others may find useful.

I see the following output when compiling Elemental with Intel 18 beta:

/home/jrhammon/Work/Elemental/git/include/El/blas_like/level1/EntrywiseMap.hpp(93): warning #15552: loop was not vectorized with "simd"

(the same warning is emitted four times)

For reference, the relevant code is below, where EL_SIMD expands to _Pragma("omp simd").

template<typename S,typename T>
void EntrywiseMap
( const Matrix<S>& A, Matrix<T>& B, function<T(const S&)> func )
{
    EL_DEBUG_CSE
    const Int m = A.Height();
    const Int n = A.Width();
    B.Resize( m, n );
    const S* ABuf = A.LockedBuffer();
    T* BBuf = B.Buffer();
    const Int ALDim = A.LDim();
    const Int BLDim = B.LDim();
    EL_PARALLEL_FOR
    for( Int j=0; j<n; ++j )
    {
        EL_SIMD
        for( Int i=0; i<m; ++i )
        {
            BBuf[i+j*BLDim] = func(ABuf[i+j*ALDim]);
        }
    }
}

The problem is that vectorizing a call through std::function is hard: the type-erased indirection hides the callee from the compiler. If one wants these loops to vectorize, one likely has to declare the mapped functions as SIMD functions (see e.g. https://software.intel.com/en-us/node/524514 for details).
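For illustration, a minimal sketch of what declaring a SIMD function looks like with OpenMP (the function and names here are mine, not Elemental's): the declare simd directive asks the compiler to emit a vector variant of the function that a vectorized loop can call.

```cpp
// Hypothetical example: marking a function as a SIMD function so the
// compiler can generate a vector variant callable from a "simd" loop.
#pragma omp declare simd
inline double scale_and_shift(double x) { return 2.0 * x + 1.0; }

// Without the declaration above, a call through an opaque function in this
// loop body would typically block vectorization, as in warning #15552.
void apply(const double* in, double* out, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        out[i] = scale_and_shift(in[i]);
}
```

Without OpenMP enabled the pragmas are ignored and the code still compiles and runs sequentially.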

Interestingly enough, the Intel compiler will auto-vectorize lambdas, so if you implement and use EntrywiseMap with lambdas instead of std::function, you are likely to get SIMD code.
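A sketch of what that could look like (this is not Elemental's actual EntrywiseMap signature): templating on the callable type F, rather than accepting std::function<T(const S&)>, lets the compiler inline a lambda's body into the inner loop, which is the precondition for SIMD code generation.

```cpp
// Illustrative only: a lambda-friendly variant of the entrywise map,
// operating directly on column-major buffers with leading dimensions.
template <typename S, typename T, typename F>
void EntrywiseMapSketch(const S* ABuf, T* BBuf,
                        int m, int n, int ALDim, int BLDim, F func)
{
    for (int j = 0; j < n; ++j)
    {
        // With func a lambda (a concrete, inlinable type), this loop has a
        // much better chance of vectorizing than with std::function.
        #pragma omp simd
        for (int i = 0; i < m; ++i)
            BBuf[i + j*BLDim] = func(ABuf[i + j*ALDim]);
    }
}
```

Usage: EntrywiseMapSketch(ABuf, BBuf, m, n, ALDim, BLDim, [](double x){ return x*x; });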

Another way to realize threaded+vectorized code in Elemental would be to use the C++17 Parallel STL, which Intel has implemented in the Intel 18 beta (although this is currently somewhat irrelevant due to #215 and similar). std::for_each( pstl::execution::unseq, ...) generates SIMD code for lambdas. Unfortunately, unseq isn't standard (yet), but it's trivial to abstract that away.
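One way the policy could be abstracted away, sketched under the assumption that the build system knows whether a Parallel STL is available (EL_HAVE_PSTL is a hypothetical macro, not part of Elemental): when it is, pass an execution policy; otherwise fall back to the plain sequential overload.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical configuration macro: define EL_HAVE_PSTL (and link the
// vendor's Parallel STL backend, e.g. TBB for libstdc++) to get the
// threaded+vectorized policy; otherwise the call below is the ordinary
// sequential std::for_each.
#ifdef EL_HAVE_PSTL
  #include <execution>
  #define EL_EXEC_POLICY std::execution::par_unseq,
#else
  #define EL_EXEC_POLICY
#endif

void entrywise_double(std::vector<double>& v)
{
    std::for_each(EL_EXEC_POLICY v.begin(), v.end(),
                  [](double& x) { x *= 2.0; });
}
```

Note that standard C++17 offers par_unseq rather than a bare unseq policy (unseq was standardized later), which is exactly why a thin wrapper like this is useful.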

@timmoon10
Copy link
Contributor

It might be better to remove the SIMD functionality here if we assume that the mapped function is expensive and hard to vectorize. In that regime, the cost of the division and modulus operations introduced by 'omp parallel for collapse(2)' would be relatively minor, and load balancing may matter more. We have already implemented vectorized code for the common memory-bound operations.
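A sketch of the suggested alternative (illustrative names, std::exp standing in for an expensive mapped function): collapsing both loops gives the OpenMP runtime a single iteration space of m*n entries to balance across threads, at the cost of the index divisions/moduli the collapse transformation introduces.

```cpp
#include <cmath>

// Illustrative only: entrywise map parallelized over the collapsed 2-D
// iteration space, with no SIMD pragma on the inner loop.
void entrywise_exp(const double* A, double* B, int m, int n,
                   int ALDim, int BLDim)
{
    #pragma omp parallel for collapse(2)
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            B[i + j*BLDim] = std::exp(A[i + j*ALDim]);
}
```

Without OpenMP enabled the pragma is ignored and the loops run sequentially.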

@jeffhammond
Member Author

@timmoon10 Yeah, but I wonder how far we should go down this path. Do we not have a way to give the user a pointer to this data so they can implement an element-wise map in their own code?
