does SORTPAIRS need to use reserve+emplace_back? #262

Open
jeffhammond opened this issue Aug 31, 2022 · 7 comments


@jeffhammond

I am working on the C++17 parallel version of SORTPAIRS, and I cannot use emplace_back there. I find that I can switch from reserve to resize and use simple assignment instead of emplace_back, but I don't know if this is consistent with the intent.

Unfortunately, I don't see any non-RAJA parallel references to guide me here. The OMP and CUDA parallel versions just call the RAJA::sort_pairs method.

If it is allowed to use resize+(assign) instead, I'll make that change in the sequential implementation, independent of what I'm doing with C++ parallelism.

Thanks!

old

        vector_of_pairs.reserve(iend-ibegin);

        std::for_each( //std::execution::par, // parallelism leads to incorrectness
                       begin,end,
                       [=,&vector_of_pairs](Index_type iemp) noexcept {
          vector_of_pairs.emplace_back(x[iend*irep + iemp], i[iend*irep + iemp]);
        });

new

        vector_of_pairs.resize(iend-ibegin);

        std::for_each( std::execution::par,
                       begin,end,
                       [=,&vector_of_pairs](Index_type iemp) noexcept {
          vector_of_pairs[iemp] = std::make_pair(x[iend*irep + iemp], i[iend*irep + iemp]);
        });
@MrBurmark
Member

You are right, emplace_back is not parallel safe. Unfortunately resize will default construct items in memory sequentially, which will zero storage for types that are trivially constructible. We really want to do something like uninitialized_fill with std::execution::par, except that we need to generate the elements based on their index. I don't see a way to fill a vector in that way with a parallel policy. We may have to allocate uninitialized storage then use for_each with std::execution::par to initialize the memory with placement new/std::construct_at.

https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGIAMykrgAyeAyYAHI%2BAEaYxBLSAA6oCoRODB7evgGkyamOAqHhUSyx8VK2mPYFDEIETMQEmT5%2BgXaYDul1DQRFkTFxCbb1jc3ZbSO9Yf2lg1IAlLaoXsTI7BwmGgCC5v5hyN5YANQm/m7hAO6n2Js7ZnsMB17Hp24EAJ6JmAD6BMRMhAU11uu32h0wJzObBYJHewO2oMe4Mhbi8jlohDh/huCPuYOeENeYmAJEICBY8LuDyeLzOmFUHTR6Upty8qSMR0SAOIkIAIkcFAR0CAQFy8MRXuhltF6KQjmECMD/FYEdtBcQvA4jlRiJg9RKEQB2FVbI5mo4AN1QeHQR1QX3%2BRGIEHmECtNoAVJz5kc0AxBSdjQKhSKdXqIIl5qcLIHeSDDXHlSDtkw0ahtahUBAFV7YsAwnKc0dXHKpV4ZRC/YKvapC4JfQJq/Ko0aTebg8KQF4GHgAI5eH6JP6vMVkbW6/XXS0dJ3fVBUb6joHGwVMRzIb6iQUj7ke64QQWdlhiWioZBQVwAWjzYXmHtSAC9MPOI9z5u%2BTAno8nTebD6GSG%2BTAmGQBAIA7EV6UZGoRS5MhbnbRCkPbG8GCCBh0FIBDkJw80TAAVisfxeSwswADYLRnQD50XbkgXw3ls3rPBMBYSMjgYVB6TWIdAzbZD/xQRs/k1AhvjXN0qOIOcFyXAA6YBMAIF1rBYti5VUAiLDUxICJI%2BUtJ0vSoyTbZ20/XkTP4s1sLNXUCBWBhpwcajZLo78jV5DhFloTh8N4PwOC0UhUE4NxrGsAVllWQl7h4UgCE0bzFgAawCAAOOSAE5JC4Q0crIrgNH8Q1Cv0ThJACpKQs4XgFBADQEqSxY4FgGBECEti6DichKDQLr6HiYAuDMMw%2BDoAg4nqiBomq6Iwgad5OHi/q2EEAB5BhaCWoLeCwY8jHEbg9vFGc8Eo%2BrdqCBlkDRdZgoVKplt4DFon%2BYh3g8LBqr%2BPAWGexYqAMYAFAANRYi51q%2BQL4v4QQRDEdgKjh%2BQlDUardC4fRDGMCLLH0PBonqyBFntGpLsvQ9Tl5UxLGsMwNCOS91rMJngDZ4B0GiOqqjO5wIFcMY/CxkJphKMo9DyNIBCFyWUmlhg%2BnFuZKmqLpJllrH2k6ARukaJWBnKYYek1439bFw2JEWBRorWPQ/j1AHyo4fzSEC4LQo4VR0rIy8yMkI5gGQZAjhGuTWYgcK6fxo5cEIEgTjiuUPAGuJE/8Lh5l4RLdvfUg0v8TLDXS/wNENQ1JEkfCsqkSQyOdyrSH%2Bw0zDkkv8MkLLi/8Duu5Lt3qs9uqGqa3PSFajrlgIRI0V6iB%2BsSbriAiVh1m933/d9AwOS4fw5I0OT4swfAnRtPQUYR8RkdkRQVHUK7MdIC5/kSJ3fJdqqrs99a0Rngg7SoEcdefsA5BxDmHCOKdF6DXTpnbOzVFgIGAlgeILoG68H%2BmYQ0h9/b4Q0GYLguD8GEMNAPL%2BtVbAjxzloPOaUsFyRbpIdKLcND4TImYLKrCyL4Wdv4T%2BHsKHUO8uPdqE8IBIAXkvOekjBooG3sNSQGhGo0FoJNYg01ZpXXmswD6z1SCrUYAQTa21qr7RxkdYK%2BBdSdAutVKCt1Jp6Meu/YKr13qfQwPdbOxA/pOyBkwEG4NMCQ2hnoi%2Bogr7SBRrfdGD8QBjXkSgPGNhXrEzQWTdIFMqbEVplYSwDMmYszZhzLmPM1b80Fp4FoehRbFEtljKWNRTaNPSAbWYRttY1D1k0Kp2Qta8x1rUSYbSJZaw1r04WZsph1PaVbJYKw7ZYwdusHgPk/L8N4J7YBm8Dg41DpIfe%2B8jiR2SXKOOTpYHJ1QKnHkuwzBZ1HjQ1KAQ97%2BDee8j5Hz0FNxAIaPe5ES5ZXSgC/wQKsFkIERwYejUhEtVEUgKef9pHXOgXEFebBODbIDrsneByD7BWPvHHxwosbhMRkMaJaN77BV0GNZ%2BTBX7HTWR/CFmzOA/2nmiABQCfYgK3nsqQhzGaRxRUvdO9z4G50QcgwYaD36N0wWRBhGcyIaDyiqtVLd0qspqlCyhMKEH5xeXJT5pr3m8I2bqyVTznZmEtUPR5yVSCUXUekEAkggA

@mhoemmen

mhoemmen commented Aug 31, 2022

Hi @jeffhammond ! Here is how I would rewrite that above godbolt example.

#include <algorithm>
#include <array>
#include <concepts>
#include <cstddef>
#include <execution>
#include <iostream>
#include <memory>
#include <ranges>
#include <tuple>
#include <type_traits>
#include <utility>

using double_and_int = std::tuple<double, int>;

template<class ElementType, std::integral Size, class ElementFunction>
requires(std::is_invocable_r_v<ElementType, ElementFunction, Size>)
void uninitialized_generate_n(ElementType raw_array[], const Size num_elements, ElementFunction&& f)
{
    auto range = std::views::iota(Size(0), num_elements);
    std::for_each(std::execution::par, std::ranges::begin(range), std::ranges::end(range),
      [raw_array, &f](const Size k) { std::construct_at(raw_array + k, f(k)); });
}

template<std::integral Size, class ElementFunction>
requires(std::is_invocable_v<ElementFunction, Size>)
std::unique_ptr< decltype(std::declval<ElementFunction>()(Size{}))[] >
create_array_and_generate_elements(const Size num_elements, ElementFunction&& f)
{
    using element_type = decltype(std::declval<ElementFunction>()(Size{}));
    auto arr = std::make_unique<element_type[]>(num_elements);
    uninitialized_generate_n(arr.get(), num_elements, std::forward<ElementFunction>(f));
    return arr;
}

int main()
{
    constexpr std::size_t N = 5;
    std::array<double, N> x{5.0, 1.0, 4.0, 2.0, 3.0};
    std::array<int, N> i{5, 1, 4, 2, 3};
    auto aos = create_array_and_generate_elements(N, [=](std::size_t k) { return std::tuple{x[k], i[k]}; });
    for(std::size_t k = 0; k < N; ++k) {
        std::cout << k << ": {" << std::get<0>(aos[k]) << ", " << std::get<1>(aos[k]) << "}\n";
    }
    return 0;
}

The parallel ranges experience is not very good, though: https://godbolt.org/z/1Gadz7M5r

MSVC claims that iota_view doesn't produce forward iterators -- a nicely readable error message! Clang 14.0.0 generates a lot of angry output. A custom counting iterator type could also work.

@jeffhammond
Author

jeffhammond commented Sep 1, 2022

Unfortunately resize will default construct items in memory sequentially, which will zero storage for types that are trivially constructible.

Sorry, but why is this bad? Because initialization faults pages and causes problems with NUMA (including GPU migratable memory)? Or because sequential initialization is slow?

Given that std::sort is O(N log N), how likely is it that zero initialization will be the performance bottleneck? If I'm going to have a sequential bottleneck, I'd much rather it look like memset than a loop over emplace_back, given that fast zeroing of memory is an idiom that can be, and is, optimized in hardware, whereas the latter surely is not.

And finally, while this pattern might be a bottleneck in RAJAPerf, how representative is it of LLNL applications to have allocation and initialization of a vector of pairs on the critical path?

@MrBurmark
Member

Those are both things that I would think of, but I don't have much experience with CPU threading, so I'll take your word for it that they're not that bad. We would probably run at least one MPI process per NUMA domain, so that may not be an issue anyway.

I don't know a lot about how hardware acceleration of zeroing works; does it still get accelerated if there is padding that isn't zeroed? How much does the factor of P affect the calculation of whether an extra O(N) matters or not?

I'm not aware of any places where sort pairs is a significant performance bottleneck for LLNL applications, so effort would probably be better spent elsewhere.

@MrBurmark
Member

@mhoemmen Do you know if uninitialized_generate was proposed for the STL?

I also planned to use std::make_unique<element_type[]> before I realized that it would generate a zeroing loop, at .L2 in the gcc output. What's your opinion on that extra zeroing?

@MrBurmark
Member

Actually, isn't it undefined behavior to construct_at objects into the memory allocated via std::make_unique<element_type[]>, because the memory is already initialized with an array of objects?

@mhoemmen

mhoemmen commented Sep 1, 2022

@MrBurmark wrote:

... because the memory is already initialized with the array of objects?

That's a good point, thank you! I had forgotten that make_unique value-initializes the elements of the array. make_unique_for_overwrite default-initializes the objects. Unfortunately, default-initializing a tuple<double,int> value-initializes the elements.

The fix is to use custom allocation and deallocation. tuple<double, int> is trivially destructible, because all of its template arguments are trivially destructible. Thus, we don't need to invoke ~tuple() on the elements of the array before freeing it.

#include <concepts>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <type_traits>
#include <utility>

template<class ElementType>
requires(std::is_trivially_destructible_v<ElementType>)
struct deleter {
  void operator()(ElementType* p) { std::free(p); }
};

template<class ElementType>
std::unique_ptr<ElementType[], deleter<ElementType>>
allocate_uninitialized_array(const std::size_t size)
{
  return {
    reinterpret_cast<ElementType*>(std::malloc(size * sizeof(ElementType))),
    deleter<ElementType>{}
  };
}

template<std::integral Size, class ElementFunction>
requires(std::is_invocable_v<ElementFunction, Size>)
auto create_array_and_generate_elements(const Size num_elements, ElementFunction&& f)
{
    using element_type = decltype(std::declval<ElementFunction>()(Size{}));
    auto arr = allocate_uninitialized_array<element_type>(num_elements);
    uninitialized_generate_n(arr.get(), num_elements, std::forward<ElementFunction>(f));
    return arr;
}
