does SORTPAIRS need to use reserve+emplace_back? #262

Open
jeffhammond opened this issue Aug 31, 2022 · 7 comments


@jeffhammond

I am working on the C++17 parallel version of SORTPAIRS, and I cannot use emplace_back there. I find that I can switch from reserve to resize and use simple assignment instead of emplace_back, but I don't know if this is consistent with the intent.

Unfortunately, I don't see any non-RAJA parallel references to guide me here. The OMP and CUDA parallel versions just call the RAJA::sort_pairs method.

If it is allowed to use resize+(assign) instead, I'll make that change in the sequential implementation, independent of what I'm doing with C++ parallelism.

Thanks!

old

        vector_of_pairs.reserve(iend-ibegin);

        std::for_each( //std::execution::par, // parallelism leads to incorrectness
                       begin,end,
                       [=,&vector_of_pairs](Index_type iemp) noexcept {
          vector_of_pairs.emplace_back(x[iend*irep + iemp], i[iend*irep + iemp]);
        });

new

        vector_of_pairs.resize(iend-ibegin);

        std::for_each( std::execution::par,
                       begin,end,
                       [=,&vector_of_pairs](Index_type iemp) noexcept {
          vector_of_pairs[iemp] = std::make_pair(x[iend*irep + iemp], i[iend*irep + iemp]);
        });
@MrBurmark
Member

You are right, emplace_back is not parallel safe. Unfortunately resize will default construct items in memory sequentially, which will zero storage for types that are trivially constructible. We really want to do something like uninitialized_fill with std::execution::par, except that we need to generate the elements based on their index. I don't see a way to fill a vector in that way with a parallel policy. We may have to allocate uninitialized storage then use for_each with std::execution::par to initialize the memory with placement new/std::construct_at.

https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGIAMykrgAyeAyYAHI%2BAEaYxBLSAA6oCoRODB7evgGkyamOAqHhUSyx8VK2mPYFDEIETMQEmT5%2BgXaYDul1DQRFkTFxCbb1jc3ZbSO9Yf2lg1IAlLaoXsTI7BwmGgCC5v5hyN5YANQm/m7hAO6n2Js7ZnsMB17Hp24EAJ6JmAD6BMRMhAU11uu32h0wJzObBYJHewO2oMe4Mhbi8jlohDh/huCPuYOeENeYmAJEICBY8LuDyeLzOmFUHTR6Upty8qSMR0SAOIkIAIkcFAR0CAQFy8MRXuhltF6KQjmECMD/FYEdtBcQvA4jlRiJg9RKEQB2FVbI5mo4AN1QeHQR1QX3%2BRGIEHmECtNoAVJz5kc0AxBSdjQKhSKdXqIIl5qcLIHeSDDXHlSDtkw0ahtahUBAFV7YsAwnKc0dXHKpV4ZRC/YKvapC4JfQJq/Ko0aTebg8KQF4GHgAI5eH6JP6vMVkbW6/XXS0dJ3fVBUb6joHGwVMRzIb6iQUj7ke64QQWdlhiWioZBQVwAWjzYXmHtSAC9MPOI9z5u%2BTAno8nTebD6GSG%2BTAmGQBAIA7EV6UZGoRS5MhbnbRCkPbG8GCCBh0FIBDkJw80TAAVisfxeSwswADYLRnQD50XbkgXw3ls3rPBMBYSMjgYVB6TWIdAzbZD/xQRs/k1AhvjXN0qOIOcFyXAA6YBMAIF1rBYti5VUAiLDUxICJI%2BUtJ0vSoyTbZ20/XkTP4s1sLNXUCBWBhpwcajZLo78jV5DhFloTh8N4PwOC0UhUE4NxrGsAVllWQl7h4UgCE0bzFgAawCAAOOSAE5JC4Q0crIrgNH8Q1Cv0ThJACpKQs4XgFBADQEqSxY4FgGBECEti6DichKDQLr6HiYAuDMMw%2BDoAg4nqiBomq6Iwgad5OHi/q2EEAB5BhaCWoLeCwY8jHEbg9vFGc8Eo%2BrdqCBlkDRdZgoVKplt4DFon%2BYh3g8LBqr%2BPAWGexYqAMYAFAANRYi51q%2BQL4v4QQRDEdgKjh%2BQlDUardC4fRDGMCLLH0PBonqyBFntGpLsvQ9Tl5UxLGsMwNCOS91rMJngDZ4B0GiOqqjO5wIFcMY/CxkJphKMo9DyNIBCFyWUmlhg%2BnFuZKmqLpJllrH2k6ARukaJWBnKYYek1439bFw2JEWBRorWPQ/j1AHyo4fzSEC4LQo4VR0rIy8yMkI5gGQZAjhGuTWYgcK6fxo5cEIEgTjiuUPAGuJE/8Lh5l4RLdvfUg0v8TLDXS/wNENQ1JEkfCsqkSQyOdyrSH%2Bw0zDkkv8MkLLi/8Duu5Lt3qs9uqGqa3PSFajrlgIRI0V6iB%2BsSbriAiVh1m933/d9AwOS4fw5I0OT4swfAnRtPQUYR8RkdkRQVHUK7MdIC5/kSJ3fJdqqrs99a0Rngg7SoEcdefsA5BxDmHCOKdF6DXTpnbOzVFgIGAlgeILoG68H%2BmYQ0h9/b4Q0GYLguD8GEMNAPL%2BtVbAjxzloPOaUsFyRbpIdKLcND4TImYLKrCyL4Wdv4T%2BHsKHUO8uPdqE8IBIAXkvOekjBooG3sNSQGhGo0FoJNYg01ZpXXmswD6z1SCrUYAQTa21qr7RxkdYK%2BBdSdAutVKCt1Jp6Meu/YKr13qfQwPdbOxA/pOyBkwEG4NMCQ2hnoi%2Bogr7SBRrfdGD8QBjXkSgPGNhXrEzQWTdIFMqbEVplYSwDMmYszZhzLmPM1b80Fp4FoehRbFEtljKWNRTaNPSAbWYRttY1D1k0Kp2Qta8x1rUSYbSJZaw1r04WZsph1PaVbJYKw7ZYwdusHgPk/L8N4J7YBm8Dg41DpIfe%2B8jiR2SXKOOTpYHJ1QKnHkuwzBZ1HjQ1KAQ97%2BDee8j5Hz0FNxAIaPe5ES5ZXSgC/wQKsFkIERwYejUhEtVEUgKef9pHXOgXEFebBODbIDrsneByD7BWPvHHxwosbhMRkMaJaN77BV0GNZ%2BTBX7HTWR/CFmzOA/2nmiABQCfYgK3nsqQhzGaRxRUvdO9z4G50QcgwYaD36N0wWRBhGcyIaDyiqtVLd0qspqlCyhMKEH5xeXJT5pr3m8I2bqyVTznZmEtUPR5yVSCUXUekEAkggA

@mhoemmen

mhoemmen commented Aug 31, 2022

Hi @jeffhammond ! Here is how I would rewrite that above godbolt example.

#include <algorithm>
#include <array>
#include <concepts>
#include <cstddef>
#include <execution>
#include <iostream>
#include <memory>
#include <ranges>
#include <tuple>
#include <type_traits>
#include <utility>

using double_and_int = std::tuple<double, int>;

template<class ElementType, std::integral Size, class ElementFunction>
requires(std::is_invocable_r_v<ElementType, ElementFunction, Size>)
void uninitialized_generate_n(ElementType raw_array[], const Size num_elements, ElementFunction&& f)
{
    auto range = std::views::iota(Size(0), num_elements);
    std::for_each(std::execution::par, std::ranges::begin(range), std::ranges::end(range),
      [raw_array, &f](const Size k) { std::construct_at(raw_array + k, f(k)); });
}

template<std::integral Size, class ElementFunction>
requires(std::is_invocable_v<ElementFunction, Size>)
std::unique_ptr< decltype(std::declval<ElementFunction>()(Size{}))[] >
create_array_and_generate_elements(const Size num_elements, ElementFunction&& f)
{
    using element_type = decltype(std::declval<ElementFunction>()(Size{}));
    auto arr = std::make_unique<element_type[]>(num_elements);
    uninitialized_generate_n(arr.get(), num_elements, std::forward<ElementFunction>(f));
    return arr;
}

int main()
{
    constexpr std::size_t N = 5;
    std::array<double, N> x{5.0, 1.0, 4.0, 2.0, 3.0};
    std::array<int, N> i{5, 1, 4, 2, 3};
    auto aos = create_array_and_generate_elements(N, [=](std::size_t k) { return std::tuple{x[k], i[k]}; });
    for(std::size_t k = 0; k < N; ++k) {
        std::cout << k << ": {" << std::get<0>(aos[k]) << ", " << std::get<1>(aos[k]) << "}\n";
    }
    return 0;
}

The parallel ranges experience is not very good, though: https://godbolt.org/z/1Gadz7M5r

MSVC claims that iota_view doesn't produce forward iterators -- a nicely readable error message! Clang 14.0.0 generates a lot of angry output. A custom counting iterator type could also work.

@jeffhammond
Author

jeffhammond commented Sep 1, 2022

Unfortunately resize will default construct items in memory sequentially, which will zero storage for types that are trivially constructible.

Sorry, but why is this bad? Because initialization faults pages and causes problems with NUMA (including GPU migratable memory)? Or because sequential initialization is slow?

Given that std::sort is O(N log N), how likely is it that zero initialization will be the performance bottleneck? If I'm going to have a sequential bottleneck, I'd much rather it look like memset than a loop over emplace_back, given that fast zeroing of memory is an idiom that can be, and is, optimized in hardware, whereas the latter surely is not.

And finally, while this pattern might be a bottleneck in RAJAPerf, how representative is it of LLNL applications to have allocation and initialization of a vector of pairs on the critical path?

@MrBurmark
Member

Those are both things that I would think of, but I don't have much experience with CPU threading, so I'll take your word for it that they're not that bad. We would probably run at least one MPI process per NUMA domain, so that may not be an issue anyway.

I don't know a lot about how hardware acceleration of zeroing works; does it still get accelerated if there is padding that isn't zeroed? How much does the factor of P affect the calculation of whether an extra O(N) matters or not?

I'm not aware of any places where sort pairs is a significant performance bottleneck for LLNL applications, so effort would probably be better spent elsewhere.

@MrBurmark
Member

@mhoemmen Do you know if uninitialized_generate was proposed for the STL?

I also planned to use std::make_unique<element_type[]> before I realized that it would generate a zeroing loop, at .L2 in the gcc output. What's your opinion on that extra zeroing?

@MrBurmark
Member

Actually, isn't it undefined behavior to construct_at objects into the memory allocated via std::make_unique<element_type[]>, because the memory is already initialized with an array of objects?

@mhoemmen

mhoemmen commented Sep 1, 2022

@MrBurmark wrote:

... because the memory is already initialized with the array of objects?

That's a good point, thank you! I had forgotten that make_unique value-initializes the elements of the array. make_unique_for_overwrite default-initializes the objects. Unfortunately, default-initializing a tuple<double,int> value-initializes the elements.

The fix is to use custom allocation and deallocation. tuple<double, int> is trivially destructible, because all of its template arguments are trivially destructible. Thus, we don't need to invoke ~tuple() on the elements of the array before freeing it.

#include <concepts>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <type_traits>
#include <utility>

template<class ElementType>
requires(std::is_trivially_destructible_v<ElementType>)
struct deleter {
  void operator()(ElementType* p) { std::free(p); }
};

template<class ElementType>
std::unique_ptr<ElementType[], deleter<ElementType>>
allocate_uninitialized_array(const std::size_t size)
{
  return {
    reinterpret_cast<ElementType*>(std::malloc(size * sizeof(ElementType))),
    deleter<ElementType>{}
  };
}

template<std::integral Size, class ElementFunction>
requires(std::is_invocable_v<ElementFunction, Size>)
auto create_array_and_generate_elements(const Size num_elements, ElementFunction&& f)
{
    using element_type = decltype(std::declval<ElementFunction>()(Size{}));
    auto arr = allocate_uninitialized_array<element_type>(num_elements);
    uninitialized_generate_n(arr.get(), num_elements, std::forward<ElementFunction>(f));
    return arr;
}
