
Implement seed finding Thunderdome in CUDA #410

Draft

stephenswat wants to merge 1 commit into main

Conversation

@stephenswat (Member) commented May 26, 2023

This draft pull request implements a Thunderdome-like fight to the death for seeds: "Two seeds enter, one seed leaves" . The idea is to allow seeds to "eat" other seeds, growing bigger themselves and eliminating other seeds from the seed list. This is designed to significantly reduce the combinatorics that the CKF has to deal with, thereby improving performance.
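To illustrate the idea in code, here is a toy sketch (the toy_seed type and the fixed two-spacepoint overlap rule are made up for illustration; this is not the PR's actual merging logic):

#include <array>
#include <cstddef>

// Toy stand-in for the PR's nseed type.
template <std::size_t N>
struct toy_seed {
    std::size_t size = 0;           // number of spacepoints in use
    std::array<unsigned, N> sps{};  // spacepoint identifiers
};

// "Two seeds enter, one seed leaves": if the last two spacepoints of a
// match the first two of b (e.g. ABC vs. BCD), a eats b and grows to ABCD.
template <std::size_t N>
bool try_merge(toy_seed<N>& a, const toy_seed<N>& b) {
    if (a.size < 2 || b.size != 3 || a.size >= N) {
        return false;
    }
    if (a.sps[a.size - 2] != b.sps[0] || a.sps[a.size - 1] != b.sps[1]) {
        return false;
    }
    a.sps[a.size++] = b.sps[2];  // b is eliminated from the seed list
    return true;
}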


To give you an idea of what this does, here is the output of the seeding example, without Thunderdoming, on ten ⟨µ⟩ = 300 events:

==> Seed finding efficiency ...
- Particle filter      : Charged with |η| ≤ 2.70 and pT ≥ 1.000 GeV
- Particle matcher     : Stepped with ≥ 60.0% similarity
- Total seeds          :  294740
- True seeds           :  208530
- False seeds          :   86210
- Total tracks         :   13076
- Matched tracks       :   11632
- Unmatched tracks     :    1444
- Precision            :  70.75%
- Fake rate            :  29.25%
- Recall/Efficiency    :  88.96%

A relatively conservative set of Thunderdome rules reduces the number of seeds for the CKF to deal with significantly:

==> Seed finding efficiency ...
- Particle filter      : Charged with |η| ≤ 2.70 and pT ≥ 1.000 GeV
- Particle matcher     : Stepped with ≥ 60.0% similarity
- Total seeds          :  227299
- True seeds           :  141617
- False seeds          :   85682
- Total tracks         :   13076
- Matched tracks       :   11629
- Unmatched tracks     :    1447
- Precision            :  62.30%
- Fake rate            :  37.70%
- Recall/Efficiency    :  88.93%

If we turn up the Mad Max dial, we can reduce the number of seeds far more:

==> Seed finding efficiency ...
- Particle filter      : Charged with |η| ≤ 2.70 and pT ≥ 1.000 GeV
- Particle matcher     : Stepped with ≥ 60.0% similarity
- Total seeds          :  101139
- True seeds           :   64250
- False seeds          :   36889
- Total tracks         :   13076
- Matched tracks       :   11528
- Unmatched tracks     :    1548
- Precision            :  63.53%
- Fake rate            :  36.47%
- Recall/Efficiency    :  88.16%

Hopefully this reduction in the number of seeds will reduce CKF runtime. Of course, we will need to tweak the rules to minimize the loss in physics performance, but I think this will be possible.

@stephenswat added the 'feature' (New feature or request) and 'cuda' (Changes related to CUDA) labels on May 26, 2023
@krasznaa (Member) left a comment:

It might not look like it from my comments, but I like this proposal. I believe it will prove very useful once we get to making CKF as fast as we can.

I'm just a stickler for some of the technicalities...

Comment on lines +61 to 63
public:
std::size_t _size;
std::array<link_type, N> _sps;
@krasznaa (Member):

If client code is meant to access these variables now, they should really not be prefixed with _.

@stephenswat (Member, Author):

Absolutely correct, I plan on renaming these and making some more changes to the nseed class during the development of this PR. Well spotted. 👍
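A possible cleaned-up shape for the class (a sketch only; the accessor names are invented, not the PR's final interface):

#include <array>
#include <cstddef>

template <typename link_type, std::size_t N>
class nseed {
public:
    std::size_t size() const { return m_size; }
    const link_type& sp(std::size_t i) const { return m_sps[i]; }

private:
    std::size_t m_size = 0;
    std::array<link_type, N> m_sps{};
};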


namespace traccc::cuda {

class seed_merging : public algorithm<std::pair<vecmem::unique_alloc_ptr<nseed<20>[]>, uint32_t>(const seed_collection_types::buffer&)> {
@krasznaa (Member) commented May 27, 2023:

Why nseed<20>? I thought we were only going to merge 2 seeds together in this algorithm. At least for now. So why not nseed<4>? 😕

At the same time the return type should really instead be vecmem::data::vector_buffer<nseed<4> >. Since that type is literally what you wrote here. (An array with a known size.)

@stephenswat (Member, Author) replied May 27, 2023:

Ah okay, sorry if I did not explain properly. The five-iteration loop you commented on below means that seeds can become at most five spacepoints longer than they are now. For example, seed ABC can be merged with BCD to form ABCD in step 1, then that can merge with BCDE in step 2, etc. So in principle the seed capacity needs to be 3 + n, where n is the number of Thunderdome rounds.

Comment on lines +11 to +18
__global__ void convert_to_nseeds(seed_collection_types::view vf, nseed<N> * out, unsigned long long * out_n) {
    seed_collection_types::device vfd(vf);

    // Grid-stride loop: each thread handles every (gridDim.x * blockDim.x)-th seed.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < vf.size(); i += gridDim.x * blockDim.x) {
        // Convert the i-th 3-spacepoint seed into an n-spacepoint seed.
        out[i] = vfd[i];
        // Count the emitted seeds.
        atomicAdd(out_n, 1ULL);
    }
}
@krasznaa (Member):

I don't understand this kernel... 😕

What's up with the atomicAdd(...)? We know exactly how many (3-spacepoint) seeds we start with. As long as the kernel code doesn't have a bug in it, that's exactly how large out_n is supposed to become.

At the same time, you'll need to put some amount of explanation here about that for-loop. I assume you're going for optimal cache line usage with it. Though on first look I don't understand why this access would yield better caching than just processing a single seed in every thread. 😕

@stephenswat (Member, Author):

You're absolutely correct that this doesn't make sense right now. What I want to do in the future is turn this kernel into a seed binning kernel that stores seeds into φ-bins to reduce the amount of work done. When I get that sorted I will need to do atomic space reservations. For now, with only one bin, this is indeed not really necessary.
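For context, the atomic space reservation pattern alluded to could look something like this sketch (bin_of_phi and the flat fixed-capacity bin layout are hypothetical):

// Hypothetical binning kernel: each thread copies its seeds into the
// output region of their phi-bin, reserving slots with atomicAdd.
template <std::size_t N>
__global__ void bin_nseeds(const nseed<N>* in, unsigned int n_in,
                           nseed<N>* out, unsigned long long* bin_fill,
                           unsigned int bin_capacity) {
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_in;
         i += gridDim.x * blockDim.x) {
        unsigned int bin = bin_of_phi(in[i]);  // hypothetical phi binning
        // Atomically reserve one slot in the target bin...
        unsigned long long pos = atomicAdd(&bin_fill[bin], 1ULL);
        // ...and write the seed into it.
        out[bin * bin_capacity + pos] = in[i];
    }
}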

seed_merging(const traccc::memory_resource& mr, stream& str);

output_type operator()(
const seed_collection_types::buffer&) const override;
@krasznaa (Member):

The input parameter needs to be seed_collection_types::const_view. The algorithm doesn't need to know whether it's a buffer or something else.

@stephenswat (Member, Author):

👍
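For reference, the agreed change would presumably look along these lines (a sketch of the interface only, not the actual patch):

class seed_merging
    : public algorithm<std::pair<vecmem::unique_alloc_ptr<nseed<20>[]>, uint32_t>(
          const seed_collection_types::const_view&)> {
public:
    output_type operator()(
        const seed_collection_types::const_view& seeds) const override;
};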

Comment on lines +120 to +122
vecmem::unique_alloc_ptr<nseed<20>[]>
arr1 = vecmem::make_unique_alloc<nseed<20>[]>(m_mr.main, 1000000),
arr2 = vecmem::make_unique_alloc<nseed<20>[]>(m_mr.main, 1000000);
@krasznaa (Member):

What's with the 1M fixed numbers? 😕 Just ask the input view how many elements it has. I.e.

const unsigned int nseeds = m_copy.get_size(input_view);

Assuming that you switch to using a view, and that you ask the user to provide a vecmem::copy object to the algorithm.

@stephenswat (Member, Author):

So the thing with this is that it is possible for the number of seeds to increase, at least in the early steps of the algorithm, so I need a little bit of buffer space for that. But you're right, I can use two times the number of initial seeds, for example.
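Concretely, the sizing could then be derived from the input rather than hard-coded (a sketch, assuming the algorithm holds a vecmem::copy object m_copy as suggested above):

// Size the scratch arrays from the input; seed counts can grow in the
// early rounds, so reserve headroom of twice the initial seed count.
const unsigned int n_seeds = m_copy.get_size(input_view);
vecmem::unique_alloc_ptr<nseed<20>[]>
    arr1 = vecmem::make_unique_alloc<nseed<20>[]>(m_mr.main, 2 * n_seeds),
    arr2 = vecmem::make_unique_alloc<nseed<20>[]>(m_mr.main, 2 * n_seeds);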

siz1 = vecmem::make_unique_alloc<unsigned long long>(m_mr.main),
siz2 = vecmem::make_unique_alloc<unsigned long long>(m_mr.main);

kernels::convert_to_nseeds<20><<<2048, 256>>>(i, arr1.get(), siz1.get());
@krasznaa (Member):

We should try to write new code with asynchronicity in mind. But this is just a note; I'm fine with this PR not caring about asynchronicity yet.

@stephenswat (Member, Author) replied May 27, 2023:

I'll make sure to add the appropriate synchronization points later. This algorithm is also a prime candidate for Dynamic Parallelism 2, which might reduce the number of synchronization points. 🙂
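For the record, a stream-aware version of such a launch could eventually look like this (a sketch; how the cudaStream_t is obtained from traccc's stream wrapper is glossed over here):

// Launch on the algorithm's stream instead of the default stream, and
// defer synchronization until the results are actually consumed.
kernels::convert_to_nseeds<20><<<2048, 256, 0, stream>>>(i, arr1.get(), siz1.get());
CUDA_ERROR_CHECK(cudaGetLastError());
// ... further asynchronous work on the same stream ...
CUDA_ERROR_CHECK(cudaStreamSynchronize(stream));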


std::cout << "Step 0 has " << rc << " seeds." << std::endl;

for (std::size_t i = 0; i < 5; ++i) {
@krasznaa (Member):

I imagine 5 is a number that you came up with after some testing. That is fine, but it has to be made a configurable property of the algorithm...

@stephenswat (Member, Author):

Absolutely.
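Something along these lines, presumably (a sketch; the struct and member names are invented):

struct seed_merging_config {
    // Number of Thunderdome rounds. Each round can grow a seed by one
    // spacepoint, so the nseed capacity must be at least 3 + max_rounds.
    std::size_t max_rounds = 5;
};

// The loop then becomes:
for (std::size_t i = 0; i < m_config.max_rounds; ++i) { /* ... */ }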


for (std::size_t i = 0; i < 5; ++i) {
CUDA_ERROR_CHECK(cudaMemset(siz2.get(), 0, sizeof(unsigned long long)));
kernels::merge_nseeds<20, 20><<<rc, 256>>>(arr1.get(), siz1.get(), arr2.get(), siz2.get());
@krasznaa (Member):

Curious. I thought the compiler would be able to deduce those template parameters automatically... 🤔

@stephenswat (Member, Author):

It probably can, actually! I added them explicitly for no real good reason, honestly.
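Indeed, assuming a signature along these lines, both template parameters are deducible from the pointer arguments (a sketch):

template <std::size_t N, std::size_t M>
__global__ void merge_nseeds(const nseed<N>* in, unsigned long long* n_in,
                             nseed<M>* out, unsigned long long* n_out);

// Explicit and deduced launches would then be equivalent:
kernels::merge_nseeds<20, 20><<<rc, 256>>>(arr1.get(), siz1.get(), arr2.get(), siz2.get());
kernels::merge_nseeds<<<rc, 256>>>(arr1.get(), siz1.get(), arr2.get(), siz2.get());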

Comment on lines +146 to +147
std::swap(arr1, arr2);
std::swap(siz1, siz2);
@krasznaa (Member):

I like this. 😄 You should've mentioned that you're taking cues from the discussion we had with Beomki about AdePT's memory handling. 😛

@stephenswat (Member, Author) replied May 27, 2023:

Not to toot my own horn but I've been on this train for a while. 👼

@stephenswat (Member, Author) commented:
Thanks a lot for taking a look, glad you like it! It's still very early in development, but you've raised some valid points.
