
Implement seed finding Thunderdome in CUDA #410

Draft

stephenswat wants to merge 1 commit into main

Conversation

@stephenswat (Member) commented May 26, 2023

This draft pull request implements a Thunderdome-like fight to the death for seeds: "Two seeds enter, one seed leaves" . The idea is to allow seeds to "eat" other seeds, growing bigger themselves and eliminating other seeds from the seed list. This is designed to significantly reduce the combinatorics that the CKF has to deal with, thereby improving performance.
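To illustrate the idea in code, here is a toy sketch (the toy_seed type and the fixed two-spacepoint overlap rule are made up for illustration; this is not the PR's actual merging logic):

#include <array>
#include <cstddef>

// Toy stand-in for the PR's nseed type.
template <std::size_t N>
struct toy_seed {
    std::size_t size = 0;           // number of spacepoints in use
    std::array<unsigned, N> sps{};  // spacepoint identifiers
};

// "Two seeds enter, one seed leaves": if the last two spacepoints of a
// match the first two of b (e.g. ABC vs. BCD), a eats b and grows to ABCD.
template <std::size_t N>
bool try_merge(toy_seed<N>& a, const toy_seed<N>& b) {
    if (a.size < 2 || b.size != 3 || a.size >= N) {
        return false;
    }
    if (a.sps[a.size - 2] != b.sps[0] || a.sps[a.size - 1] != b.sps[1]) {
        return false;
    }
    a.sps[a.size++] = b.sps[2];  // b is eliminated from the seed list
    return true;
}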


To give you an idea of what this does, here is the output of the seeding example, without Thunderdoming, on ten ⟨µ⟩ = 300 events:

==> Seed finding efficiency ...
- Particle filter      : Charged with |η| ≤ 2.70 and pT ≥ 1.000 GeV
- Particle matcher     : Stepped with ≥ 60.0% similarity
- Total seeds          :  294740
- True seeds           :  208530
- False seeds          :   86210
- Total tracks         :   13076
- Matched tracks       :   11632
- Unmatched tracks     :    1444
- Precision            :  70.75%
- Fake rate            :  29.25%
- Recall/Efficiency    :  88.96%

A relatively conservative set of Thunderdome rules reduces the number of seeds for the CKF to deal with significantly:

==> Seed finding efficiency ...
- Particle filter      : Charged with |η| ≤ 2.70 and pT ≥ 1.000 GeV
- Particle matcher     : Stepped with ≥ 60.0% similarity
- Total seeds          :  227299
- True seeds           :  141617
- False seeds          :   85682
- Total tracks         :   13076
- Matched tracks       :   11629
- Unmatched tracks     :    1447
- Precision            :  62.30%
- Fake rate            :  37.70%
- Recall/Efficiency    :  88.93%

If we turn up the Mad Max dial, we can reduce the number of seeds far more:

==> Seed finding efficiency ...
- Particle filter      : Charged with |η| ≤ 2.70 and pT ≥ 1.000 GeV
- Particle matcher     : Stepped with ≥ 60.0% similarity
- Total seeds          :  101139
- True seeds           :   64250
- False seeds          :   36889
- Total tracks         :   13076
- Matched tracks       :   11528
- Unmatched tracks     :    1548
- Precision            :  63.53%
- Fake rate            :  36.47%
- Recall/Efficiency    :  88.16%

Hopefully this reduction in the number of seeds will reduce CKF runtime. Of course, we will need to tweak the rules to minimize the loss in physics performance, but I think this will be possible.

@stephenswat added the 'feature' (New feature or request) and 'cuda' (Changes related to CUDA) labels on May 26, 2023
@krasznaa (Member) left a comment:

It might not look like it from my comments, but I like this proposal. I believe it will prove very useful once we get to making CKF as fast as we can.

I'm just a stickler for some of the technicalities...

Comment on lines +61 to 63
public:
std::size_t _size;
std::array<link_type, N> _sps;
@krasznaa (Member):

If client code is meant to access these variables now, they should really not be prefixed with _.

@stephenswat (Member, Author):

Absolutely correct, I plan on renaming these and making some more changes to the nseed class during the development of this PR. Well spotted. 👍
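A possible cleaned-up shape for the class (a sketch only; the accessor names are invented, not the PR's final interface):

#include <array>
#include <cstddef>

template <typename link_type, std::size_t N>
class nseed {
public:
    std::size_t size() const { return m_size; }
    const link_type& sp(std::size_t i) const { return m_sps[i]; }

private:
    std::size_t m_size = 0;
    std::array<link_type, N> m_sps{};
};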


namespace traccc::cuda {

class seed_merging : public algorithm<std::pair<vecmem::unique_alloc_ptr<nseed<20>[]>, uint32_t>(const seed_collection_types::buffer&)> {
@krasznaa (Member) commented May 27, 2023:

Why nseed<20>? I thought we were only going to merge 2 seeds together in this algorithm. At least for now. So why not nseed<4>? 😕

At the same time the return type should really instead be vecmem::data::vector_buffer<nseed<4> >. Since that type is literally what you wrote here. (An array with a known size.)

@stephenswat (Member, Author) replied May 27, 2023:

Ah okay, sorry if I did not explain properly. The five-iteration loop you commented on below means that seeds can become at most five spacepoints longer than they are now. For example, seed ABC can be merged with BCD to form ABCD in step 1, then that can merge with BCDE in step 2, etc. So in principle the seed capacity needs to be 3 + n, where n is the number of Thunderdome rounds.

Comment on lines +11 to +18
__global__ void convert_to_nseeds(seed_collection_types::view vf, nseed<N> * out, unsigned long long * out_n) {
    seed_collection_types::device vfd(vf);

    // Grid-stride loop: each thread handles every (gridDim.x * blockDim.x)-th seed.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < vf.size(); i += gridDim.x * blockDim.x) {
        // Convert the i-th 3-spacepoint seed into an n-spacepoint seed.
        out[i] = vfd[i];
        // Count the emitted seeds.
        atomicAdd(out_n, 1ULL);
    }
}
@krasznaa (Member):

I don't understand this kernel... 😕

What's up with the atomicAdd(...)? We know exactly how many (3-spacepoint) seeds we start with. As long as the kernel code doesn't have a bug in it, that's exactly how large out_n is supposed to become.

At the same time, you'll need to put some amount of explanation here about that for-loop. I assume you're going for optimal cache line usage with it. Though on first look I don't understand why this access would yield better caching than just processing a single seed in every thread. 😕

@stephenswat (Member, Author):

You're absolutely correct that this doesn't make sense right now. What I want to do in the future is turn this kernel into a seed binning kernel that stores seeds into φ-bins to reduce the amount of work done. When I get that sorted I will need to do atomic space reservations. For now, with only one bin, this is indeed not really necessary.
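For context, the atomic space reservation pattern alluded to could look something like this sketch (bin_of_phi and the flat fixed-capacity bin layout are hypothetical):

// Hypothetical binning kernel: each thread copies its seeds into the
// output region of their phi-bin, reserving slots with atomicAdd.
template <std::size_t N>
__global__ void bin_nseeds(const nseed<N>* in, unsigned int n_in,
                           nseed<N>* out, unsigned long long* bin_fill,
                           unsigned int bin_capacity) {
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_in;
         i += gridDim.x * blockDim.x) {
        unsigned int bin = bin_of_phi(in[i]);  // hypothetical phi binning
        // Atomically reserve one slot in the target bin...
        unsigned long long pos = atomicAdd(&bin_fill[bin], 1ULL);
        // ...and write the seed into it.
        out[bin * bin_capacity + pos] = in[i];
    }
}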

seed_merging(const traccc::memory_resource& mr, stream& str);

output_type operator()(
const seed_collection_types::buffer&) const override;
@krasznaa (Member):

The input parameter needs to be seed_collection_types::const_view. The algorithm doesn't need to know whether it's a buffer or something else.

@stephenswat (Member, Author):

👍
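For reference, the agreed change would presumably look along these lines (a sketch of the interface only, not the actual patch):

class seed_merging
    : public algorithm<std::pair<vecmem::unique_alloc_ptr<nseed<20>[]>, uint32_t>(
          const seed_collection_types::const_view&)> {
public:
    output_type operator()(
        const seed_collection_types::const_view& seeds) const override;
};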

Comment on lines +120 to +122
vecmem::unique_alloc_ptr<nseed<20>[]>
arr1 = vecmem::make_unique_alloc<nseed<20>[]>(m_mr.main, 1000000),
arr2 = vecmem::make_unique_alloc<nseed<20>[]>(m_mr.main, 1000000);
@krasznaa (Member):

What's with the 1M fixed numbers? 😕 Just ask the input view how many elements it has. I.e.

const unsigned int nseeds = m_copy.get_size(input_view);

Assuming that you switch to using a view, and that you ask the user to provide a vecmem::copy object to the algorithm.

@stephenswat (Member, Author):

So the thing with this is that it is possible for the number of seeds to increase, at least in the early steps of the algorithm, so I need a little bit of buffer space for that. But you're right, I can use two times the number of initial seeds, for example.
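Concretely, the sizing could then be derived from the input rather than hard-coded (a sketch, assuming the algorithm holds a vecmem::copy object m_copy as suggested above):

// Size the scratch arrays from the input; seed counts can grow in the
// early rounds, so reserve headroom of twice the initial seed count.
const unsigned int n_seeds = m_copy.get_size(input_view);
vecmem::unique_alloc_ptr<nseed<20>[]>
    arr1 = vecmem::make_unique_alloc<nseed<20>[]>(m_mr.main, 2 * n_seeds),
    arr2 = vecmem::make_unique_alloc<nseed<20>[]>(m_mr.main, 2 * n_seeds);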

siz1 = vecmem::make_unique_alloc<unsigned long long>(m_mr.main),
siz2 = vecmem::make_unique_alloc<unsigned long long>(m_mr.main);

kernels::convert_to_nseeds<20><<<2048, 256>>>(i, arr1.get(), siz1.get());
@krasznaa (Member):

We should try to write new code with asynchronicity in mind. But this is just a note; I'm fine with this PR not caring about asynchronicity yet.

@stephenswat (Member, Author) replied May 27, 2023:

I'll make sure to add the appropriate synchronization points later. This algorithm is also a prime candidate for Dynamic Parallelism 2, which might reduce the number of synchronization points. 🙂
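For the record, a stream-aware version of such a launch could eventually look like this (a sketch; how the cudaStream_t is obtained from traccc's stream wrapper is glossed over here):

// Launch on the algorithm's stream instead of the default stream, and
// defer synchronization until the results are actually consumed.
kernels::convert_to_nseeds<20><<<2048, 256, 0, stream>>>(i, arr1.get(), siz1.get());
CUDA_ERROR_CHECK(cudaGetLastError());
// ... further asynchronous work on the same stream ...
CUDA_ERROR_CHECK(cudaStreamSynchronize(stream));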


std::cout << "Step 0 has " << rc << " seeds." << std::endl;

for (std::size_t i = 0; i < 5; ++i) {
@krasznaa (Member):

I imagine 5 is a number that you came up with after some testing. That is fine, but it has to be made a configurable property of the algorithm...

@stephenswat (Member, Author):

Absolutely.
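Something along these lines, presumably (a sketch; the struct and member names are invented):

struct seed_merging_config {
    // Number of Thunderdome rounds. Each round can grow a seed by one
    // spacepoint, so the nseed capacity must be at least 3 + max_rounds.
    std::size_t max_rounds = 5;
};

// The loop then becomes:
for (std::size_t i = 0; i < m_config.max_rounds; ++i) { /* ... */ }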


for (std::size_t i = 0; i < 5; ++i) {
CUDA_ERROR_CHECK(cudaMemset(siz2.get(), 0, sizeof(unsigned long long)));
kernels::merge_nseeds<20, 20><<<rc, 256>>>(arr1.get(), siz1.get(), arr2.get(), siz2.get());
@krasznaa (Member):

Curious. I thought the compiler would be able to deduce those template parameters automatically... 🤔

@stephenswat (Member, Author):

It probably can, actually! I added them explicitly for no real good reason, honestly.
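Indeed, assuming a signature along these lines, both template parameters are deducible from the pointer arguments (a sketch):

template <std::size_t N, std::size_t M>
__global__ void merge_nseeds(const nseed<N>* in, unsigned long long* n_in,
                             nseed<M>* out, unsigned long long* n_out);

// Explicit and deduced launches would then be equivalent:
kernels::merge_nseeds<20, 20><<<rc, 256>>>(arr1.get(), siz1.get(), arr2.get(), siz2.get());
kernels::merge_nseeds<<<rc, 256>>>(arr1.get(), siz1.get(), arr2.get(), siz2.get());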

Comment on lines +146 to +147
std::swap(arr1, arr2);
std::swap(siz1, siz2);
@krasznaa (Member):

I like this. 😄 You should've mentioned that you're taking cues from the discussion we had with Beomki about AdePT's memory handling. 😛

@stephenswat (Member, Author) replied May 27, 2023:

Not to toot my own horn but I've been on this train for a while. 👼

@stephenswat (Member, Author) commented:
Thanks a lot for taking a look, glad you like it! It's still very early in development, but you've raised some valid points.
