-
Notifications
You must be signed in to change notification settings - Fork 116
TL/UCP: Split single and multithreaded send/receive #1109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
b55e4e2
to
49a001f
Compare
c0ea317
to
f3ebc45
Compare
Sergei-Lebedev
approved these changes
Apr 8, 2025
7048da1
to
bded84f
Compare
janjust
approved these changes
Apr 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
samnordmann
approved these changes
Apr 9, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
bded84f
to
6607294
Compare
bot:retest |
MamziB
pushed a commit
to MamziB/ucc-forked
that referenced
this pull request
Jul 9, 2025
* TL/UCP: completion callback st/mt * TL/UCP: ucc_tl_ucp_send_nb callback * TL/UCP: recv implementation * TL/UCP: fix conflict * TL/UCP: disable clang tidy error * TL/UCP: non zero versions * TL/UCP: rename and format
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What
This change introduces two separate code paths for single-threaded and multi-threaded scenarios. During context creation, a specific set of functions and callbacks is selected based on the threading mode, avoiding branching in performance-critical (hot) paths. The single-threaded implementation avoids the use of atomics, which are unnecessary in that context and have been shown to impact performance on ARM systems (3–7% regression on small messages).
Why ?
Recent performance analysis has revealed regressions in TL/UCP, particularly noticeable on ARM platforms. The atomics introduced in #932 addressed correctness in multi-threaded environments but introduced overhead even in single-threaded use cases. This PR mitigates that overhead for single-threaded configurations. A future update will address multi-threaded performance by revising atomic operations to use appropriate memory models for ARM.
OSU Allgather benchmark on AMD x86, 100k iterations for better stability in results on small messages, Optimized - code from this PR.
Grace (arm) 8 nodes 1ppn OSU alltoall cuda memory 100k iterations (before diff was 3-7%):