TL/UCP: Split single and multithreaded send/receive #1109

Merged
merged 7 commits into from
Apr 11, 2025

Conversation

Collaborator

@ikryukov ikryukov commented Apr 3, 2025

What

This change introduces two separate code paths for single-threaded and multi-threaded scenarios. During context creation, a specific set of functions and callbacks is selected based on the threading mode, avoiding branching in performance-critical (hot) paths. The single-threaded implementation avoids the use of atomics, which are unnecessary in that context and have been shown to impact performance on ARM systems (3–7% regression on small messages).
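The selection-at-creation idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: every name here (`sketch_task_t`, `sketch_ops_t`, `sketch_context_init`, and so on) is hypothetical, and the real TL/UCP structures and callbacks differ. The sketch shows a context that picks a table of function pointers once, so the hot path calls through the table without testing the threading mode, and the single-threaded variant uses a plain counter instead of an atomic.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical task counters; not the actual UCC task layout. */
typedef struct {
    uint32_t         completed_st; /* plain counter, single-threaded path */
    _Atomic uint32_t completed_mt; /* atomic counter, multi-threaded path */
} sketch_task_t;

/* Hypothetical table of hot-path operations chosen per context. */
typedef struct {
    void     (*on_complete)(sketch_task_t *task);
    uint32_t (*completed)(sketch_task_t *task);
} sketch_ops_t;

/* Single-threaded variants: ordinary loads and stores, no atomics. */
static void complete_st(sketch_task_t *t) { t->completed_st++; }
static uint32_t done_st(sketch_task_t *t) { return t->completed_st; }

/* Multi-threaded variants: C11 atomics (sequentially consistent by default). */
static void complete_mt(sketch_task_t *t)
{
    atomic_fetch_add(&t->completed_mt, 1);
}
static uint32_t done_mt(sketch_task_t *t)
{
    return atomic_load(&t->completed_mt);
}

typedef enum {
    SKETCH_THREAD_SINGLE,
    SKETCH_THREAD_MULTIPLE
} sketch_thread_mode_t;

typedef struct {
    sketch_ops_t ops;
} sketch_context_t;

/* The only branch on the threading mode happens here, at context creation. */
static void sketch_context_init(sketch_context_t *ctx,
                                sketch_thread_mode_t mode)
{
    if (mode == SKETCH_THREAD_MULTIPLE) {
        ctx->ops = (sketch_ops_t){ complete_mt, done_mt };
    } else {
        ctx->ops = (sketch_ops_t){ complete_st, done_st };
    }
}
```

A caller would initialize the context once with `sketch_context_init(&ctx, SKETCH_THREAD_SINGLE)` and then invoke `ctx.ops.on_complete(&task)` from completion callbacks. The trade-off in this sketch is an indirect call in place of a per-operation mode check or an unconditional atomic.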

Why?

Recent performance analysis has revealed regressions in TL/UCP, particularly noticeable on ARM platforms. The atomics introduced in #932 addressed correctness in multi-threaded environments but introduced overhead even in single-threaded use cases. This PR mitigates that overhead for single-threaded configurations. A future update will address multi-threaded performance by revising atomic operations to use appropriate memory models for ARM.

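OSU Allgather benchmark on AMD x86, run for 100k iterations to stabilize the small-message results. "Optimized" is the code from this PR; in the column headers, M is Master and O is Optimized. Positive percentages mean the first-named build has lower latency.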

Size(B)  Master(us)  1.3.2(us)  Optimized(us)  M vs 1.3.2(%)  O vs 1.3.2(%)  O vs M(%)
--------------------------------------------------------------------------------------
1             94.34      98.67          75.71          +4.58         +23.27     +19.75
2             83.13      80.98          76.41          -2.59          +5.65      +8.09
4             84.19      83.67          75.83          -0.61          +9.37      +9.92
8            645.81     663.58         634.07          +2.75          +4.45      +1.82
16           642.60     667.64         640.89          +3.90          +4.01      +0.27
32           664.89     745.26         667.67         +12.09         +10.41      -0.42

OSU Alltoall benchmark on Grace (ARM), 8 nodes, 1 process per node, CUDA memory, 100k iterations (before this change the difference was 3-7%):

Size      1.3.2(us)    Current(us)    % Difference
----------------------------------------------------
1         9.79         9.83         +0.41%
2         9.78         9.82         +0.41%
4         9.76         9.81         +0.51%
8         9.77         9.82         +0.51%
16        9.76         9.80         +0.41%
32        9.89         9.92         +0.30%

@ikryukov ikryukov force-pushed the ucp_atomics_rework branch from b55e4e2 to 49a001f Compare April 3, 2025 12:33
@ikryukov ikryukov self-assigned this Apr 3, 2025
@ikryukov ikryukov changed the title from "TL/UCP: Split single and multiplethreaded send/receive" to "TL/UCP: Split single and multithreaded send/receive" Apr 3, 2025
@ikryukov ikryukov marked this pull request as ready for review April 4, 2025 10:50
@ikryukov ikryukov force-pushed the ucp_atomics_rework branch from c0ea317 to f3ebc45 Compare April 8, 2025 10:02
@ikryukov ikryukov force-pushed the ucp_atomics_rework branch from 7048da1 to bded84f Compare April 8, 2025 14:12
Collaborator

@janjust janjust left a comment

Thanks!

Collaborator

@samnordmann samnordmann left a comment

LGTM

@janjust janjust force-pushed the ucp_atomics_rework branch from bded84f to 6607294 Compare April 9, 2025 14:18
Collaborator

janjust commented Apr 10, 2025

bot:retest

@Sergei-Lebedev Sergei-Lebedev merged commit f0384ad into openucx:master Apr 11, 2025
8 checks passed
MamziB pushed a commit to MamziB/ucc-forked that referenced this pull request Jul 9, 2025
* TL/UCP: completion callback st/mt

* TL/UCP: ucc_tl_ucp_send_nb callback

* TL/UCP: recv implementation

* TL/UCP: fix conflict

* TL/UCP: disable clang tidy error

* TL/UCP: non zero versions

* TL/UCP: rename and format
Labels
enhancement New feature or request