[core][gpu-objects] Driver should order all collective calls to avoid deadlock #51264

Open
kevin85421 opened this issue Mar 11, 2025 · 0 comments
Assignees
Labels
core Issues that should be addressed in Ray Core enhancement Request for new feature and/or capability gpu-objects P0 Issues that should be fixed in short order

kevin85421 commented Mar 11, 2025

Description

Similar to compiled graphs, the driver should order all collective calls to avoid deadlocks.

Example 1:

  • Avoid passing tensors within the same actor using NCCL. Instead, we should access the in-actor store directly.
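A minimal sketch of the rule in Example 1, assuming the driver knows which actor produces and which actor consumes each tensor. The function name `choose_transport` and the string labels are hypothetical and are not part of Ray's API; they only illustrate the decision.

```python
def choose_transport(src_actor_id: str, dst_actor_id: str) -> str:
    """Pick how a tensor should move from its producer to its consumer.

    Hypothetical helper: the names and return values are illustrative only.
    """
    if src_actor_id == dst_actor_id:
        # Producer and consumer are the same actor: read the tensor from
        # the in-actor store instead of issuing an NCCL send/recv pair.
        return "in_actor_store"
    # Different actors: the tensor has to cross processes, so use NCCL.
    return "nccl"
```

For example, `choose_transport("actor_1", "actor_1")` selects the in-actor store, while `choose_transport("actor_1", "actor_2")` selects NCCL.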

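Example 2: Both actors are single-threaded and synchronous, and each runs its tasks in the order listed below. t1_1 is the input for t2_2 and t1_2 is the input for t2_1, and both transfers use NCCL. Actor 1 sends the output of t1_1 first, so Actor 2 should issue the NCCL recv for t2_2 before the recv for t2_1; otherwise the two actors can block on each other and deadlock.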

Actor 1: t1_1, t1_2
Actor 2: t2_1, t2_2

Note: Check whether this ordering still works when only one CUDA stream is available.
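The ordering in Example 2 can be sketched in plain Python. This is a simplified model, not Ray or NCCL code: it assumes each transfer is a blocking rendezvous between one producer task and one consumer task, and that each actor issues its operations strictly in sequence. The helpers `order_recvs` and `would_deadlock` are hypothetical names introduced for illustration.

```python
from typing import Dict, List


def order_recvs(send_order: List[str], edges: Dict[str, str]) -> List[str]:
    """Return consumer tasks in the order their recvs should be issued.

    send_order: producer tasks in the order the sending actor runs them.
    edges: maps each producer task to the consumer task that reads its output.
    """
    # The i-th recv must match the i-th send, so follow the sender's order.
    return [edges[task] for task in send_order if task in edges]


def would_deadlock(
    send_order: List[str], recv_order: List[str], edges: Dict[str, str]
) -> bool:
    """Under the blocking-rendezvous assumption, any mismatch between the
    sender's sequence and the receiver's sequence leaves both actors waiting."""
    return order_recvs(send_order, edges) != recv_order


# Example 2 from the issue: t1_1 -> t2_2 and t1_2 -> t2_1.
edges = {"t1_1": "t2_2", "t1_2": "t2_1"}
actor1_order = ["t1_1", "t1_2"]

safe_recv_order = order_recvs(actor1_order, edges)
print(safe_recv_order)
print(would_deadlock(actor1_order, ["t2_1", "t2_2"], edges))
```

In this model the driver derives the receive order from the sender's task order, which is why the recv for t2_2 comes before the recv for t2_1 even though t2_1 is listed first on Actor 2.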

Use case

No response

@kevin85421 kevin85421 added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) P0 Issues that should be fixed in short order gpu-objects and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Mar 11, 2025
@kevin85421 kevin85421 self-assigned this Mar 11, 2025
@kevin85421 kevin85421 added the core Issues that should be addressed in Ray Core label Mar 11, 2025