zhoutianzi666

No description provided.

remove atomic_clean_flag in combine
@yuantailing
Contributor

yuantailing commented Sep 23, 2025

If atomic_clean_flag is removed from combine, the program may execute in the following order:

  1. rank 0 and rank 1 start combine functions
  2. rank 0 and rank 1 add rdma_recv_flag, respectively (L745-L749)
  3. rank 1 finishes the combine function
  4. rank 1 starts the next dispatch function and adds dispatch_rdma_recv_count_buffer
  5. rank 0 writes next_clean (aka dispatch_rdma_recv_count_buffer) to zero (L607-L608)

The data is corrupted because step 5 happens after step 4: rank 0's zeroing wipes out the count that rank 1 has just added.
With atomic_clean_flag, the order above is impossible because step 3 cannot happen before step 5.
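
The ordering above can be replayed in a small host-side model. This is only a sketch, not DeepEP code: the `Rank` struct, its fields, and the `try_send_flag`, `try_next_dispatch`, and `replay` helpers are hypothetical names, and the steps run sequentially in one thread instead of on two GPUs. It assumes that rank 1 can only leave combine after it has received rank 0's flag, and that the guard makes rank 0 hold back that flag until its own clean is done.

```cpp
#include <cassert>

// Hypothetical model of rank 0's state. Field names echo the discussion,
// they are not the real DeepEP data structures.
struct Rank {
    int recv_count = 3;       // stale value from an earlier round (stands in for next_clean)
    bool clean_done = false;  // stands in for atomic_clean_flag
    bool flag_sent = false;   // rank 0 has written rdma_recv_flag to its peer
};

// Step 5: the rank zeroes its own next_clean buffer.
void clean(Rank& r) {
    r.recv_count = 0;
    r.clean_done = true;
}

// Step 2: send the combine flag to the peer.
// When guarded, the send is refused until the local clean has finished.
bool try_send_flag(Rank& self, bool guarded) {
    if (guarded && !self.clean_done) return false;
    self.flag_sent = true;
    return true;
}

// Steps 3-4: the peer finishes combine only after receiving the owner's flag,
// then its next dispatch adds a token count to the owner's buffer.
bool try_next_dispatch(Rank& owner, int tokens) {
    if (!owner.flag_sent) return false;
    owner.recv_count += tokens;
    return true;
}

// Replays the interleaving and returns the count rank 0 sees afterwards.
int replay(bool guarded) {
    Rank rank0;
    const int tokens = 5;
    try_send_flag(rank0, guarded);                   // rank 0 tries to signal before cleaning
    bool early = try_next_dispatch(rank0, tokens);   // rank 1 tries to race ahead
    clean(rank0);                                    // rank 0 zeroes next_clean
    if (!early) {
        try_send_flag(rank0, guarded);               // now allowed: clean is done
        try_next_dispatch(rank0, tokens);            // rank 1's add lands after the clean
    }
    assert(rank0.clean_done);
    return rank0.recv_count;
}
```

In this model `replay(false)` returns 0, because the add from the next dispatch is erased by the late clean, while `replay(true)` returns 5, because the flag is withheld until the clean has completed.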

@alpha-baby
Contributor

There is a detail here: until a rank writes to another rank's dispatch_rdma_recv_count_buffer, that other rank cannot enter the next round of a DeepEP kernel (dispatch or combine). Therefore, a rank must make sure its next_clean buffer is cleared before it writes to anyone else's buffer.

The aim is that the current rank cleans up its next_clean buffer while the other ranks write their own next_clean buffers concurrently.

@zhoutianzi666
Author

zhoutianzi666 commented Sep 25, 2025

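Thanks, everyone, I understand now: atomic_clean_flag was introduced to guarantee that next_clean has been cleared on every GPU before anyone starts the all2all. Otherwise the next round of dispatch would be affected 😄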

3 participants