fzyzcjy
Contributor

@fzyzcjy fzyzcjy commented Oct 2, 2025

In internode_ll.cu, the logic is:

                lane_id == 0 ? atomic_add_release_global(atomic_finish_counter_per_expert + dst_expert_idx, 1) : 0;
...
                atomic_add_release_global(atomic_finish_counter_per_expert + i, FINISHED_SUM_TAG);
...
                atomic_add_release_global(atomic_finish_counter_per_expert + i, FINISHED_SUM_TAG - sum);
...
        while (ld_acquire_global(atomic_finish_counter_per_expert + responsible_expert_idx) != FINISHED_SUM_TAG * 2);

In my naive understanding, if the batch size exceeds 1024 (which is possible, since an extreme case can have a batch size of 1400), then after sending the first 1024 tokens the counter may already satisfy the final while condition (since FINISHED_SUM_TAG = 1024), so we send signals prematurely, causing bugs.
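To make the arithmetic concrete, here is a minimal host-side sketch of the counter value, not the actual kernel code. It assumes the three `atomic_add_release_global` calls quoted above are the only writers to one expert's counter, and that the `FINISHED_SUM_TAG - sum` adjustment can land after some per-token increments. The helpers `counter_value` and `wait_would_exit` and their parameters are hypothetical names introduced only for illustration.

```cpp
// Value quoted in the description above.
constexpr int FINISHED_SUM_TAG = 1024;

// Hypothetical model of one expert's finish counter.
//   sent            : per-token "+1" increments that have landed so far
//   total           : total tokens routed to this expert (the `sum` above)
//   tag_added       : whether the plain "+FINISHED_SUM_TAG" add has landed
//   count_tag_added : whether the "+FINISHED_SUM_TAG - sum" add has landed
constexpr int counter_value(int sent, int total, bool tag_added, bool count_tag_added) {
    return sent
         + (tag_added ? FINISHED_SUM_TAG : 0)
         + (count_tag_added ? FINISHED_SUM_TAG - total : 0);
}

// Mirrors the exit condition of the quoted spin loop.
constexpr bool wait_would_exit(int counter) {
    return counter == FINISHED_SUM_TAG * 2;
}

// All 1400 tokens sent and both tag adds landed: the intended exit.
static_assert(wait_would_exit(counter_value(1400, 1400, true, true)),
              "final state should release the waiter");

// Only 1024 of 1400 tokens sent, count-based add not yet landed:
// 1024 + 1024 == 2048, so the loop would exit with 376 tokens still pending.
static_assert(wait_would_exit(counter_value(1024, 1400, true, false)),
              "premature exit under the assumed interleaving");

// With at most 1024 tokens for the expert, a partially sent batch never hits the tag.
static_assert(!wait_would_exit(counter_value(512, 1024, true, false)),
              "no premature exit for small batches");
```

Under these assumptions, a premature exit needs more than `FINISHED_SUM_TAG` tokens routed to a single expert, which is why it would only show up at large batch sizes.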

@fzyzcjy fzyzcjy changed the title Avoid racing condition in large batch size Fix racing condition in large batch size Oct 2, 2025