Fix racing condition in large batch size #440

fzyzcjy · 2025-10-02T11:43:47Z

In internode_ll.cu, the logic is:

    lane_id == 0 ? atomic_add_release_global(atomic_finish_counter_per_expert + dst_expert_idx, 1) : 0;
    ...
    atomic_add_release_global(atomic_finish_counter_per_expert + i, FINISHED_SUM_TAG);
    ...
    atomic_add_release_global(atomic_finish_counter_per_expert + i, FINISHED_SUM_TAG - sum);
    ...
    while (ld_acquire_global(atomic_finish_counter_per_expert + responsible_expert_idx) != FINISHED_SUM_TAG * 2);

In my naive understanding, if the batch size exceeds 1024 (which is possible, since an extreme case can have a batch size of 1400), then after sending the first 1024 tokens the counter may already satisfy the final while condition (since FINISHED_SUM_TAG = 1024), so we send the signals prematurely, causing bugs.
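The arithmetic behind this concern can be sketched on the host. The model below is not the kernel code. It assumes one particular interleaving (the `FINISHED_SUM_TAG` add lands first, the per-token increments follow, and the `FINISHED_SUM_TAG - sum` correction lands last), and `kFinishedSumTag`, `final_counter`, and `first_premature_release` are hypothetical names used only for illustration.

```cpp
// Host-side model of the per-expert finish counter (illustration only).
constexpr int kFinishedSumTag = 1024;  // value quoted in the description

// Counter value once every add has landed: sum + tag + (tag - sum) == 2 * tag.
constexpr int final_counter(int num_tokens, int tag) {
    return num_tokens + tag + (tag - num_tokens);
}

// Number of token sends after which the counter first equals 2 * tag while
// tokens are still outstanding, or -1 if that never happens.
// Assumed interleaving: the +tag add lands before any per-token +1 add,
// and the (tag - sum) correction has not landed yet.
constexpr int first_premature_release(int num_tokens, int tag) {
    int counter = tag;  // models the add of FINISHED_SUM_TAG
    for (int sent = 1; sent <= num_tokens; ++sent) {
        counter += 1;  // models the per-token add of 1
        if (counter == 2 * tag && sent < num_tokens) return sent;
    }
    return -1;
}

// The final value is always 2 * tag, regardless of the token count.
static_assert(final_counter(1400, kFinishedSumTag) == 2 * kFinishedSumTag, "");
// With at most `tag` tokens, the wait condition cannot be met early.
static_assert(first_premature_release(1000, kFinishedSumTag) == -1, "");
// With 1400 tokens, the wait condition is already met after 1024 sends.
static_assert(first_premature_release(1400, kFinishedSumTag) == 1024, "");
```

Under this model, the waiting loop can only exit early when the per-expert token count exceeds the tag, which is consistent with the batch size 1400 case described above.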

Update configs.cuh

f00ce65

fzyzcjy changed the title ~~Avoid racing condition in large batch size~~ Fix racing condition in large batch size Oct 2, 2025

fzyzcjy mentioned this pull request Oct 2, 2025

Allow larger max num dispatch tokens per rank for DeepEP sgl-project/sglang#11168

Open

4 tasks
