
Can the warp specialization feature in 3.2.x support kernels like FlashAttention v3? #5860

Closed
hiworldwzj opened this issue Feb 8, 2025 · 3 comments

Comments

@hiworldwzj

I have read the description in the 3.2.x README. Does warp specialization currently still not support writing kernels similar to FlashAttention v3, due to the lack of a multi-level async task implementation (only one producer and one consumer; task0 -> task1 -> task2 is not possible)? I look forward to and appreciate your response.

@hiworldwzj
Author

@htyu Can you answer this question? thanks.

@htyu
Collaborator

htyu commented Feb 8, 2025

The warp specialization support that comes with 3.2.x is underpinned by automatic task-partition heuristics which, for flash attention, enable a cooperative partition scheme. This means either the one-producer-one-consumer mode or the one-producer-dual-consumer mode is supported. In the latter, the two consumer groups run exactly the same code but on different parts of the kernel input. This is similar to what FA3 has adopted.

The task0 -> task1 -> task2 partition mode is not supported by the current automatic partition heuristics, though it is supported by the underlying code generation machinery (known as the arbitrary data channel, as opposed to the cooperative load-mma channel). We will be improving the automatic partition heuristics to include that partition mode, likely based on some latency modeling and analysis.
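
For illustration, here is a minimal sketch of how a kernel might opt into the cooperative scheme through the autotune config. The `num_consumer_groups` / `num_buffers_warp_spec` knob names follow my reading of the 3.2.x release notes and should be treated as assumptions (check the README for the exact spelling, and note that this path targets Hopper GPUs); `row_sum_kernel` is just a hypothetical stand-in for a real load/compute loop, not FA3.

```python
import torch
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        # num_consumer_groups=2 requests the one-producer-dual-consumer
        # (cooperative) scheme; num_consumer_groups=1 would be the
        # one-producer-one-consumer mode. These kwarg names are an assumption
        # based on the 3.2.x release notes -- verify them against your build.
        triton.Config({"BLOCK_K": 128}, num_warps=4, num_stages=2,
                      num_consumer_groups=2, num_buffers_warp_spec=3),
    ],
    key=["K"],
)
@triton.jit
def row_sum_kernel(X, Out, K, stride_xm, BLOCK_K: tl.constexpr):
    # One program per row: loop over the row in BLOCK_K-sized tiles.
    # Under warp specialization, the tl.load in the loop is handled by the
    # producer task and the accumulation by the consumer task(s).
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_K,), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        mask = (k + offs) < K
        x = tl.load(X + row * stride_xm + k + offs, mask=mask, other=0.0)
        acc += x
    tl.store(Out + row, tl.sum(acc, axis=0))


def row_sum(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty(x.shape[0], device=x.device, dtype=torch.float32)
    grid = (x.shape[0],)
    row_sum_kernel[grid](x, out, x.shape[1], x.stride(0))
    return out
```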

@hiworldwzj
Author

@htyu thanks very much.
