I have read the description in the 3.2.x README. Does the warp specialization currently still not support writing kernels similar to FlashAttention v3, due to the lack of a multi-level async task implementation (only one producer and one consumer; it cannot do task0 -> task1 -> task2)? I look forward to and appreciate your response.
The warp specialization support that comes with 3.2.x is underpinned by automatic task partition heuristics, which, for flash attention, will enable a cooperative partition scheme. This means either a one-producer-one-consumer mode or a one-producer-dual-consumer mode is supported. In the latter, the two consumer groups run exactly the same code but on different parts of the kernel input. This is similar to what FA3 has adopted.
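To make the cooperative scheme concrete, here is a minimal host-side toy model in plain Python (threads and queues standing in for warp groups and data channels — none of these names are Triton APIs): one producer streams input tiles, and two consumers run identical code on disjoint parts of the input.

```python
import queue
import threading

def run_cooperative(data, num_consumers=2):
    """Toy one-producer-dual-consumer cooperative partition.

    The producer models the load warp group; each consumer models an
    identical compute warp group working on its own slice of the input.
    Illustration only, not the actual Triton-generated structure.
    """
    channels = [queue.Queue(maxsize=2) for _ in range(num_consumers)]
    results = [0] * num_consumers

    def producer():
        # Round-robin stands in for splitting the kernel input
        # across the two consumer groups.
        for i, tile in enumerate(data):
            channels[i % num_consumers].put(tile)
        for ch in channels:
            ch.put(None)  # end-of-stream marker

    def consumer(rank):
        # Both consumers execute exactly the same code; only the
        # portion of input they receive differs.
        acc = 0
        while (tile := channels[rank].get()) is not None:
            acc += tile * tile  # placeholder for the real compute (e.g. MMA)
        results[rank] = acc

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer, args=(r,))
                for r in range(num_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

Running `run_cooperative(list(range(8)))` yields the same sum of squares a single consumer would produce, which is the point of the cooperative scheme: identical code, partitioned data.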
The task0 -> task1 -> task2 partition mode is not supported by the current automatic partition heuristics, though it is supported by the underlying code generation machinery (known as an arbitrary data channel, as opposed to the cooperative load-mma channel). We will be improving the automatic partition heuristics to include that partition mode, likely based on some latency modeling and analysis.
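For contrast, here is the same kind of toy model for the chained task0 -> task1 -> task2 mode: each stage has its own channel to the next, rather than the consumers being identical peers. Again, a hedged sketch in plain Python, not the Triton codegen API.

```python
import queue
import threading

def run_pipeline(data):
    """Toy three-stage chained pipeline (task0 -> task1 -> task2).

    Each queue models an arbitrary data channel between consecutive
    tasks; the stage bodies are placeholders for real kernel work.
    """
    q01 = queue.Queue(maxsize=2)  # channel from task0 to task1
    q12 = queue.Queue(maxsize=2)  # channel from task1 to task2
    out = []

    def task0():
        # e.g. async loads feeding the first compute stage
        for x in data:
            q01.put(x)
        q01.put(None)

    def task1():
        # e.g. first compute stage, forwarding to the next stage
        while (x := q01.get()) is not None:
            q12.put(x + 1)
        q12.put(None)

    def task2():
        # e.g. second compute stage / epilogue
        while (x := q12.get()) is not None:
            out.append(x * 2)

    threads = [threading.Thread(target=t) for t in (task0, task1, task2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

The structural difference from the cooperative example is that the stages run different code and are connected by distinct channels, which is why a separate partition heuristic is needed to generate it automatically.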