WIP: FlashAttention for WebGPU EP #22919

sushraja-msft · 2024-11-21T19:14:21Z

WIP: Implementation of FlashAttention that works for MHA

Currently works only on machines where the subgroup size equals the tile size (Intel).
Works only when the new sequence length is 1.

The other scenarios require more debugging. The algorithm also needs optimization for the sequence-length-1 case, because workgroups are left unused in how ComputeDotProduct is invoked.
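
For readers unfamiliar with the technique, here is a minimal CPU sketch of the tiled, online-softmax attention that FlashAttention is built on, restricted to the new-sequence-length-1 case (a single query vector). It is not the WGSL shader from this PR: the function names, `tile_size` parameter, and test data are hypothetical, and the subgroup and workgroup mapping is not modeled at all.

```python
import math


def naive_attention(q, keys, values):
    """Reference: softmax(q . K^T / sqrt(d)) . V for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def tiled_attention(q, keys, values, tile_size=4):
    """Same result, computed one K/V tile at a time with a running softmax.

    tile_size is an illustrative parameter; in the PR the tile size is tied
    to the GPU subgroup size, which this sketch does not model.
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator, relative to running_max
    acc = [0.0] * dim            # unnormalized output, relative to running_max

    for start in range(0, len(keys), tile_size):
        k_tile = keys[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]

        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first tile this is exp(-inf) == 0.0, which is harmless.
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]

        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[i] for w, v in zip(weights, v_tile))
               for i, a in enumerate(acc)]
        running_max = new_max

    return [a / running_sum for a in acc]


# Small deterministic example: 10 cached keys/values, head size 8.
q = [0.1 * i for i in range(8)]
keys = [[math.sin(i + j) for j in range(8)] for i in range(10)]
values = [[math.cos(i * j) for j in range(8)] for i in range(10)]
out = tiled_attention(q, keys, values, tile_size=4)
```

The point of the running maximum and the correction factor is that each tile can be processed without ever materializing the full score vector, which is what lets a GPU implementation keep the working set in fast on-chip memory.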

sushraja-msft added 6 commits November 11, 2024 10:30

FA Base - Does Not Work

52656bf

The new Copy KV Cache works.

ed8bf5d

Add flash attention

75aa49d

Integrate FA

58157c5

Try fix the divide by zero issue

80296aa

FA works on Intel (TILE_SIZE == SUBGROUP_SIZE) for seq length of 1.

c281f84

sushraja-msft changed the title ~~User/sushraja/fa attempt2~~ WIP: FlashAttention for WebGPU EP Nov 21, 2024

Variable renames

de32d1f
