thd qkv-format in transformer layer #1383

robot-transformer · 2024-12-22T21:28:44Z

Hello!
I'm currently trying to rewrite my pipeline with TE (Transformer Engine). I pack (merge) multiple sequences together for LM training, and as far as I know I should use the "thd" format for that.
I see that the MultiheadAttention class (from here) doesn't support this format (there is no mention of "thd" in the args annotation), but DotProductAttention seems to support "thd".
When I pass qkv_format = "thd" to the transformer layer, it looks like the only reason it doesn't work is that MultiheadAttention would need to pass cu_seqlens through to DotProductAttention. Am I correct about that? Thanks.
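
To make the question concrete, here is a minimal sketch of what I mean by merged sequences and cu_seqlens. It is plain Python and does not call TE; the helper names `build_cu_seqlens` and `pack_sequences` are hypothetical, made up for illustration. My assumption is that in a real pipeline the boundaries would be an integer tensor handed to DotProductAttention (I believe via its `cu_seqlens_q` / `cu_seqlens_kv` arguments).

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Cumulative sequence boundaries: [0, l0, l0 + l1, ...]."""
    return [0] + list(accumulate(seq_lens))


def pack_sequences(seqs):
    """Concatenate variable-length sequences along the token axis (the "t" in "thd")."""
    packed = [tok for seq in seqs for tok in seq]
    cu_seqlens = build_cu_seqlens([len(seq) for seq in seqs])
    return packed, cu_seqlens


# Three toy sequences of lengths 3, 2 and 4 (token ids are arbitrary).
seqs = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
packed, cu_seqlens = pack_sequences(seqs)

# Sequence i occupies packed[cu_seqlens[i]:cu_seqlens[i + 1]].
second = packed[cu_seqlens[1]:cu_seqlens[2]]
```

Without these boundaries the attention kernel cannot tell where one merged sequence ends and the next begins, which is why I think they have to be forwarded.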
