flexattn with qwen2 #81

NonvolatileMemory · 2024-11-18T13:07:51Z

It seems FlexAttention does not support num_heads=28?

drisspg · 2024-11-18T22:07:05Z

Do you have a repro? I just tried this and it appears to work for me. Notably, I'm on a nightly build of PyTorch:

import torch

from torch.nn.attention.flex_attention import flex_attention, create_block_mask


def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx


b, h, s, d = 1, 28, 256, 64
tens = torch.rand(b, h, s, d, device="cuda")

flex = torch.compile(flex_attention)

bm = create_block_mask(causal_mask, None, None, s, s)

print(flex(tens, tens, tens, block_mask=bm))

NonvolatileMemory · 2024-11-20T06:27:10Z

Hi!

Here is my code:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Assumed definition: flex_attn is not shown in the original snippet.
flex_attn = torch.compile(flex_attention)


def block_mask(b, h, q_idx, kv_idx):
    # A query attends only to keys in strictly earlier blocks of 4 tokens.
    q_block = q_idx // 4
    kv_block = kv_idx // 4
    return q_block > kv_block


def diff(bsz=4, seq_len=1024, d_head=128, num_heads=28, block_size=4):
    # GQA shapes: 28 query heads share 4 KV heads.
    # The .cuda() calls were commented out in the original post.
    Q = torch.randn(bsz, num_heads, seq_len, d_head).cuda()
    K = torch.randn(bsz, 4, seq_len, d_head).cuda()
    V = torch.randn(bsz, 4, seq_len, d_head).cuda()

    # torch_attn reference path, disabled (torch_mask is not defined in this snippet):
    # scores = torch.matmul(Q, K.permute(0, 1, 3, 2)) / (Q.size(-1) ** 0.5)
    # q_idx = torch.arange(seq_len).view(-1, 1)
    # kv_idx = torch.arange(seq_len).view(1, -1)
    # mask = torch_mask(q_idx, kv_idx, block_size)[None, None, :, :].cuda()
    # scores = scores.masked_fill(~mask, float('-inf'))
    # attn_weights = F.softmax(scores, dim=-1)
    # torch_out = torch.matmul(attn_weights, V)

    sub_block_mask = create_block_mask(block_mask, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len, _compile=True)
    flex_out = flex_attn(Q, K, V, block_mask=sub_block_mask, enable_gqa=True)
    return flex_out
    # return (flex_out[:, :, 16:] - torch_out[:, :, 16:]).max()
```
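For comparison, the disabled eager path in the snippet above can be sketched in plain PyTorch on CPU. This is a minimal sketch, not the poster's `torch_mask` helper (which is not shown): it assumes that with `enable_gqa=True` each group of `num_heads // num_kv_heads` consecutive query heads shares one KV head, so K and V are expanded with `repeat_interleave` before the matmul. The name `reference_gqa_attention` and its parameters are hypothetical.

```python
import torch
import torch.nn.functional as F


def reference_gqa_attention(Q, K, V, block_size=4):
    """Eager GQA attention with a strictly block-causal mask (q_block > kv_block)."""
    num_heads, num_kv_heads = Q.size(1), K.size(1)
    assert num_heads % num_kv_heads == 0, "query heads must be a multiple of KV heads"
    group = num_heads // num_kv_heads  # e.g. 28 // 4 = 7

    # Assumption: query head i uses KV head i // group.
    K = K.repeat_interleave(group, dim=1)
    V = V.repeat_interleave(group, dim=1)

    scores = torch.matmul(Q, K.transpose(-2, -1)) / (Q.size(-1) ** 0.5)

    seq_len = Q.size(2)
    q_idx = torch.arange(seq_len, device=Q.device).view(-1, 1)
    kv_idx = torch.arange(seq_len, device=Q.device).view(1, -1)
    mask = (q_idx // block_size) > (kv_idx // block_size)

    scores = scores.masked_fill(~mask, float("-inf"))
    # Rows in the first block have no visible keys, so softmax yields NaN there.
    attn_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attn_weights, V)


# Small shapes so the check runs quickly on CPU.
Q = torch.randn(2, 28, 16, 8)
K = torch.randn(2, 4, 16, 8)
V = torch.randn(2, 4, 16, 8)
out = reference_gqa_attention(Q, K, V, block_size=4)
print(out.shape)
```

The first `block_size` query rows are fully masked and come out as NaN in this eager version, which is presumably why the commented-out comparison skips the leading rows.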

NonvolatileMemory · 2024-11-20T06:28:02Z

> Do you have a repro? I just tried this and it appears to work for me. Notably, I'm on a nightly build of PyTorch:
>
> import torch
>
> from torch.nn.attention.flex_attention import flex_attention, create_block_mask
>
>
> def causal_mask(b, h, q_idx, kv_idx):
>     return q_idx >= kv_idx
>
>
> b, h, s, d = 1, 28, 256, 64
> tens = torch.rand(b, h, s, d, device="cuda")
>
> flex = torch.compile(flex_attention)
>
> bm = create_block_mask(causal_mask, None, None, s, s)
>
> print(flex(tens, tens, tens, block_mask=bm))

Maybe it's because I am using torch 2.5.0 instead of a nightly build?

drisspg · 2024-11-20T17:16:37Z

Yeah, potentially. Would you mind trying nightly?
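To confirm which kind of build is installed before retrying, a small check on the version string can help. This is a sketch under one assumption: nightly wheels carry a `.dev<date>` segment in `torch.__version__`. The helper name `is_nightly_build` and the example version strings are illustrative only.

```python
def is_nightly_build(version: str) -> bool:
    """Return True if a PyTorch version string looks like a nightly (dev) build."""
    # Assumption: nightly wheels carry a ".dev<date>" segment,
    # e.g. "2.6.0.dev20241118+cu124"; stable releases do not.
    return ".dev" in version


print(is_nightly_build("2.5.0"))
print(is_nightly_build("2.6.0.dev20241118+cu124"))

# In a real environment:
# import torch
# print(torch.__version__, is_nightly_build(torch.__version__))
```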
