Conversation

@AntonOresten
Contributor

We encountered NaNs when kpad_mask had a lot of zeros. The solution in this PR was somewhat vibe-coded, so may or may not be the simplest solution, but it seems to work.

@AntonOresten
Contributor Author

AntonOresten commented Nov 8, 2025

Not sure about this one, tbh. Preferably, there'd be a sequence-lengths vector of length B passed instead, so we could skip trailing tiles entirely. That's fairly straightforward for keys; not sure about queries.
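To make the idea concrete, here's a minimal sketch of what I mean (names like `key_lens` are hypothetical, not the NNop API): derive per-batch key lengths from `kpad_mask` so the kernel could skip fully padded trailing key tiles.

```julia
# Hypothetical sketch, assuming kpad_mask is an L×B Bool matrix with
# `true` marking valid key positions and padding only at the tail.
kpad_mask = [trues(32); falses(32);;]        # L = 64, B = 1
key_lens = vec(sum(Int, kpad_mask; dims=1))  # per-batch valid key count: [32]

# Number of key tiles a kernel would actually have to visit per batch element:
groupsize = 32
tiles = cld.(key_lens, groupsize)            # [1] instead of cld(64, 32) = 2
```

A fully masked tile then never gets touched, so the masked-softmax path never has to handle an all-`-Inf` row for trailing padding.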

@AntonOresten AntonOresten marked this pull request as draft November 8, 2025 11:04
@AntonOresten
Contributor Author

Just for the record: we saw NaNs again, specifically on a branch without these changes.

@pxl-th
Member

pxl-th commented Dec 4, 2025

Can you share an MWE?

@AntonOresten
Contributor Author

AntonOresten commented Dec 4, 2025

julia> @eval NNop begin
       function flash_attention_groupsize(::Type{T}; emb_dim::Int, target_shmem::UInt64) where T
           # TODO
           # - return `qk_fp16` to configure kernel
           # - optional qk_fp16
           # qk_fp16s = (false, true)
           # TODO prefer bigger groupsize?
           qk_fp16s = (true,)
           for qk_fp16 in qk_fp16s, groupsize in (256, 128, 64, 32, 16)
               shmem = flash_attention_shmem_bwd(T; emb_dim, groupsize, qk_fp16)
           shmem ≤ target_shmem && begin @show groupsize; return groupsize end
           end
           error("Failed to find groupsize for Flash Attention that satisfies Shared Memory constraint.")
       end
       end
flash_attention_groupsize (generic function with 1 method)
julia> begin
           H, L = 64, 64
           pad = 32
           x = CUDA.rand(H, L, 1, 1);
           kpad_mask = CuArray([trues(L-pad); falses(pad);;]);
           any(isnan, NNop.flash_attention(x, x, x; causal=false, kpad_mask))
       end
groupsize = 32
true

julia> begin
           H, L = 64, 64
           pad = 31
           x = CUDA.rand(H, L, 1, 1);
           kpad_mask = CuArray([trues(L-pad); falses(pad);;]);
           any(isnan, NNop.flash_attention(x, x, x; causal=false, kpad_mask))
       end
groupsize = 32
false

NaNs also appear when L is not a multiple of groupsize: with L = 65, both pad = 0 and pad = 1 give NaNs.
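My working theory for the failure mode (an assumption on my part, not confirmed by tracing the kernel): when every key position covered by a tile is masked out, the row max of the scores is `-Inf`, and the online-softmax rescaling computes `exp(-Inf - (-Inf)) = exp(NaN)`, which propagates. A plain-Julia sketch of that arithmetic:

```julia
# Sketch of the suspected NaN source: a key tile whose positions are all
# masked to -Inf before the softmax.
scores = fill(-Inf32, 4)   # fully masked key tile
m = maximum(scores)        # -Inf
p = exp.(scores .- m)      # -Inf - (-Inf) = NaN, so exp gives NaN
all(isnan, p)              # true
```

That would also explain the L = 65 case: the last tile covers positions 65..groupsize-boundary, and with pad = 0 or 1 its in-bounds portion is tiny, so out-of-bounds masking can leave the tile effectively all `-Inf`.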
