The code is correct: seqlen_k is the total KV cache length. I think in your case, binfo.actual_seqlen_k should be strictly less than seqlen_k if there are new tokens to be appended.
Hi,
I am curious about why n_blocks_per_split is calculated using params.seqlen_k instead of actual_seqlen_k in the following code:
flash-attention/csrc/flash_attn/src/flash_fwd_kernel.h, line 525 in b443207
It seems to be wrong in some cases.
Considering:
seqlen_k = 1024;
seqlen_k_new = 1;
BlockN = 128;
num_split = 4;
n_blocks_per_split would be equal to 2. n_block_max can then only reach a maximum of 8 ((3 + 1) * 2), according to:
flash-attention/csrc/flash_attn/src/flash_fwd_kernel.h, line 529 in b443207
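To make the arithmetic above easy to check, here is a minimal host-side sketch. It assumes the kernel uses ceiling division for both the block count and the per-split count, and it ignores local attention. The names split_range, SplitRange and ceil_div are hypothetical helpers for this illustration, not functions from flash_fwd_kernel.h.

```cpp
#include <algorithm>
#include <cassert>

// Integer ceiling division for positive operands.
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Block range handled by one split (hypothetical helper type).
struct SplitRange {
    int n_blocks_per_split;
    int n_block_min;
    int n_block_max;
};

// Assumed reconstruction of the split arithmetic discussed in this issue.
// seqlen_k:        value of params.seqlen_k
// actual_seqlen_k: cached length plus any new tokens
SplitRange split_range(int seqlen_k, int actual_seqlen_k,
                       int block_n, int num_splits, int split_idx) {
    assert(block_n > 0 && num_splits > 0);
    // Number of K blocks assigned to each split, derived from seqlen_k.
    const int n_blocks_per_split =
        ceil_div(ceil_div(seqlen_k, block_n), num_splits);
    const int n_block_min = split_idx * n_blocks_per_split;
    // The upper bound is capped by the split boundary, not only by
    // the actual sequence length.
    const int n_block_max =
        std::min(ceil_div(actual_seqlen_k, block_n),
                 (split_idx + 1) * n_blocks_per_split);
    return {n_blocks_per_split, n_block_min, n_block_max};
}
```

With seqlen_k = 1024, one new token (so an actual length of 1025), BlockN = 128 and 4 splits, the last split (index 3) gets n_blocks_per_split = 2 and n_block_max = 8, even though 1025 tokens would need 9 blocks.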
If we attempt to append KV, n_block_copy_min is also equal to 8, which means there is no condition that allows gKNew to be appended to gK:
flash-attention/csrc/flash_attn/src/flash_fwd_kernel.h, line 727 in b443207
flash-attention/csrc/flash_attn/src/flash_fwd_kernel.h, line 730 in b443207
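The append case can be checked the same way. The sketch below is an assumption-laden model, not the kernel code: it assumes actual_seqlen_k is the cached length plus the new length, that n_block_copy_min is the larger of n_block_min and the cached length divided by BlockN (rounded down), and that the copy loop walks from n_block_max - 1 down to n_block_copy_min. The names append_blocks and div_up are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Integer ceiling division for positive operands.
constexpr int div_up(int a, int b) { return (a + b - 1) / b; }

// Returns the K block indices that the (assumed) append loop would visit
// for one split, in the order the loop visits them.
// seqlen_k:       value of params.seqlen_k
// seqlen_k_cache: tokens already stored in the cache
// seqlen_k_new:   tokens to append
std::vector<int> append_blocks(int seqlen_k, int seqlen_k_cache,
                               int seqlen_k_new, int block_n,
                               int num_splits, int split_idx) {
    assert(block_n > 0 && num_splits > 0);
    const int actual_seqlen_k = seqlen_k_cache + seqlen_k_new;
    const int n_blocks_per_split =
        div_up(div_up(seqlen_k, block_n), num_splits);
    const int n_block_min = split_idx * n_blocks_per_split;
    const int n_block_max =
        std::min(div_up(actual_seqlen_k, block_n),
                 (split_idx + 1) * n_blocks_per_split);
    // First block that can contain newly appended tokens.
    const int n_block_copy_min =
        std::max(n_block_min, seqlen_k_cache / block_n);
    std::vector<int> visited;
    for (int n_block = n_block_max - 1; n_block >= n_block_copy_min; --n_block) {
        visited.push_back(n_block);
    }
    return visited;
}
```

Under these assumptions, append_blocks(1024, 1024, 1, 128, 4, 3) visits no block, which is the situation described above. If seqlen_k is instead read as the total cache capacity, so that the cache holds 1023 tokens before the append, append_blocks(1024, 1023, 1, 128, 4, 3) visits block 7 and the new token has a block to land in.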
Am I missing something here?