Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question about flashdecoding with appendKV #1325

Open
DD-DuDa opened this issue Nov 10, 2024 · 1 comment
Open

Question about flashdecoding with appendKV #1325

DD-DuDa opened this issue Nov 10, 2024 · 1 comment

Comments

@DD-DuDa
Copy link

DD-DuDa commented Nov 10, 2024

Hi,
I am curious about why n_blocks_per_split is calculated using params.seqlen_k instead of actual_seqlen_k in the following code:

const int n_blocks_per_split = ((params.seqlen_k + kBlockN - 1) / kBlockN + num_n_splits - 1) / num_n_splits;

It seems to be wrong in some cases.

Considering:
seqlen_k = 1024;
seqlen_k_new = 1;
BlockN = 128;
num_split = 4;

the n_blocks_per_split would be equal to 2. And then n_block_max can only reach a maximum of 8 ((3 + 1) * 2) according to:

int n_block_max = std::min(cute::ceil_div(binfo.actual_seqlen_k, kBlockN), (n_split_idx + 1) * n_blocks_per_split);

If we attempt to append KV, n_block_copy_min is also equal to 8, which means there is no condition that allows gKNew to append to gK:

const int n_block_copy_min = std::max(n_block_min, binfo.seqlen_k_cache / kBlockN);

for (int n_block = n_block_max - 1; n_block >= n_block_copy_min; n_block--) {

Am I missing something here?

@SimpleTheoryOfTypes
Copy link

SimpleTheoryOfTypes commented Nov 20, 2024

The code is correct: seqlen_k is the total kv cache length. i think in your case, binfo.actual_seqlen_k should be strictly less than seqlen_k if there are new tokens to be appended.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants