fix(python/sglang): FlashInfer kv_indptr issue #2
Closed
hxkwan wants to merge 1 commit into OpenBMB:minicpm_sala from
FlashInfer kv_indptr buffer pollution causes a crash during CUDA Graph replay
1. Symptom

The model is deployed with `--dense-as-sparse --attention-backend minicpm_flashinfer`. After inference has been running for a while, the Scheduler throws an exception during the CUDA Graph replay phase.

FlashInfer requires the `kv_indptr` array to be monotonically non-decreasing (it serves as the index-pointer array of a CSR layout), but the values actually passed in were 36606 at position 8 and 34891 at position 9: a decrease of 1715, violating the constraint.
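This is the check that fails. A minimal, hedged reproduction of the monotonicity validation, using the two values reported in the crash (entries 0 through 7 are made up for illustration):

```python
import numpy as np

# indptr as seen at crash time: entries 0..7 are hypothetical, but
# positions 8 and 9 hold the values reported in the error (36606, 34891)
kv_indptr = np.array([0, 4096, 8192, 12288, 16384, 20480, 24576, 28672,
                      36606, 34891])

diffs = np.diff(kv_indptr)
print(diffs.min())               # -1715: the decrease between positions 8 and 9
print(bool(np.all(diffs >= 0)))  # False -> a CSR indptr must be non-decreasing
```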
2. Root cause

2.1 Background: shared-buffer design

In `init_cuda_graph_state()`, the system precomputes a `flashinfer_kv_indptr` buffer whose values are `[0, K, 2K, 3K, ..., (max_bs*2)*K]` (K = `num_sparse_topk_tokens`), stored in `self.decode_cuda_graph_metadata["flashinfer_kv_indptr"]`. During CUDA Graph capture, FlashInfer's `BatchDecodeWithPagedKVCacheWrapper` receives a slice of this buffer via the `paged_kv_indptr_buffer=kv_indptr_view` argument and keeps it as its internal buffer `wrapper._paged_kv_indptr_buf`. The two therefore point to the same GPU memory.
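This aliasing is the crux of the bug: the wrapper's internal buffer is a view of the precomputed tensor, not a copy. A minimal sketch with NumPy arrays standing in for CUDA tensors (torch slices alias the same way); the names are illustrative:

```python
import numpy as np

K = 4096     # stand-in for num_sparse_topk_tokens
max_bs = 4

# precomputed arithmetic indptr: [0, K, 2K, ...]
flashinfer_kv_indptr = np.arange(0, max_bs * 2 + 1) * K

# the wrapper keeps a slice -- a *view* of the same memory, not a copy
paged_kv_indptr_buf = flashinfer_kv_indptr[: max_bs + 1]

# a later kernel overwrites the view with request-specific values...
paged_kv_indptr_buf[:] = [0, 100, 250, 260, 900]

# ...and the "precomputed" buffer is silently polluted as well
print(flashinfer_kv_indptr[:5].tolist())  # [0, 100, 250, 260, 900]
```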
2.2 Pollution chain

The concrete code path:

1. replay_prepare (`init_forward_metadata_replay_cuda_graph`, line ~1774): calls `wrapper.begin_forward(kv_indptr_view, ...)`, which reads the precomputed buffer.
2. CUDA Graph replay (`FlashInferKernel.forward`, line ~344): recomputes `kv_indptr` from each request's actual sparse page table. The overwritten values depend on the requests' actual token distributions and are no longer the arithmetic sequence `[0, K, 2K, ...]`.
3. When the next replay_prepare reads the same buffer, it sees the values dynamically overwritten in the previous round. These values may no longer be monotonically non-decreasing, so validation in FlashInfer's `plan()` fails.
2.3 Why the error appears only "after running for a while"

The first `begin_forward()` calls pass validation because the buffer still holds the precomputed values. During each replay, `convert_sparse_page_table_to_flashinfer()` pollutes the buffer, and only the *next* `begin_forward()` sees the polluted values. The crash triggers as soon as the top-k page distributions of different requests happen to produce a non-monotonic cumsum, which depends on the workload.
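To see how two replays with different batch sizes can leave a decreasing indptr behind, here is a hedged simulation (NumPy stand-in; `overwrite` is a hypothetical stand-in for the effect `convert_sparse_page_table_to_flashinfer()` has on the shared buffer):

```python
import numpy as np

buf = np.arange(0, 11) * 4096   # precomputed shared indptr buffer

def overwrite(kv_lens):
    # stand-in for convert_sparse_page_table_to_flashinfer(): writes the
    # cumsum of this round's actual per-request kv lengths into the buffer
    buf[: len(kv_lens) + 1] = np.concatenate(([0], np.cumsum(kv_lens)))

overwrite([4000] * 9)   # round 1: 9 requests -> buf[9] = 36000
overwrite([5000] * 8)   # round 2: 8 requests -> buf[8] = 40000, buf[9] stale

# the next begin_forward() now sees buf[8] > buf[9]: validation fails
print(bool(np.all(np.diff(buf) >= 0)))  # False
```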
3. Fix

Core idea: keep a read-only copy of the precomputed values (`flashinfer_kv_indptr_original`) and restore the buffer from that copy before every `wrapper.begin_forward()` call.

Modified file: `python/sglang/srt/layers/attention/minicpm_backend.py` (3 changes).

Change details:
Change 1: `init_cuda_graph_state()` — save a read-only copy

Add a `.clone()` of the precomputed buffer. The clone is not referenced by any wrapper, so it cannot be overwritten by `convert_sparse_page_table_to_flashinfer()` inside the CUDA Graph.
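A minimal sketch of the save-and-restore pattern, with NumPy's `.copy()` standing in for torch's `.clone()` on a CUDA tensor; the dictionary keys follow the PR description, everything else is illustrative:

```python
import numpy as np

K = 4096
metadata = {}
# shared precomputed buffer that the wrappers alias
metadata["flashinfer_kv_indptr"] = np.arange(0, 11) * K

# Change 1: an unaliased read-only copy that no wrapper can overwrite
metadata["flashinfer_kv_indptr_original"] = metadata["flashinfer_kv_indptr"].copy()

# ...a kernel later pollutes the shared buffer in place...
metadata["flashinfer_kv_indptr"][:3] = [0, 9000, 5000]

# restore from the pristine copy before the next begin_forward()
metadata["flashinfer_kv_indptr"][:] = metadata["flashinfer_kv_indptr_original"]
print(metadata["flashinfer_kv_indptr"][:3].tolist())  # [0, 4096, 8192]
```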
Change 2: `init_forward_metadata_capture_cuda_graph()` — restore before capture

CUDA Graphs for multiple batch sizes are captured in sequence, and the forward pass of an earlier capture pollutes the buffer, affecting later captures. Restoring before each capture guarantees that every capture starts from the clean precomputed values.
Change 3: `init_forward_metadata_replay_cuda_graph()` — restore before every replay

After the restore, the padding region is filled with `kv_indptr_view[sparse_real_bs]` (i.e. `sparse_real_bs * K`, the correct value after restoring) rather than `kv_indptr_view[-1]` (which, without the restore, may be a polluted value).
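The padding detail of change 3 can be sketched as follows (NumPy stand-in for CUDA tensors; `sparse_real_bs` and `kv_indptr_view` follow the PR's naming, the sizes are made up):

```python
import numpy as np

K = 4096
max_bs = 10
original = np.arange(0, max_bs + 1) * K   # pristine precomputed copy
kv_indptr_view = original.copy()          # the wrapper-aliased buffer

kv_indptr_view[:] = 0                     # polluted by a previous replay
kv_indptr_view[:] = original              # restore before begin_forward()

sparse_real_bs = 3
# pad the unused tail with the last real entry, kv_indptr_view[sparse_real_bs]
# (= sparse_real_bs * K after the restore), NOT the possibly polluted [-1]
kv_indptr_view[sparse_real_bs + 1:] = kv_indptr_view[sparse_real_bs]
print(kv_indptr_view.tolist())
```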