@@ -620,9 +620,13 @@ def get_kv_cache_shape(
"""
Calculate kv cache shape for MLA
"""
layer_id = self.layer_id
layer_id = getattr(self, "layer_id", None)
value_cache_shape = []
if self.window_attn_skip_freq is not None and self.window_attn_skip_freq[layer_id] == 1:
if (
layer_id is not None

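🔴 Bug: When layer_id is None, skipping the window_attn_skip_freq check outright downgrades the MLA cache shape of models that contain SWA/window layers to the plain MLA shape.

PrefixCacheManager._get_kv_cache_shape constructs a temporary backend and immediately calls get_kv_cache_shape; such instances have no layer_id. The V1 CacheController.initialize_kv_cache also calls create_kv_cache before the old path sets attn_backend.layer_id layer by layer. The current branch returns the plain shape based on kv_lora_rank + qk_rope_head_dim, but for layers where window_attn_skip_freq[i] == 1 the runner uses kv_lora_rank + 4 * (kv_lora_rank // 128) + 2 * qk_rope_head_dim and attaches/creates the cache as uint8. So although this PR avoids the AttributeError, it allocates or passes an undersized IPC/cache shape for window layers, and the subsequent attach or the attention kernel may hit a shape mismatch or out-of-bounds reads/writes.

Suggested fix:
Calls without a layer_id should not default to the plain MLA shape. Either have the cache manager/cache controller set or pass layer_id per layer before the call and fetch the shape layer by layer; or, if the interface must return a single shape, return a conservative maximum shape that covers all layers when layer_id is None and any(window_attn_skip_freq), and keep the dtype/IPC allocation policy in sync.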
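A minimal sketch of the second option, assuming the plain branch uses the same [max_num_blocks, 1, block_size, dim] layout as the window branch. MLAShapeConfig and key_cache_shape are hypothetical stand-ins for the backend and its method, not the project's actual API. The sketch compares last-dimension element counts only; the dtype difference (uint8 for window layers) and the IPC allocation policy would still need to be handled separately.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class MLAShapeConfig:
    """Hypothetical stand-in for the backend attributes read by get_kv_cache_shape."""
    kv_lora_rank: int
    qk_rope_head_dim: int
    block_size: int
    window_attn_skip_freq: Optional[Sequence[int]] = None


def plain_key_dim(cfg: MLAShapeConfig) -> int:
    # Plain MLA last dimension, as described in the review comment.
    return cfg.kv_lora_rank + cfg.qk_rope_head_dim


def window_key_dim(cfg: MLAShapeConfig) -> int:
    # Last dimension used for layers where window_attn_skip_freq[i] == 1.
    return cfg.kv_lora_rank + 4 * (cfg.kv_lora_rank // 128) + 2 * cfg.qk_rope_head_dim


def key_cache_shape(
    cfg: MLAShapeConfig, max_num_blocks: int, layer_id: Optional[int] = None
) -> List[int]:
    freq = cfg.window_attn_skip_freq
    if freq is None:
        # No window layers configured: plain shape for every layer.
        dim = plain_key_dim(cfg)
    elif layer_id is None:
        # Layer unknown: fall back to the largest dimension any layer may need.
        has_window = any(f == 1 for f in freq)
        dim = max(plain_key_dim(cfg), window_key_dim(cfg)) if has_window else plain_key_dim(cfg)
    else:
        # Layer known: pick the exact per-layer dimension.
        dim = window_key_dim(cfg) if freq[layer_id] == 1 else plain_key_dim(cfg)
    return [max_num_blocks, 1, cfg.block_size, dim]
```

With this shape of fallback, a caller that has no layer_id never receives a last dimension smaller than what a window layer requires, while per-layer callers still get the exact value.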

and self.window_attn_skip_freq is not None
and self.window_attn_skip_freq[layer_id] == 1
):
Comment on lines +625 to +629
fp8_key_cahe_dim = self.kv_lora_rank + 4 * (self.kv_lora_rank // 128) + 2 * self.qk_rope_head_dim
key_cache_shape = [max_num_blocks, 1, self.block_size, fp8_key_cahe_dim]
else: