
Conversation

chang-wenbin
Collaborator

  1. Support cache initialization for the MLA backend to rationalize KV cache GPU memory allocation: the block num increases from 1500 to 4500 and concurrency from 45 to 145.
  2. Fix a bug in the v1 scheduler that allowed the number of activated tokens to exceed max_num_batched_tokens (see the sketch after this list).
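For illustration, here is a minimal sketch of the scheduling guard described in item 2. The names Request and schedule are hypothetical and are not the FastDeploy v1 scheduler API; the sketch only shows the intended invariant, that the total number of activated tokens in one step never exceeds max_num_batched_tokens.

from dataclasses import dataclass

@dataclass
class Request:
    request_id: str
    num_tokens: int  # tokens this request would activate in the next step

def schedule(waiting: list[Request], max_num_batched_tokens: int) -> list[Request]:
    """Admit requests until the token budget for this step is exhausted."""
    scheduled: list[Request] = []
    budget = max_num_batched_tokens
    for req in waiting:
        if req.num_tokens > budget:
            # Without this check the activated-token count could exceed
            # max_num_batched_tokens, which is the bug described above.
            break
        scheduled.append(req)
        budget -= req.num_tokens
    return scheduled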


paddle-bot commented on Sep 29, 2025

Thanks for your contribution!

# Rationalize KV cache allocation: use the MLA-specific cache layout when the
# MLA attention backend is selected via the FD_ATTENTION_BACKEND env var.
from fastdeploy import envs

self.mla_cache = envs.FD_ATTENTION_BACKEND == "MLA_ATTN"
Collaborator


Is this environment variable set automatically for models that use MLA, or does it need to be set manually?

Collaborator Author


Currently it is set manually in the launch script with export FD_ATTENTION_BACKEND="MLA_ATTN".
Later, the backend will be selected automatically based on the model_type in config.json; that change is planned to be submitted together with enabling tensor_core by default for MLA.
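For reference, below is a hypothetical sketch of that planned automatic selection. The model-type set, the default backend name, and the model directory path are assumptions for illustration only, not FastDeploy's actual configuration.

import json
import os

# Assumption: model types that use MLA; purely illustrative.
MLA_MODEL_TYPES = {"deepseek_v2", "deepseek_v3"}

def select_attention_backend(model_dir: str) -> str:
    """Pick the attention backend from the model_type field in config.json."""
    with open(os.path.join(model_dir, "config.json")) as f:
        model_type = json.load(f).get("model_type", "")
    if model_type in MLA_MODEL_TYPES:
        return "MLA_ATTN"
    return "APPEND_ATTN"  # assumption: name of the non-MLA default backend

if __name__ == "__main__":
    # Respect a manual export if the user already set FD_ATTENTION_BACKEND.
    os.environ.setdefault("FD_ATTENTION_BACKEND", select_attention_backend("./model"))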

Collaborator

@gongshaotian left a comment


KVCache creation should later be moved into the Attention Backend and handled there.

@chang-wenbin merged commit 48fd5d7 into PaddlePaddle:develop on Oct 9, 2025
34 of 41 checks passed
@chang-wenbin changed the title from "Support MLA_CACHE & Fix V1_Schedule Bug" to "【Inference Optimize】Support MLA_CACHE & Fix V1_Schedule Bug" on Oct 9, 2025
