Revert blockwise CUDAGraph and support piecewise CUDAGraph in prefill #7969
ZhangX-21 wants to merge 3 commits into
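CI report generated from the following code (refreshed every 30 minutes):
1 Required tasks: 7/10 passed
2 Failure details
🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)
Failed cases:
Key logs:
The PR, in
Suggested fix:
Related changes:
🔴 Approval: approval required (confidence: high)
This job requires manual approval; CI continues only after the approval is complete.
🔴 Run Four Cards Tests / run_4_cards_tests: PR issue (confidence: medium)
Failed cases:
Key logs:
The PR adds
Suggested fix:
Related changes: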
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7969 +/- ##
==========================================
Coverage ? 67.45%
==========================================
Files ? 466
Lines ? 65196
Branches ? 10015
==========================================
Hits ? 43976
Misses ? 18382
Partials ? 2838
29f8a1e to 342bf2d
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-23 16:56:35
📋 Review summary
PR overview: Reverts block-wise CUDAGraph and wires a piecewise CUDAGraph capture path into the PD prefill worker.
Scope of changes: GraphOptimizationConfig/CUDAGraph backend, GPU worker warmup, MoE/Attention/GLM4_MoE call sites, and removal of the block-wise graph code and tests.
Impact tags: [FDConfig] [Graph Optimization] [Executor] [OP] [Models]
Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/config.py:2237 | In CPU cache/H2D swap scenarios, piecewise prefill CUDAGraph is still force-enabled, bypassing the later compatibility disable logic |
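Status of earlier findings
| Finding | Issue | Status |
|---|---|---|
| F1 | Both sides of the isinstance branch contain identical code | |
| F2 | Commented-out assertion code should be deleted, or kept with a TODO explaining why | ✅ Fixed |
| F3 | in_prefill_cudagraph_mode currently has no consumer | |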
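📝 PR convention check
The title lacks an official tag. The description is structurally complete, but Usage or Command and Accuracy Tests contain only placeholder comments.
Suggested title (can be copied as-is):
[Graph Optimization] Revert blockwise CUDAGraph and support piecewise CUDAGraph in prefill
Suggested PR description (can be copied as-is):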
## Motivation
This PR reverts the previous blockwise CUDAGraph implementation and adds support for piecewise CUDAGraph in the prefill phase. The blockwise approach captured per-layer graphs which fragmented SOT-compiled graphs; the piecewise approach captures reusable sub-graph segments during prefill without graph fragmentation.
## Modifications
- Revert blockwise CUDAGraph related logic (remove `cuda_graph_op.py`, env vars `FD_USE_BLOCK_WISE_CUDA_GRAPH` / `FD_BLOCK_WISE_CUDA_GRAPH_SIZES`).
- Add `prefill_cudagraph_guard` to skip block-wise wrappers during prefill capture.
- Extend prefill capture sizes up to 8192 tokens in `config.py`.
- Refactor `RMSNorm.forward`: remove dtype cast, use dynamic `max_chunk_tokens` for allreduce fusion.
- Pass `max_token_num` to `flashinfer_allreduce_residual_rmsnorm` to match workspace allocation.
- Keep decode CUDAGraph behavior unchanged.
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
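- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
This round was prioritized by risk and covered the CUDAGraph/prefill/PD warmup path, the config interactions, the MoE/Attention/model changes, and the earlier findings. The main flow of piecewise prefill is sound in direction, but the disable linkage for CPU cache/H2D swap must be added first; otherwise prefill graphs will still be captured in a scenario that is explicitly incompatible.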
    # reuse the mixed piecewise path (capture_model_prefill_and_mixed) for the prefill worker.
    # Otherwise fall back to cudagraph_only_prefill flag (legacy path).
    if self.graph_opt_config.graph_opt_level >= 1 and not self.graph_opt_config.full_cuda_graph:
        self.graph_opt_config.use_cudagraph = True
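🔴 Bug: In PD prefill piecewise mode, this code unconditionally sets use_cudagraph to True, but the CPU cache compatibility logic further down only checks cudagraph_only_prefill.
When cache_config.num_cpu_blocks is nonzero, the comment below already states that layer-by-layer swap/H2D is incompatible with CUDA Graph prefill capture. In this branch, however, cudagraph_only_prefill keeps its default of False, so the disable logic at lines 2246-2259 is bypassed. gpu_worker.graph_optimize_and_warm_up_model() then still calls capture_model_prefill_and_mixed() under use_piecewise and not is_pd_decode, enabling the prefill CUDAGraph that should have been disabled.
Suggested fix: bring piecewise prefill under the same compatibility check. For example, do not turn on use_cudagraph when num_cpu_blocks is nonzero and splitwise_role == "prefill" and graph_opt_level >= 1 and not full_cuda_graph; alternatively, in the later CPU cache compatibility block, also reset use_cudagraph to False and skip prefill capture.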
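The suggested fix can be sketched in isolation as follows. This is a minimal illustration, not the actual code in fastdeploy/config.py: the `GraphOptSketch` dataclass and the helper functions `wants_piecewise_prefill` and `resolve_prefill_cudagraph` are hypothetical stand-ins, the field defaults are assumed, and only the field names and conditions quoted in the review are taken from the PR.

```python
from dataclasses import dataclass


@dataclass
class GraphOptSketch:
    """Hypothetical stand-in holding only the fields named in the review."""
    graph_opt_level: int = 0
    full_cuda_graph: bool = True
    use_cudagraph: bool = False
    cudagraph_only_prefill: bool = False


def wants_piecewise_prefill(cfg: GraphOptSketch, splitwise_role: str) -> bool:
    # Same condition the review quotes for the PD prefill piecewise branch.
    return (
        splitwise_role == "prefill"
        and cfg.graph_opt_level >= 1
        and not cfg.full_cuda_graph
    )


def resolve_prefill_cudagraph(
    cfg: GraphOptSketch, splitwise_role: str, num_cpu_blocks: int
) -> GraphOptSketch:
    """Apply the CPU cache compatibility rule to both prefill capture paths."""
    cpu_cache_enabled = num_cpu_blocks != 0

    if wants_piecewise_prefill(cfg, splitwise_role):
        # Only enable piecewise prefill capture when no CPU cache / H2D swap
        # is configured; otherwise reset the flag so warmup skips the capture.
        cfg.use_cudagraph = not cpu_cache_enabled

    if cpu_cache_enabled:
        # Legacy path: the existing compatibility logic already disables this.
        cfg.cudagraph_only_prefill = False

    return cfg


if __name__ == "__main__":
    no_swap = resolve_prefill_cudagraph(
        GraphOptSketch(graph_opt_level=1, full_cuda_graph=False), "prefill", 0
    )
    with_swap = resolve_prefill_cudagraph(
        GraphOptSketch(graph_opt_level=1, full_cuda_graph=False), "prefill", 128
    )
    print(no_swap.use_cudagraph, with_swap.use_cudagraph)
```

Keeping both prefill paths behind one `cpu_cache_enabled` check means a later change to the compatibility rule only has to be made in one place, which is the linkage the review asks for.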
Motivation
This PR reverts the previous blockwise CUDAGraph implementation and adds support for piecewise CUDAGraph in the prefill phase.
Modifications
Usage or Command
Accuracy Tests
Checklist
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- Format your code, run `pre-commit` before commit.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.