Revert blockwise CUDAGraph and support piecewise CUDAGraph in prefill by ZhangX-21 · Pull Request #7969 · PaddlePaddle/FastDeploy

ZhangX-21 · 2026-06-02T05:22:53Z

Motivation

This PR reverts the previous blockwise CUDAGraph implementation and adds support for piecewise CUDAGraph in the prefill phase.

Modifications

Revert blockwise CUDAGraph related logic.
Support piecewise CUDAGraph for prefill.
Capture reusable graph segments inside the prefill phase.
Refactor prefill CUDAGraph capture/replay control flow.
Keep decode CUDAGraph behavior unchanged.
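
The capture/replay control flow listed above can be pictured with a toy bookkeeping model. This is only a sketch under stated assumptions, not the code in this PR: `PiecewiseRunnerSketch`, `padded_size`, and the capture sizes are hypothetical names and values, no CUDA is involved, and "capturing" a segment is simulated by storing the callable. It shows the general idea that capturable segments are recorded once per padded token count and replayed afterwards, while non-capturable segments always run eagerly between them.

```python
import bisect


def padded_size(num_tokens, capture_sizes):
    """Return the smallest capture size >= num_tokens, or None if none fits."""
    i = bisect.bisect_left(capture_sizes, num_tokens)
    return capture_sizes[i] if i < len(capture_sizes) else None


class PiecewiseRunnerSketch:
    """Toy bookkeeping for piecewise capture/replay (hypothetical, no CUDA)."""

    def __init__(self, segments, capture_sizes):
        # segments: list of (callable, capturable) pairs, executed in order.
        self.segments = segments
        self.capture_sizes = sorted(capture_sizes)
        self.captured = {}  # (segment index, padded size) -> stand-in "graph"
        self.events = []    # what happened on each call, for inspection

    def run(self, x, num_tokens):
        size = padded_size(num_tokens, self.capture_sizes)
        for idx, (fn, capturable) in enumerate(self.segments):
            key = (idx, size)
            if not capturable or size is None:
                # Non-capturable segment, or batch larger than any capture size.
                self.events.append(("eager", idx))
            elif key not in self.captured:
                self.captured[key] = fn  # stand-in for a real graph capture
                self.events.append(("capture", idx, size))
            else:
                self.events.append(("replay", idx, size))
            x = fn(x)
        return x
```

In this model a first call with 3 tokens captures the capturable segments at the padded size 4, and a later call with 4 tokens replays them.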

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

PaddlePaddle-bot · 2026-06-02T05:37:21Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-08 13:05:54 UTC+08:00

CI report generated from the following code (refreshed every 30 minutes):
PR commit: 342bf2d | Merge base: 4474188 (branch: develop)

1 Required jobs: 7/10 passed

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
42(0)	42	35	7	0	0	0

Job	Error type	Confidence	Log
`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	PR issue	High	Job
`Approval`	Approval required	High	Job
`Run Four Cards Tests / run_4_cards_tests`	PR issue	Medium	Job

2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage (PR issue, confidence: High)

Failed cases:

Case	Error summary
`tests/model_executor/test_ep.py::test_eprunner_moe_select_noaux_tc_without_redundant`	`TypeError`: `scores + e_score_correction_bias`, where `e_score_correction_bias` is None

Key log:

>       scores_with_bias = scores + e_score_correction_bias
E       TypeError: (InvalidType) __add__(): argument (position 1) must be int, float, bool or Tensor, but got NoneType

fastdeploy/model_executor/layers/moe/moe.py:118: TypeError

Root cause summary: the PR removed assert e_score_correction_bias is not None but did not handle the None case downstream

The PR removed the assertion assert e_score_correction_bias is not None at moe.py:106, which lets None reach the code that follows. When expert_id_to_ep_rank_array is None and not use_fused_cast, the code executes scores + e_score_correction_bias directly; the test case passes gate_correction_bias=None, which triggers the TypeError.

Fix suggestions:

Near line 118 of fastdeploy/model_executor/layers/moe/moe.py, guard against e_score_correction_bias being None: scores_with_bias = scores + e_score_correction_bias if e_score_correction_bias is not None else scores
Or make sure e_score_correction_bias is not None at the moe_select call site in ep.py
Update the test cases accordingly to verify correct behavior when gate_correction_bias=None
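
The first suggestion can be illustrated without Paddle. The sketch below is an assumption-laden stand-in: `add_correction_bias` is a hypothetical helper, and plain Python lists replace the tensors used in moe.py. It only shows the shape of the None guard, where a missing bias leaves the scores unchanged instead of reaching the addition.

```python
from typing import List, Optional


def add_correction_bias(
    scores: List[float], bias: Optional[List[float]]
) -> List[float]:
    """Return scores + bias element-wise, or a copy of scores when bias is None.

    Hypothetical helper: lists stand in for the tensors used in the real code.
    """
    if bias is None:
        # No correction bias configured: selection uses the raw scores.
        return list(scores)
    if len(bias) != len(scores):
        raise ValueError("scores and bias must have the same length")
    return [s + b for s, b in zip(scores, bias)]
```

Whether skipping the bias is the intended semantics for gate_correction_bias=None is a design decision for the PR author; the alternative in the second suggestion keeps the invariant that the bias is always present.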

Related changes:

fastdeploy/model_executor/layers/moe/moe.py:106: removed assert e_score_correction_bias is not None (directly triggers the failure)
fastdeploy/model_executor/layers/moe/ep.py: moved the get_moe_scores import to module level

🔴 Approval (Approval required, confidence: High)

This job requires manual approval; CI continues only after the approval is granted.

🔴 Run Four Cards Tests / run_4_cards_tests (PR issue, confidence: Medium)

Failed cases:

Case	Error summary
`test_GLM_45_AIR_mtp_tp4.py::test_r3_accuracy`	`EOFError: Ran out of input`; paddle.io.load read an empty or truncated pickle file

Key log:

tests/e2e/4cards_cases/test_GLM_45_AIR_mtp_tp4.py:208: in test_r3_accuracy
tests/e2e/utils/rollout_routing_replay_test_utils.py:185: in check_routing_replay_chat_completion
paddle/framework/io.py:1275: in load
paddle/framework/restricted_unpickler.py:227: in safe_load_pickle
E   EOFError: Ran out of input
FAILED tests/e2e/4cards_cases/test_GLM_45_AIR_mtp_tp4.py::test_r3_accuracy
==================== 1 failed, 1 passed in 77.84s (0:01:17) ====================

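Root cause summary: the routing replay pickle file is written empty during the prefill CUDAGraph capture phase

The PR adds the @prefill_cudagraph_guard(True) decorator to capture_model_prefill_and_mixed and introduces a global in_prefill_cudagraph_mode guard. test_r3_accuracy verifies inference accuracy through the routing replay mechanism, which relies on serializing routing data to a pickle file. If the routing replay save logic conditionally skips the file write while in_prefill_cudagraph_mode is active, or if the piecewise CUDAGraph refactor of the prefill path leaves routing data unpersisted, the pickle file ends up empty and triggers EOFError: Ran out of input.

Fix suggestions:

Check whether rollout_routing_replay_test_utils.py and the routing replay manager branch on in_prefill_cudagraph_mode, and make sure routing data is still written correctly during prefill CUDAGraph capture
Investigate whether piecewise CUDAGraph capture in capture_model_prefill_and_mixed conflicts with the routing replay save mechanism (whether CPU-side file I/O is blocked during CUDAGraph capture)
Reproduce locally: start the 4-card GLM-4.5-AIR-MTP-TP4 inference service, then check whether the generated routing replay pickle file is empty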
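
For the local reproduction step, a small check like the following may help distinguish "file never written" from "file corrupted". It is a sketch using only the standard library `pickle` module rather than `paddle.io.load`, and `load_routing_dump` and `raw_load_error` are hypothetical helper names. It demonstrates that unpickling a zero-byte file raises `EOFError`, and how a size check up front gives a clearer message.

```python
import os
import pickle
import tempfile


def load_routing_dump(path):
    """Load a pickled dump, failing early with a clear message if it is empty."""
    if os.path.getsize(path) == 0:
        raise ValueError(f"routing dump {path!r} is empty; was it ever written?")
    with open(path, "rb") as f:
        return pickle.load(f)


def raw_load_error(path):
    """Return the exception raised by a plain pickle.load, or None on success."""
    try:
        with open(path, "rb") as f:
            pickle.load(f)
    except Exception as exc:  # broad on purpose: we only report what happened
        return exc
    return None


# Demonstration on temporary files (paths are created and removed here).
with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "empty.pkl")
    open(empty_path, "wb").close()  # zero-byte file, like the failing case
    empty_error = raw_load_error(empty_path)

    good_path = os.path.join(tmp, "good.pkl")
    with open(good_path, "wb") as f:
        pickle.dump({"layer_0": [1, 3]}, f)
    good_value = load_routing_dump(good_path)
```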

Related changes:

fastdeploy/worker/gpu_model_runner.py: added @prefill_cudagraph_guard(True) to capture_model_prefill_and_mixed
fastdeploy/model_executor/graph_optimization/utils.py: added prefill_cudagraph_guard and in_prefill_cudagraph_mode
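
To make the guard mechanism discussed above concrete, here is one way a decorator of this kind could be written. The names `prefill_cudagraph_guard`, `in_prefill_cudagraph_mode`, and `capture_model_prefill_and_mixed` come from the report, but the bodies below are assumptions for illustration, not the code in utils.py; in particular, treating `in_prefill_cudagraph_mode` as a function that reads a module-level flag is a guess.

```python
import functools

# Hypothetical module-level flag; the real storage mechanism is unknown.
_IN_PREFILL_CUDAGRAPH_MODE = False


def in_prefill_cudagraph_mode() -> bool:
    """Report whether a guarded prefill capture is currently running."""
    return _IN_PREFILL_CUDAGRAPH_MODE


def prefill_cudagraph_guard(enabled: bool):
    """Decorator factory: set the flag for the duration of the wrapped call."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            global _IN_PREFILL_CUDAGRAPH_MODE
            previous = _IN_PREFILL_CUDAGRAPH_MODE
            _IN_PREFILL_CUDAGRAPH_MODE = enabled
            try:
                return fn(*args, **kwargs)
            finally:
                # Restore even if capture raises, so the flag cannot leak.
                _IN_PREFILL_CUDAGRAPH_MODE = previous

        return wrapper

    return decorator


@prefill_cudagraph_guard(True)
def capture_model_prefill_and_mixed():
    # Stand-in body: the real method captures graphs; here we only read the flag.
    return in_prefill_cudagraph_mode()
```

Any code that branches on such a flag, for example a save routine that skips file writes while it is set, would behave differently during capture, which is the kind of interaction the fix suggestions ask to rule out.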

codecov-commenter · 2026-06-02T06:09:25Z

Codecov Report

❌ Patch coverage is 62.50000% with 6 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@4474188). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/model_executor/models/glm4_moe.py	0.00%	3 Missing ⚠️
fastdeploy/model_executor/layers/normalization.py	71.42%	1 Missing and 1 partial ⚠️
...tdeploy/model_executor/graph_optimization/utils.py	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7969   +/-   ##
==========================================
  Coverage           ?   67.45%           
==========================================
  Files              ?      466           
  Lines              ?    65196           
  Branches           ?    10015           
==========================================
  Hits               ?    43976           
  Misses             ?    18382           
  Partials           ?     2838
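
The percentages in this report can be cross-checked from the counts. The sketch below assumes coverage is computed as hits divided by hits plus misses plus partials, which matches the totals shown; `coverage_percent` is a hypothetical helper. The figure of 10 covered patch lines is inferred from "62.5% with 6 lines missing" and does not appear in the report itself.

```python
def coverage_percent(hits: int, misses: int, partials: int) -> float:
    """Coverage as a percentage, counting partial lines as not covered."""
    total = hits + misses + partials
    return 100.0 * hits / total


# Project totals taken from the coverage diff above.
overall = coverage_percent(hits=43976, misses=18382, partials=2838)

# Patch: 5 missing + 1 partial from the file table; 10 hits is inferred.
patch = coverage_percent(hits=10, misses=5, partials=1)

print(f"overall={overall:.2f}% patch={patch:.2f}%")
```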

Flag	Coverage Δ
GPU	`77.68% <66.66%> (?)`
XPU	`7.10% <0.00%> (?)`

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-06-23 16:56:35

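📋 Review summary

PR overview: reverts block-wise CUDAGraph and wires a piecewise CUDAGraph capture path into the PD prefill worker.
Scope of changes: GraphOptimizationConfig/CUDAGraph backend, GPU worker warmup, MoE/Attention/GLM4_MoE call sites, and removal of block-wise graph code and tests.
Impact tags: [FDConfig] [Graph Optimization] [Executor] [OP] [Models]

Issues

Severity	File	Summary
🔴 Bug	`fastdeploy/config.py:2237`	In CPU cache/H2D swap scenarios, piecewise prefill CUDAGraph is still force-enabled, bypassing the later compatibility disable

Status of previous findings

Finding	Issue	Status
F1	Both sides of the `isinstance` branch contain identical code	⚠️ Still present
F2	Commented-out assertion code should be deleted or given a TODO explaining why it is kept	✅ Fixed
F3	`in_prefill_cudagraph_mode` currently has no consumer	⚠️ Still present

📝 PR convention check

The title lacks an official tag. The description structure is complete, but Usage or Command and Accuracy Tests contain only placeholder comments.

Suggested title (ready to copy):

[Graph Optimization] Revert blockwise CUDAGraph and support piecewise CUDAGraph in prefill

Suggested PR description (ready to copy)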

## Motivation

This PR reverts the previous blockwise CUDAGraph implementation and adds support for piecewise CUDAGraph in the prefill phase. The blockwise approach captured per-layer graphs which fragmented SOT-compiled graphs; the piecewise approach captures reusable sub-graph segments during prefill without graph fragmentation.

## Modifications

- Revert blockwise CUDAGraph related logic (remove `cuda_graph_op.py`, env vars `FD_USE_BLOCK_WISE_CUDA_GRAPH` / `FD_BLOCK_WISE_CUDA_GRAPH_SIZES`).
- Add `prefill_cudagraph_guard` to skip block-wise wrappers during prefill capture.
- Extend prefill capture sizes up to 8192 tokens in `config.py`.
- Refactor `RMSNorm.forward`: remove dtype cast, use dynamic `max_chunk_tokens` for allreduce fusion.
- Pass `max_token_num` to `flashinfer_allreduce_residual_rmsnorm` to match workspace allocation.
- Keep decode CUDAGraph behavior unchanged.

## Usage or Command

N/A

## Accuracy Tests

N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

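Overall assessment

This round was prioritized by risk and covered the CUDAGraph/prefill/PD warmup chain, config coupling, MoE/Attention/model changes, and previous findings. The main piecewise prefill flow is clear in direction, but the CPU cache/H2D swap disable logic must be wired in first; otherwise prefill graphs will keep being captured in a scenario that is explicitly incompatible.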

PaddlePaddle-bot · 2026-06-23T08:58:35Z

+            # reuse the mixed piecewise path (capture_model_prefill_and_mixed) for the prefill worker.
+            # Otherwise fall back to cudagraph_only_prefill flag (legacy path).
+            if self.graph_opt_config.graph_opt_level >= 1 and not self.graph_opt_config.full_cuda_graph:
+                self.graph_opt_config.use_cudagraph = True


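🔴 Bug: In PD prefill piecewise mode, this unconditionally sets use_cudagraph to True, but the CPU cache compatibility logic further down only checks cudagraph_only_prefill.

When cache_config.num_cpu_blocks is non-zero, the comment below already states that layer-by-layer swap/H2D is incompatible with CUDA Graph prefill capture. In this branch, however, cudagraph_only_prefill defaults to False, so the disable logic at lines 2246-2259 is bypassed. gpu_worker.graph_optimize_and_warm_up_model() then still calls capture_model_prefill_and_mixed() under use_piecewise and not is_pd_decode, enabling a prefill CUDAGraph that should have been disabled.

Suggested fix: bring piecewise prefill under the same compatibility check. For example, do not turn on use_cudagraph when num_cpu_blocks is non-zero and splitwise_role == "prefill" and graph_opt_level >= 1 and not full_cuda_graph, or have the CPU cache compatibility block below also reset use_cudagraph to False and skip prefill capture.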
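
One possible shape for the first option is sketched below. It is not the code in config.py: `GraphOptSketch` and `resolve_prefill_cudagraph` are hypothetical, the field names mirror those quoted in the review, and the default values are assumptions. The point is only that the piecewise prefill branch and the CPU cache check are evaluated together, so the flag is never switched on in the incompatible case.

```python
from dataclasses import dataclass


@dataclass
class GraphOptSketch:
    """Hypothetical stand-in for the graph optimization config fields."""

    graph_opt_level: int = 0
    full_cuda_graph: bool = True
    use_cudagraph: bool = False
    cudagraph_only_prefill: bool = False


def resolve_prefill_cudagraph(
    cfg: GraphOptSketch, splitwise_role: str, num_cpu_blocks: int
) -> GraphOptSketch:
    """Decide prefill CUDAGraph flags with the CPU cache check applied first."""
    wants_piecewise_prefill = (
        splitwise_role == "prefill"
        and cfg.graph_opt_level >= 1
        and not cfg.full_cuda_graph
    )
    if num_cpu_blocks > 0:
        # Layer-by-layer swap/H2D is incompatible with prefill graph capture,
        # so both the legacy flag and the piecewise path are turned off.
        cfg.cudagraph_only_prefill = False
        if wants_piecewise_prefill:
            cfg.use_cudagraph = False
        return cfg
    if wants_piecewise_prefill:
        cfg.use_cudagraph = True
    return cfg
```

With this ordering a prefill worker with CPU cache blocks ends up with use_cudagraph set to False, while a worker in any other role keeps whatever value it already had.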

ZhangX-21 had a problem deploying to Metax_ci June 2, 2026 05:22 — with GitHub Actions Failure

Revert blockwise CUDAGraph and support piecewise CUDAGraph in prefill

342bf2d

ZhangX-21 force-pushed the piecewise_cudagraph branch from 29f8a1e to 342bf2d Compare June 2, 2026 08:17

ZhangX-21 had a problem deploying to Metax_ci June 2, 2026 08:17 — with GitHub Actions Failure

Support piecewise CUDAGraph for MTP execution

21a35d9

ZhangX-21 had a problem deploying to Metax_ci June 9, 2026 03:15 — with GitHub Actions Failure

support PD separate P worker piecewise cudagraph

7bee9e2

ZhangX-21 had a problem deploying to Metax_ci June 23, 2026 08:43 — with GitHub Actions Failure

PaddlePaddle-bot suggested changes Jun 23, 2026

ZhangX-21 wants to merge 3 commits into PaddlePaddle:develop from ZhangX-21:piecewise_cudagraph

3 participants
