
Revert blockwise CUDAGraph and support piecewise CUDAGraph in prefill #7969

Open
ZhangX-21 wants to merge 3 commits into PaddlePaddle:develop from ZhangX-21:piecewise_cudagraph

Conversation

@ZhangX-21

Contributor

Motivation

This PR reverts the previous blockwise CUDAGraph implementation and adds support for piecewise CUDAGraph in the prefill phase.

Modifications

  • Revert blockwise CUDAGraph related logic.
  • Support piecewise CUDAGraph for prefill.
  • Capture reusable graph segments inside the prefill phase.
  • Refactor prefill CUDAGraph capture/replay control flow.
  • Keep decode CUDAGraph behavior unchanged.

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@PaddlePaddle-bot

PaddlePaddle-bot commented Jun 2, 2026


🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-08 13:05:54 UTC+08:00

CI report generated from the following code (refreshed every 30 minutes):
PR commit: 342bf2d | Merge base: 4474188 (branch: develop)


1 Required jobs: 7/10 passed

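| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 42 (0) | 42 | 35 | 7 | 0 | 0 | 0 |

| Job | Error type | Confidence | Log |
| --- | --- | --- | --- |
| Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | PR issue | High | Job |
| Approval | Approval required | High | Job |
| Run Four Cards Tests / run_4_cards_tests | PR issue | Medium | Job |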

2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)

Failed cases:

| Case | Error summary |
| --- | --- |
| tests/model_executor/test_ep.py::test_eprunner_moe_select_noaux_tc_without_redundant | TypeError: scores + e_score_correction_bias fails because e_score_correction_bias is None |

Key log:

>       scores_with_bias = scores + e_score_correction_bias
E       TypeError: (InvalidType) __add__(): argument (position 1) must be int, float, bool or Tensor, but got NoneType

fastdeploy/model_executor/layers/moe/moe.py:118: TypeError
  • Root cause summary: the PR removed assert e_score_correction_bias is not None but did not handle the downstream logic for the None case.

At moe.py:106 the PR removed the assertion assert e_score_correction_bias is not None, which lets None flow into the code that follows. When expert_id_to_ep_rank_array is None and not use_fused_cast holds, the code executes scores + e_score_correction_bias directly. The test case passes gate_correction_bias=None, which triggers the TypeError.

Suggested fixes:

  1. Near line 118 of fastdeploy/model_executor/layers/moe/moe.py, guard against e_score_correction_bias being None: scores_with_bias = scores + e_score_correction_bias if e_score_correction_bias is not None else scores
  2. Alternatively, make sure e_score_correction_bias is not None at the moe_select call site in ep.py.
  3. Update the test case accordingly to verify correct behavior when gate_correction_bias=None.

Related changes:

  • fastdeploy/model_executor/layers/moe/moe.py:106: removed assert e_score_correction_bias is not None (the direct cause of the failure)
  • fastdeploy/model_executor/layers/moe/ep.py: moved the get_moe_scores import to module level
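
The first suggested fix above can be sketched as follows. This is a minimal, framework-free illustration, not the code in moe.py: apply_correction_bias is a hypothetical helper, and plain Python lists stand in for the Paddle tensors used there.

```python
from typing import List, Optional


def apply_correction_bias(
    scores: List[float],
    e_score_correction_bias: Optional[List[float]],
) -> List[float]:
    """Hypothetical stand-in for the bias step in moe.py (lists replace tensors)."""
    if e_score_correction_bias is None:
        # No bias configured: fall back to the raw scores instead of
        # letting `scores + None` raise a TypeError.
        return list(scores)
    # Element-wise addition, mirroring `scores + e_score_correction_bias`.
    return [s + b for s, b in zip(scores, e_score_correction_bias)]


print(apply_correction_bias([0.1, 0.7], None))        # bias absent
print(apply_correction_bias([1.0, 2.0], [0.5, 0.5]))  # bias present
```

Whether falling back to the unbiased scores is the right semantics for the noaux_tc path is a question for the PR author; the sketch only shows the shape of the guard.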
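🔴 Approval: approval required (confidence: high)

This job requires manual approval; CI continues only after the approval is granted.

🔴 Run Four Cards Tests / run_4_cards_tests: PR issue (confidence: medium)

Failed cases:

| Case | Error summary |
| --- | --- |
| test_GLM_45_AIR_mtp_tp4.py::test_r3_accuracy | EOFError: Ran out of input; paddle.io.load read an empty or truncated pickle file |

Key log: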

tests/e2e/4cards_cases/test_GLM_45_AIR_mtp_tp4.py:208: in test_r3_accuracy
tests/e2e/utils/rollout_routing_replay_test_utils.py:185: in check_routing_replay_chat_completion
paddle/framework/io.py:1275: in load
paddle/framework/restricted_unpickler.py:227: in safe_load_pickle
E   EOFError: Ran out of input
FAILED tests/e2e/4cards_cases/test_GLM_45_AIR_mtp_tp4.py::test_r3_accuracy
==================== 1 failed, 1 passed in 77.84s (0:01:17) ====================
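  • Root cause summary: the routing replay pickle file was written empty during the prefill CUDAGraph capture phase.

The PR decorates capture_model_prefill_and_mixed with @prefill_cudagraph_guard(True) and introduces a global in_prefill_cudagraph_mode guard. test_r3_accuracy verifies inference accuracy through the routing replay mechanism, which depends on serializing routing data into a pickle file. If the routing replay save logic conditionally skips the file write while in_prefill_cudagraph_mode is active, or if the piecewise CUDAGraph refactoring of the prefill path leaves the routing data unpersisted, the pickle file ends up empty and triggers EOFError: Ran out of input.

Suggested fixes:

  1. Check whether rollout_routing_replay_test_utils.py and the routing replay manager branch on in_prefill_cudagraph_mode, and make sure routing data is still written correctly during the prefill CUDAGraph capture phase.
  2. Investigate whether piecewise CUDAGraph capture in capture_model_prefill_and_mixed conflicts with the routing replay save mechanism (for example, whether CPU-side file I/O is blocked during CUDAGraph capture).
  3. Reproduce locally: start the 4-card GLM-4.5-AIR-MTP-TP4 inference service, then check whether the generated routing replay pickle file is empty.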
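
For the local check in step 3, the following sketch shows why an empty dump produces this exact error and how a size check surfaces the problem earlier. It uses the standard pickle module as a stand-in for paddle.io.load, and load_routing_replay is a hypothetical helper, not part of the test utilities.

```python
import os
import pickle
import tempfile


def load_routing_replay(path: str):
    """Hypothetical loader: fail with a clear message if the dump is empty."""
    if os.path.getsize(path) == 0:
        raise ValueError(f"routing replay file is empty: {path}")
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "empty.pkl")
    open(empty_path, "wb").close()  # simulate a dump that was never written

    try:
        with open(empty_path, "rb") as f:
            pickle.load(f)  # unpickling zero bytes raises EOFError
    except EOFError as exc:
        raw_error = str(exc)

    try:
        load_routing_replay(empty_path)
    except ValueError as exc:
        checked_error = str(exc)

print(raw_error)
print(checked_error)
```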

Related changes:

  • fastdeploy/worker/gpu_model_runner.py: added @prefill_cudagraph_guard(True) to capture_model_prefill_and_mixed
  • fastdeploy/model_executor/graph_optimization/utils.py: added prefill_cudagraph_guard and in_prefill_cudagraph_mode
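
To make the guard pattern referenced above concrete, here is one way such a decorator and flag could be built. This is an assumption-laden sketch, not the code in graph_optimization/utils.py: the PR's actual implementation is not shown on this page, and in particular in_prefill_cudagraph_mode is modeled here as a function reading a hypothetical module-level flag.

```python
from contextlib import ContextDecorator

# Hypothetical module-level flag; the real storage in utils.py may differ.
_IN_PREFILL_CUDAGRAPH_MODE = False


def in_prefill_cudagraph_mode() -> bool:
    """Report whether a prefill CUDAGraph guard is currently active."""
    return _IN_PREFILL_CUDAGRAPH_MODE


class prefill_cudagraph_guard(ContextDecorator):
    """Set the flag for the duration of a call or `with` block, then restore it."""

    def __init__(self, enabled: bool):
        self.enabled = enabled
        self._previous = False

    def __enter__(self):
        global _IN_PREFILL_CUDAGRAPH_MODE
        self._previous = _IN_PREFILL_CUDAGRAPH_MODE
        _IN_PREFILL_CUDAGRAPH_MODE = self.enabled
        return self

    def __exit__(self, exc_type, exc, tb):
        global _IN_PREFILL_CUDAGRAPH_MODE
        _IN_PREFILL_CUDAGRAPH_MODE = self._previous
        return False  # never swallow exceptions raised during capture


@prefill_cudagraph_guard(True)
def capture_model_prefill_and_mixed() -> bool:
    # Stand-in body: the real method captures graphs; here we only read the flag.
    return in_prefill_cudagraph_mode()


print(capture_model_prefill_and_mixed())  # flag as seen inside the guarded call
print(in_prefill_cudagraph_mode())        # flag after the call returns
```

With a design like this, any code that consults the flag (for instance, a routing replay writer) behaves differently only while the guarded function runs, which is the interaction the suggested fixes ask to audit.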

PaddlePaddle-bot

This comment was marked as outdated.

@codecov-commenter

codecov-commenter commented Jun 2, 2026


Codecov Report

❌ Patch coverage is 62.50000% with 6 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@4474188). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/model_executor/models/glm4_moe.py | 0.00% | 3 Missing ⚠️ |
| fastdeploy/model_executor/layers/normalization.py | 71.42% | 1 Missing and 1 partial ⚠️ |
| ...tdeploy/model_executor/graph_optimization/utils.py | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7969   +/-   ##
==========================================
  Coverage           ?   67.45%           
==========================================
  Files              ?      466           
  Lines              ?    65196           
  Branches           ?    10015           
==========================================
  Hits               ?    43976           
  Misses             ?    18382           
  Partials           ?     2838           

| Flag | Coverage Δ |
| --- | --- |
| GPU | 77.68% <66.66%> (?) |
| XPU | 7.10% <0.00%> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.


PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-06-23 16:56:35

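📋 Review summary

PR overview: Reverts block-wise CUDAGraph and wires a piecewise CUDAGraph capture path into the PD prefill worker.
Scope of changes: GraphOptimizationConfig/CUDAGraph backend, GPU worker warmup, MoE/Attention/GLM4_MoE call sites, and removal of the block-wise graph code and tests.
Impact tags: [FDConfig] [Graph Optimization] [Executor] [OP] [Models]

Issues

| Severity | File | Summary |
| --- | --- | --- |
| 🔴 Bug | fastdeploy/config.py:2237 | In CPU cache/H2D swap scenarios, piecewise prefill CUDAGraph is still force-enabled, bypassing the compatibility disable logic that follows |

Status of previous findings

| Finding | Issue | Status |
| --- | --- | --- |
| F1 | Both sides of the isinstance branch contain identical code | ⚠️ Still present |
| F2 | The commented-out assertion should be deleted, or kept with a TODO explaining why | ✅ Fixed |
| F3 | in_prefill_cudagraph_mode currently has no consumer | ⚠️ Still present |

📝 PR convention check

The title lacks an official tag. The description is structurally complete, but Usage or Command and Accuracy Tests contain only placeholder comments.

Suggested title (can be copied as-is):

  • [Graph Optimization] Revert blockwise CUDAGraph and support piecewise CUDAGraph in prefill

Suggested PR description (can be copied as-is):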
## Motivation

This PR reverts the previous blockwise CUDAGraph implementation and adds support for piecewise CUDAGraph in the prefill phase. The blockwise approach captured per-layer graphs which fragmented SOT-compiled graphs; the piecewise approach captures reusable sub-graph segments during prefill without graph fragmentation.

## Modifications

- Revert blockwise CUDAGraph related logic (remove `cuda_graph_op.py`, env vars `FD_USE_BLOCK_WISE_CUDA_GRAPH` / `FD_BLOCK_WISE_CUDA_GRAPH_SIZES`).
- Add `prefill_cudagraph_guard` to skip block-wise wrappers during prefill capture.
- Extend prefill capture sizes up to 8192 tokens in `config.py`.
- Refactor `RMSNorm.forward`: remove dtype cast, use dynamic `max_chunk_tokens` for allreduce fusion.
- Pass `max_token_num` to `flashinfer_allreduce_residual_rmsnorm` to match workspace allocation.
- Keep decode CUDAGraph behavior unchanged.

## Usage or Command

N/A

## Accuracy Tests

N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

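Overall assessment

This round was prioritized by risk and covered the CUDAGraph/prefill/PD warmup chain, config interactions, the MoE/Attention/Model changes, and the previous findings. The main flow of piecewise prefill is heading in a clear direction, but the disable logic for CPU cache/H2D swap needs to be wired in first; otherwise prefill graphs will still be captured in a scenario that is explicitly incompatible.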

Comment thread fastdeploy/config.py
# reuse the mixed piecewise path (capture_model_prefill_and_mixed) for the prefill worker.
# Otherwise fall back to cudagraph_only_prefill flag (legacy path).
if self.graph_opt_config.graph_opt_level >= 1 and not self.graph_opt_config.full_cuda_graph:
    self.graph_opt_config.use_cudagraph = True


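🔴 Bug: In PD prefill piecewise mode this unconditionally sets use_cudagraph to True, but the CPU cache compatibility logic further down only checks cudagraph_only_prefill.

When cache_config.num_cpu_blocks is non-zero, the comment below already states that layer-by-layer swap/H2D is incompatible with CUDA Graph prefill capture. This branch, however, leaves cudagraph_only_prefill at its default of False, so it bypasses the disable logic at lines 2246-2259. Subsequently gpu_worker.graph_optimize_and_warm_up_model() still calls capture_model_prefill_and_mixed() under use_piecewise and not is_pd_decode, enabling the prefill CUDAGraph that should have been disabled.

Suggested fix: bring piecewise prefill under the same compatibility check. For example, when num_cpu_blocks is non-zero and splitwise_role == "prefill" and graph_opt_level >= 1 and not full_cuda_graph, do not turn on use_cudagraph; or, in the CPU cache compatibility block further down, also reset use_cudagraph to False and skip prefill capture.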
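
The suggested compatibility check can be sketched as a small standalone function. Everything here is a simplified assumption for illustration: GraphOptConfig and resolve_prefill_cudagraph are hypothetical stand-ins, the field defaults are invented, and the real logic lives inline in fastdeploy/config.py with more conditions than shown.

```python
from dataclasses import dataclass


@dataclass
class GraphOptConfig:
    """Hypothetical, simplified stand-in for the graph optimization config."""
    graph_opt_level: int = 0
    full_cuda_graph: bool = True
    use_cudagraph: bool = False
    cudagraph_only_prefill: bool = False


def resolve_prefill_cudagraph(
    cfg: GraphOptConfig, splitwise_role: str, num_cpu_blocks: int
) -> GraphOptConfig:
    """Decide prefill CUDAGraph settings with the CPU cache check applied first."""
    wants_piecewise_prefill = (
        splitwise_role == "prefill"
        and cfg.graph_opt_level >= 1
        and not cfg.full_cuda_graph
    )
    if num_cpu_blocks > 0:
        # CPU cache enabled: layer-by-layer swap/H2D is incompatible with
        # prefill capture, so disable both the legacy and the piecewise path.
        cfg.cudagraph_only_prefill = False
        if wants_piecewise_prefill:
            cfg.use_cudagraph = False
        return cfg
    if wants_piecewise_prefill:
        cfg.use_cudagraph = True
    return cfg


piecewise = GraphOptConfig(graph_opt_level=1, full_cuda_graph=False)
print(resolve_prefill_cudagraph(piecewise, "prefill", num_cpu_blocks=0).use_cudagraph)

with_cpu_cache = GraphOptConfig(graph_opt_level=1, full_cuda_graph=False)
print(resolve_prefill_cudagraph(with_cpu_cache, "prefill", num_cpu_blocks=8).use_cudagraph)
```

Evaluating the CPU cache condition before the piecewise branch keeps a single place where prefill capture is turned off, which is the ordering problem this comment describes.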



3 participants