[XPU] fix_same_req_id #8040

Open
cmcamdy wants to merge 1 commit into PaddlePaddle:develop from cmcamdy:fix_same_req_id
@cmcamdy cmcamdy commented Jun 11, 2026

Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

codecov-commenter commented Jun 11, 2026

Codecov Report

❌ Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@fab344e). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/engine/common_engine.py | 50.00% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #8040   +/-   ##
==========================================
  Coverage           ?   67.72%           
==========================================
  Files              ?      471           
  Lines              ?    66361           
  Branches           ?    10217           
==========================================
  Hits               ?    44946           
  Misses             ?    18546           
  Partials           ?     2869           
| Flag | Coverage Δ |
|---|---|
| GPU | 77.79% <80.00%> (?) |
| XPU | 6.99% <0.00%> (?) |

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-11 17:13:21

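📋 Review Summary

PR overview: Adds logic that rejects a duplicate request_id when the PD decode side preallocates resources, and preserves the error reason returned by the D side.
Change scope: fastdeploy/engine/common_engine.py, fastdeploy/engine/sched/resource_manager_v1.py
Affected tags: [Engine] [Scheduler] [PD Disaggregation]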
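
The behavior summarized above can be illustrated with a minimal sketch. This is not the FastDeploy code: `ToyDecodeResourceManager`, `Request`, `free_blocks`, `need_blocks`, and `failure_reason` are hypothetical names introduced for illustration. Only `preallocate_resource_in_d`, `self.requests`, `error_msg`, and the two message strings come from this PR page.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

NOT_ENOUGH_RESOURCES = "Not enough resources"
DUPLICATE_REQUEST_ID = "Duplicate request id in decode"


@dataclass
class Request:
    """Hypothetical, simplified request record."""
    request_id: str
    error_msg: Optional[str] = None


@dataclass
class ToyDecodeResourceManager:
    """Toy stand-in for the decode-side resource manager (assumption)."""
    free_blocks: int = 4
    requests: Dict[str, Request] = field(default_factory=dict)

    def preallocate_resource_in_d(self, request: Request, need_blocks: int = 1) -> bool:
        # Reject a request whose id is already tracked, and record why.
        if request.request_id in self.requests:
            request.error_msg = DUPLICATE_REQUEST_ID
            return False
        # Resource shortage: fail without setting a specific reason.
        if need_blocks > self.free_blocks:
            return False
        self.free_blocks -= need_blocks
        self.requests[request.request_id] = request
        return True


def failure_reason(request: Request) -> str:
    """Keep an error already set by the decode side; otherwise use the default."""
    return request.error_msg or NOT_ENOUGH_RESOURCES
```

Note that both failure paths in this sketch return `False`; the caller can only tell them apart by inspecting `error_msg`.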

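Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/engine/sched/resource_manager_v1.py:1596 | In cache-task mode a duplicate request_id is treated as insufficient resources and retried, so P/D wait forever |

📝 PR Convention Check

The title tag is [XPU], but this diff changes the PD decode resource preallocation logic in Engine/Scheduler and does not touch any XPU-specific worker/model_runner/ops. The PR description is still the template placeholder and lacks concrete Motivation/Modifications/Usage/Accuracy Tests content. Consider replacing it with the complete content below.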

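Suggested title (can be copied as-is):

  • [PD Disaggregation] Fix duplicate request id handling in decode preallocation

Suggested PR description (can be copied as-is):

## Motivation
Fix an issue in the P/D disaggregation scenario where the Decode side, on receiving a duplicate request_id, could reuse or corrupt an existing KV cache.

## Modifications
- `fastdeploy/engine/sched/resource_manager_v1.py`: During resource preallocation on the Decode side, check whether `request_id` already exists in `self.requests`; on a duplicate, set an error message and refuse the allocation.
- `fastdeploy/engine/common_engine.py`: When a resource preallocation failure is reported back to Prefill, keep the error reason already set by the Decode side instead of always overwriting it with `Not enough resources`.

## Usage or Command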
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

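Overall assessment

The fix direction prevents the D side from reusing existing blocks for the same request_id, but permanent failures and temporary resource shortages currently share the same False return value, which leaves duplicate requests stuck in cache-task mode. The failure semantics need to be split first, or, when error_msg is already set, the error should be reported back and the request removed from the queue.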
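
One way to split the failure semantics is sketched below, under stated assumptions. `PreallocResult`, `drain_once`, `try_preallocate`, and `report_error` are hypothetical names, and the queue handling is a simplification of whatever the cache-task path actually does. The point is only that a three-valued result lets the caller retry temporary shortages while removing permanently rejected requests from the queue.

```python
from collections import deque
from enum import Enum, auto


class PreallocResult(Enum):
    OK = auto()
    RETRY_LATER = auto()  # temporary: not enough resources right now
    REJECTED = auto()     # permanent: e.g. duplicate request id


def drain_once(queue, try_preallocate, report_error):
    """Process each queued request once and return those still waiting.

    try_preallocate(req) -> PreallocResult and report_error(req) are
    caller-supplied callables (assumptions for this sketch).
    """
    still_waiting = deque()
    while queue:
        req = queue.popleft()
        result = try_preallocate(req)
        if result is PreallocResult.RETRY_LATER:
            # Only temporary failures go back into the waiting queue.
            still_waiting.append(req)
        elif result is PreallocResult.REJECTED:
            # Permanent failures are reported and dropped, never retried.
            report_error(req)
        # PreallocResult.OK: the request leaves the queue normally.
    return still_waiting
```

With a plain boolean return, the `REJECTED` case is indistinguishable from `RETRY_LATER`, which is the condition the review flags as causing an indefinite wait.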

Comment thread fastdeploy/engine/sched/resource_manager_v1.py
PaddlePaddle-bot commented Jun 13, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-24 20:34:30 UTC+08:00

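CI report generated from the following code (updated every 30 minutes):
PR commit: cb98744 | Merge base: fab344e (branch: develop)

1 Required jobs: 9/10 passed

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 41(0) | 41 | 37 | 4 | 0 | 0 | 0 |

| Job | Error type | Confidence | Log |
|---|---|---|---|
| Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | PR issue: incremental coverage below 80% | High | Job |

2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)

Error type: PR issue | Confidence: high
Analyzer: generic analysis (fallback)
Failed cases: no specific failing pytest case could be extracted from the CI log; the exit code of this failure points to the coverage gate.

| Case | Error summary |
|---|---|
| No specific failing case | run_tests_with_coverage returned exit code 9; in the workflow this code is set when diff-cover --fail-under=80 fails |

Key logs: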

[FAILURE]: Process completed with exit code 9.
.github/workflows/_unit_test_coverage.yml: diff-cover python_coverage_all.xml --diff-file=diff.txt --fail-under=80 ... || COVERAGE_EXIT_CODE=9
.github/workflows/_unit_test_coverage.yml: exit "$COVERAGE_EXIT_CODE"
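  • Root cause summary: the newly added branches lack incremental coverage

This PR adds a branch in resource_manager_v1.py that rejects a duplicate request_id, and a branch in common_engine.py that keeps an existing error_msg. Exit code 9 from run_tests_with_coverage corresponds in the workflow to incremental coverage below 80%. The existing related tests only cover the success path of preallocate_resource_in_d and do not cover the new branch where a duplicate request_id returns False and sets error_msg. The deep log did not return diff_coverage.json, so the exact uncovered lines could not be listed, but the coverage risk is concentrated in the lines added by this PR.

Fix suggestions:

  1. In tests/v1/test_resource_manager_v1.py, add a case for the duplicate request_id branch: pre-populate manager_d.requests[request_id], then assert that preallocate_resource_in_d returns False and that request.error_msg == "Duplicate request id in decode"
  2. In tests/engine/test_common_engine.py or the splitwise-related tests, cover the case where decode preallocation fails and error_msg is already set: common_engine.py:2128 must not overwrite it with "Not enough resources", and it must be passed on to send_cache_info_to_prefill

Related changes: fastdeploy/engine/sched/resource_manager_v1.py:1590, fastdeploy/engine/common_engine.py:2128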
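
The two suggested tests could take roughly the following shape. This is a self-contained sketch, not a drop-in for the repository: `FakeManagerD` and `report_prealloc_failure` are hypothetical stand-ins for the real resource manager and the engine code path, and the argument passed to `send_cache_info_to_prefill` (a one-element list) is an assumption. A real test would import the FastDeploy classes and reuse the fixtures in the existing test files.

```python
import unittest
from types import SimpleNamespace
from unittest.mock import Mock


class FakeManagerD:
    """Hypothetical stand-in for the decode-side resource manager."""

    def __init__(self):
        self.requests = {}

    def preallocate_resource_in_d(self, request):
        if request.request_id in self.requests:
            request.error_msg = "Duplicate request id in decode"
            return False
        self.requests[request.request_id] = request
        return True


def report_prealloc_failure(request, connector):
    """Hypothetical stand-in for the engine-side failure path."""
    if not request.error_msg:
        request.error_msg = "Not enough resources"
    # Argument shape is an assumption made for this sketch.
    connector.send_cache_info_to_prefill([request])


def make_request(request_id):
    return SimpleNamespace(request_id=request_id, error_msg=None)


class DuplicateRequestIdTest(unittest.TestCase):
    def test_duplicate_request_id_is_rejected(self):
        manager_d = FakeManagerD()
        manager_d.requests["req-0"] = make_request("req-0")
        request = make_request("req-0")
        self.assertFalse(manager_d.preallocate_resource_in_d(request))
        self.assertEqual(request.error_msg, "Duplicate request id in decode")

    def test_existing_error_msg_is_not_overwritten(self):
        connector = Mock()
        request = make_request("req-0")
        request.error_msg = "Duplicate request id in decode"
        report_prealloc_failure(request, connector)
        self.assertEqual(request.error_msg, "Duplicate request id in decode")
        connector.send_cache_info_to_prefill.assert_called_once_with([request])
```

Using `Mock` for the connector keeps the second test focused on the message that is forwarded, without needing a running Prefill instance.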

3 participants