
Conversation


@liyonghua0910 liyonghua0910 commented Oct 30, 2025

Motivation

Fix the current code failing the profile run in the PD-disaggregated deployment scenario (cherry-picked from #4584).

Modifications

Reworked the startup order of the cache service (the collective name for cache transfer + cache messager) and the worker. `engine.start()` currently launches the two components in one of the following orders:

  1. When the number of blocks is not specified, start the worker first so it can load the weights and run the profile run to determine the actual number of blocks, then start the cache service. One of the worker and the cache service creates the cache, and the other reads it:

     • In the PD-disaggregated scenario, the cache must be transferred through the cache messager, so the actual cache has to be managed by the cache messager. The cache messager creates the cache and exposes its shared-memory pointer via `set_data_ipc`; cache transfer and the worker wait until the cache has been created (`cache_ready_signal=1`) and then read it via `share_external_data`.
     • In the colocated deployment scenario, the cache is not transferred across machines, so no cache messager is needed. On the other hand, the `/clear_load_weight` endpoint must be able to clear both the weights and the cache in one call, which requires reusing the `paddle.empty_cache` operation, so the cache is managed by the worker. The worker creates the cache and exposes its shared-memory pointer via `set_data_ipc`; cache transfer waits until the cache has been created and then reads it via `share_external_data`.

  2. When the number of blocks is specified, start the cache service first so it creates the cache directly, then start the worker to read it.

  3. In the colocated deployment scenario with prefix cache disabled, no cache service is started at all.
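The create-then-wait-then-attach handshake described above can be sketched as follows. This is an illustrative analogue only, using threads and plain Python objects instead of the real cross-process `set_data_ipc`/`share_external_data` shared-memory APIs: the creator side populates the cache and flips `cache_ready_signal` to 1, and the reader side polls the signal before reading.

```python
import threading
import time

# Illustrative analogue of the handshake described above (NOT the actual
# FastDeploy code): the creator fills the cache and raises cache_ready_signal;
# the reader polls the signal before attaching to the shared cache.
cache_ready_signal = {"value": 0}   # stands in for the shared ready signal
cache = [0] * 4                     # stands in for the shared KV-cache buffer
result = {}

def reader():
    # share_external_data analogue: wait until cache_ready_signal == 1,
    # then read the cache the creator set up.
    while cache_ready_signal["value"] != 1:
        time.sleep(0.01)
    result["seen"] = list(cache)

t = threading.Thread(target=reader)
t.start()

# set_data_ipc analogue: the creator populates the cache first,
# and only then flips the ready signal.
for i in range(len(cache)):
    cache[i] = i + 1
cache_ready_signal["value"] = 1
t.join()
print(result["seen"])  # [1, 2, 3, 4]
```

The ordering matters in the same way it does in the PR: flipping the signal before the cache is fully created would let the reader attach to an incomplete buffer.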

The truth table of the startup-order logic across these scenarios:

| Profile | Mixed | PrefixCache | Cache -> Worker | Worker -> Cache |
|---------|-------|-------------|-----------------|-----------------|
| 1       | 1     | 1           | 0               | 1               |
| 1       | 1     | 0           | 0               | 0               |
| 1       | 0     | 1           | 0               | 1               |
| 1       | 0     | 0           | 0               | 1               |
| 0       | 1     | 1           | 0               | 1               |
| 0       | 1     | 0           | 0               | 0               |
| 0       | 0     | 1           | 1               | 0               |
| 0       | 0     | 0           | 1               | 0               |
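The truth table above can be encoded as a small predicate. This is an illustrative sketch only (the actual decision lives in `engine.start()`, and the function name here is hypothetical): `profile` means the block count is unspecified so a profile run is needed, `mixed` means colocated deployment, and `prefix_cache` means prefix caching is enabled.

```python
# Illustrative encoding of the startup-order truth table above
# (the real logic lives in engine.start(), not here).
def startup_order(profile: bool, mixed: bool, prefix_cache: bool):
    """Return (cache_then_worker, worker_then_cache) for a scenario."""
    # Cache service starts first only when the block count is already
    # specified (no profile run) and the deployment is not colocated.
    cache_then_worker = (not profile) and (not mixed)
    # Worker starts first when it must run the profile run in a
    # disaggregated deployment, or when a colocated deployment still
    # needs the cache service for prefix caching.
    worker_then_cache = ((not mixed) and profile) or (mixed and prefix_cache)
    return cache_then_worker, worker_then_cache

# Reproduce every row of the truth table:
# (profile, mixed, prefix_cache) -> (cache_then_worker, worker_then_cache)
table = [
    ((1, 1, 1), (0, 1)),
    ((1, 1, 0), (0, 0)),
    ((1, 0, 1), (0, 1)),
    ((1, 0, 0), (0, 1)),
    ((0, 1, 1), (0, 1)),
    ((0, 1, 0), (0, 0)),
    ((0, 0, 1), (1, 0)),
    ((0, 0, 0), (1, 0)),
]
for (p, m, c), expected in table:
    got = startup_order(bool(p), bool(m), bool(c))
    assert got == (bool(expected[0]), bool(expected[1]))
```

A row where both columns are 0 (colocated, prefix cache disabled) corresponds to case 3 above: no cache service is started at all.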

Usage or Command

```shell
python yiyanadapter/api_server.py \
       --port "9600" \
       --metrics-port "3113" \
       --model "$MODEL_PATH" \
       --engine-worker-queue-port "1477,1478,1479,1480,1481,1482,1483,1484" \
       --pd-comm-port "9800,9801,9802,9803,9804,9805,9806,9807" \
       --tensor-parallel-size 4 \
       --data-parallel-size 16 \
       --max-model-len 65536 \
       --enable-expert-parallel \
       --enable-chunked-prefill \
       --max-num-seqs 5 \
       --workers 1 \
       --scheduler-ttl 9000 \
       --scheduler-topic "test"  \
       --scheduler-host XXXX \
       --scheduler-port 6379 \
       --scheduler-password XXXX \
       --disable-custom-all-reduce \
       --splitwise-role "prefill" \
       --scheduler-name "dp" \
       --gpu-memory-utilization 0.9 \
       --quantization block_wise_fp8 \
       --max-num-batched-tokens 2048 \
       --cache-transfer-protocol "rdma" \
       --rdma-comm-ports "8501,8502,8503,8504,8505,8506,8507,8508" \
       --cache-queue-port "9978" \
       --ips $ip_list \
       --graph-optimization-config '{"use_cudagraph":false,"use_unique_memory_pool":true,"cudagraph_capture_sizes":[1]}' > log/console_fd.log 2>&1 &
```

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. If there are no unit tests, please state the reason in this PR.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Oct 30, 2025

Thanks for your contribution!

@liyonghua0910 liyonghua0910 changed the title from "[Cherry-pick] Fix profile run in pd-disaggregated deployment https://github.com/PaddlePaddle/FastDeploy/pull/4584" to "[Cherry-pick] Fix profile run in pd-disaggregated deployment" Oct 30, 2025
@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 9cf4005 into PaddlePaddle:release/2.3 Oct 31, 2025
42 of 49 checks passed
