
Conversation


@liyonghua0910 liyonghua0910 commented Oct 30, 2025

Motivation

Fix the current code failing the profile run in the PD-disaggregated deployment scenario (cherry-picked from #4584).

Modifications

Reworked the startup order of the cache service (the collective name for cache transfer + cache messager) and the worker. `engine.start()` currently launches the two components in one of the following orders:

  1. When the number of blocks is not specified, start the worker first so it can load the weights and run the profile run to determine the actual number of blocks, then start the cache service. One of the worker and the cache service creates the cache, and the other reads it:

     • In the PD-disaggregated scenario, the cache must be transferred through the cache messager, so the actual cache has to be managed by the cache messager. The cache messager creates the cache and exposes its shared-memory pointer via `set_data_ipc`; cache transfer and the worker wait until the cache has been created (`cache_ready_signal=1`) and then read it via `share_external_data`.
     • In the colocated deployment scenario, the cache is not transferred across machines, so no cache messager is needed. On the other hand, the `/clear_load_weight` endpoint must be able to clear both the weights and the cache in one call, which requires reusing the `paddle.empty_cache` operation, so the cache is managed by the worker. The worker creates the cache and exposes its shared-memory pointer via `set_data_ipc`; cache transfer waits until the cache has been created and then reads it via `share_external_data`.

  2. When the number of blocks is specified, start the cache service first so it creates the cache directly, then start the worker to read it.

  3. In the colocated deployment scenario with prefix cache disabled, no cache service is started at all.
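The create-then-wait-then-attach handshake described above can be sketched as follows. This is an illustrative analogue only, using threads and plain Python objects instead of the real cross-process `set_data_ipc`/`share_external_data` shared-memory APIs: the creator side populates the cache and flips `cache_ready_signal` to 1, and the reader side polls the signal before reading.

```python
import threading
import time

# Illustrative analogue of the handshake described above (NOT the actual
# FastDeploy code): the creator fills the cache and raises cache_ready_signal;
# the reader polls the signal before attaching to the shared cache.
cache_ready_signal = {"value": 0}   # stands in for the shared ready signal
cache = [0] * 4                     # stands in for the shared KV-cache buffer
result = {}

def reader():
    # share_external_data analogue: wait until cache_ready_signal == 1,
    # then read the cache the creator set up.
    while cache_ready_signal["value"] != 1:
        time.sleep(0.01)
    result["seen"] = list(cache)

t = threading.Thread(target=reader)
t.start()

# set_data_ipc analogue: the creator populates the cache first,
# and only then flips the ready signal.
for i in range(len(cache)):
    cache[i] = i + 1
cache_ready_signal["value"] = 1
t.join()
print(result["seen"])  # [1, 2, 3, 4]
```

The ordering matters in the same way it does in the PR: flipping the signal before the cache is fully created would let the reader attach to an incomplete buffer.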

The truth table of the startup-order logic across these scenarios:

| Profile | Mixed | PrefixCache | Cache -> Worker | Worker -> Cache |
|---------|-------|-------------|-----------------|-----------------|
| 1       | 1     | 1           | 0               | 1               |
| 1       | 1     | 0           | 0               | 0               |
| 1       | 0     | 1           | 0               | 1               |
| 1       | 0     | 0           | 0               | 1               |
| 0       | 1     | 1           | 0               | 1               |
| 0       | 1     | 0           | 0               | 0               |
| 0       | 0     | 1           | 1               | 0               |
| 0       | 0     | 0           | 1               | 0               |
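The truth table above can be encoded as a small predicate. This is an illustrative sketch only (the actual decision lives in `engine.start()`, and the function name here is hypothetical): `profile` means the block count is unspecified so a profile run is needed, `mixed` means colocated deployment, and `prefix_cache` means prefix caching is enabled.

```python
# Illustrative encoding of the startup-order truth table above
# (the real logic lives in engine.start(), not here).
def startup_order(profile: bool, mixed: bool, prefix_cache: bool):
    """Return (cache_then_worker, worker_then_cache) for a scenario."""
    # Cache service starts first only when the block count is already
    # specified (no profile run) and the deployment is not colocated.
    cache_then_worker = (not profile) and (not mixed)
    # Worker starts first when it must run the profile run in a
    # disaggregated deployment, or when a colocated deployment still
    # needs the cache service for prefix caching.
    worker_then_cache = ((not mixed) and profile) or (mixed and prefix_cache)
    return cache_then_worker, worker_then_cache

# Reproduce every row of the truth table:
# (profile, mixed, prefix_cache) -> (cache_then_worker, worker_then_cache)
table = [
    ((1, 1, 1), (0, 1)),
    ((1, 1, 0), (0, 0)),
    ((1, 0, 1), (0, 1)),
    ((1, 0, 0), (0, 1)),
    ((0, 1, 1), (0, 1)),
    ((0, 1, 0), (0, 0)),
    ((0, 0, 1), (1, 0)),
    ((0, 0, 0), (1, 0)),
]
for (p, m, c), expected in table:
    got = startup_order(bool(p), bool(m), bool(c))
    assert got == (bool(expected[0]), bool(expected[1]))
```

A row where both columns are 0 (colocated, prefix cache disabled) corresponds to case 3 above: no cache service is started at all.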

Usage or Command

```shell
python yiyanadapter/api_server.py \
       --port "9600" \
       --metrics-port "3113" \
       --model "$MODEL_PATH" \
       --engine-worker-queue-port "1477,1478,1479,1480,1481,1482,1483,1484" \
       --pd-comm-port "9800,9801,9802,9803,9804,9805,9806,9807" \
       --tensor-parallel-size 4 \
       --data-parallel-size 16 \
       --max-model-len 65536 \
       --enable-expert-parallel \
       --enable-chunked-prefill \
       --max-num-seqs 5 \
       --workers 1 \
       --scheduler-ttl 9000 \
       --scheduler-topic "test"  \
       --scheduler-host XXXX \
       --scheduler-port 6379 \
       --scheduler-password XXXX \
       --disable-custom-all-reduce \
       --splitwise-role "prefill" \
       --scheduler-name "dp" \
       --gpu-memory-utilization 0.9 \
       --quantization block_wise_fp8 \
       --max-num-batched-tokens 2048 \
       --cache-transfer-protocol "rdma" \
       --rdma-comm-ports "8501,8502,8503,8504,8505,8506,8507,8508" \
       --cache-queue-port "9978" \
       --ips $ip_list \
       --graph-optimization-config '{"use_cudagraph":false,"use_unique_memory_pool":true,"cudagraph_capture_sizes":[1]}' > log/console_fd.log 2>&1 &
```

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. If there are no unit tests, please state the reason in this PR.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Oct 30, 2025

Thanks for your contribution!

@liyonghua0910 liyonghua0910 changed the title from "[Cherry-pick] Fix profile run in pd-disaggregated deployment https://github.com/PaddlePaddle/FastDeploy/pull/4584" to "[Cherry-pick] Fix profile run in pd-disaggregated deployment" Oct 30, 2025
@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 9cf4005 into PaddlePaddle:release/2.3 Oct 31, 2025
42 of 49 checks passed
