
[Core] Support disaggregated prefill with Mooncake Transfer Engine #10884

Open · wants to merge 15 commits into base: main
Conversation

@ShangmingCai (Contributor) commented on Dec 4, 2024

We really appreciate @KuntaiDu for his remarkable work in supporting the disaggregated prefill feature in vLLM. Since PR #10502 has been merged, we have rebased and moved the Mooncake integration from PR #10728 to this PR.

This PR is related to #10727 and is a continuation of PR #10502; it uses Mooncake's Transfer Engine for KVCache transfer instead of NCCL.

Mooncake is a KVCache-centric disaggregated architecture for LLM serving. Transfer Engine is the core component of Mooncake; see the documentation for its design and API list.

Compared with NCCL, Mooncake Transfer Engine has the following features (a conceptual sketch follows the list):

  • a unified programming interface for data transfers between DRAM-to-DRAM (both local and remote), DRAM-to-GPU VRAM (both local and remote), and DRAM-to-remote NVMe devices
  • support for TCP, RDMA, and NVMe-of protocols
  • topology-aware path selection (see transfer_engine.md in our English docs), aggregating bandwidth from multiple NICs
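To make the "unified interface" point concrete, here is a purely hypothetical sketch. This is not Mooncake's actual API; every name below is illustrative. It shows the shape such an interface takes: one call path regardless of source/target medium or transport.

```python
# Purely illustrative: NOT Mooncake's actual API. One transfer call whose
# arguments describe the media and transport, so DRAM<->DRAM, DRAM<->VRAM,
# and DRAM->NVMe-oF all go through a single code path.
from dataclasses import dataclass
from enum import Enum


class Medium(Enum):
    DRAM = "dram"
    VRAM = "vram"
    NVME = "nvme"


@dataclass
class Segment:
    host: str      # "local" or a remote endpoint such as "10.0.0.2:13003"
    medium: Medium
    addr: int      # base address within a registered memory region
    length: int    # number of bytes to move


def transfer(src: Segment, dst: Segment, protocol: str = "rdma") -> None:
    """One entry point for all source/target combinations; a topology-aware
    engine would pick the NIC(s) and path, aggregating bandwidth."""
    raise NotImplementedError("illustrative sketch only")
```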

Like the current implementation in PR #10502, there are two roles: KV provider (e.g., the prefill vLLM instance) and KV consumer (e.g., the decode vLLM instance):

  • Provider side implements insert: insert a KV cache into a buffer so that it can be transferred upon request
  • Consumer side implements drop_select: select a KV cache based on tokens, transfer the selected KV, and drop it from the buffer

The two roles run on different machines (a sketch of this interface follows).
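For reference, a minimal sketch of the lookup-buffer abstraction these two roles implement, following the insert/drop_select design from PR #10502 that this PR's connector plugs into. Signatures are paraphrased from memory and may not match the code exactly.

```python
# Sketch of the lookup-buffer abstraction from PR #10502; signatures are
# paraphrased, not verbatim from vLLM.
from abc import ABC, abstractmethod
from typing import List, Optional

import torch


class KVLookupBufferBase(ABC):
    """Decouples KV production (prefill) from KV consumption (decode)."""

    @abstractmethod
    def insert(self, input_tokens: torch.Tensor, roi: torch.Tensor,
               key: torch.Tensor, value: torch.Tensor,
               hidden: torch.Tensor) -> None:
        """Provider side: stage the KV (and hidden states) for
        `input_tokens` so a consumer can fetch them on request."""
        ...

    @abstractmethod
    def drop_select(
            self, input_tokens: torch.Tensor,
            roi: torch.Tensor) -> List[Optional[torch.Tensor]]:
        """Consumer side: select the KV matching `input_tokens`, transfer
        it, and drop the entry from the buffer."""
        ...
```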

Integration guide: https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm-integration-v0.2-nightly.md
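As a taste of what the guide covers, here is a hedged sketch of a Mooncake transfer config of the shape it describes. All field names and values below are illustrative assumptions; the guide is authoritative. The guide then launches one prefill and one decode vLLM instance pointing at this file, with the KV role set to producer and consumer respectively (see the guide for the exact flags).

```python
# Illustrative only: a mooncake.json of the shape the integration guide
# describes. Field names/values here are assumptions; consult the guide.
import json

mooncake_config = {
    "prefill_url": "192.168.0.137:13003",     # KV provider (prefill) endpoint
    "decode_url": "192.168.0.139:13003",      # KV consumer (decode) endpoint
    "metadata_server": "192.168.0.139:2379",  # metadata service (e.g. etcd)
    "protocol": "rdma",                       # "rdma" or "tcp"
    "device_name": "erdma_0",                 # RDMA NIC(s); unused for TCP
}

with open("mooncake.json", "w") as f:
    json.dump(mooncake_config, f, indent=2)
```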

Benchmark results: https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm_benchmark_results.md New benchmark results will be added soon.

Test files will be added to align with the upcoming CI test pipeline from PR #10502.

CC list:
@KuntaiDu @youkaichao @alogfans @stmatengss @james0zan

github-actions (bot) commented on Dec 4, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@KuntaiDu (Collaborator) commented on Dec 6, 2024

Now working on an OSDI submission; will review after Dec 10.

@Jeffwan (Contributor) commented on Dec 8, 2024

This is a great demonstration of adopting Mooncake in the current disaggregation implementation. Could you share some benchmark data and best practices here? The Transfer Engine's primary features, like broader protocol support and topology-aware path selection, would be beneficial in larger-scale clusters. I am just curious how Mooncake performs in the simple 1P1D case or in isomorphic environments.

@ShangmingCai (Contributor, Author) commented on Dec 9, 2024

> This is a great demonstration of adopting Mooncake in the current disaggregation implementation. Could you share some benchmark data and best practices here? The Transfer Engine's primary features, like broader protocol support and topology-aware path selection, would be beneficial in larger-scale clusters. I am just curious how Mooncake performs in the simple 1P1D case or in isomorphic environments.

Here are some preview Mooncake benchmark results on A10 GPUs with up to 2 RDMA NICs. I am currently having trouble benchmarking PyNcclConnector: for reasons I have not yet identified, it crashes frequently in inter-node disaggregated scenarios. I am digging into the lookup_buffer and connector to find the root cause, but I haven't found it yet, so the results below do not include PyNcclConnector.

Varying tp (input length = 1024, qps = 2, output length = 6)

| Setting | num_rdma_nic | Successful Requests | Duration (s) | Total Input Tokens | Total Generated Tokens | Req Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tp = 1 | 2 | 200 | 99.47 | 201995 | 1200 | 2.01 | 12.06 | 2042.74 | 1056.76 | 635.00 | 4006.59 | 97.08 | 26.94 | 781.91 | 97.01 | 14.05 | 2205.51 |
| tp = 2 | 2 | 200 | 98.98 | 201995 | 1200 | 2.02 | 12.12 | 2052.95 | 314.87 | 231.20 | 949.40 | 25.65 | 15.56 | 129.60 | 25.62 | 15.48 | 288.06 |
| tp = 4 | 2 | 200 | 98.76 | 201995 | 1200 | 2.03 | 12.15 | 2057.44 | 198.10 | 160.03 | 461.61 | 23.52 | 18.93 | 94.38 | 23.50 | 18.01 | 187.79 |
| tp = 1 | 1 | 200 | 99.44 | 201995 | 1200 | 2.01 | 12.07 | 2043.39 | 1071.12 | 631.56 | 4361.02 | 83.93 | 26.93 | 794.75 | 83.86 | 14.13 | 1932.66 |
| tp = 2 | 1 | 200 | 98.96 | 201995 | 1200 | 2.02 | 12.13 | 2053.35 | 335.26 | 258.30 | 997.93 | 28.84 | 15.56 | 144.82 | 28.80 | 15.42 | 397.56 |
| tp = 4 | 1 | 200 | 98.78 | 201995 | 1200 | 2.02 | 12.15 | 2057.03 | 201.68 | 162.85 | 456.33 | 22.31 | 16.74 | 94.76 | 22.29 | 16.73 | 189.13 |
| tp = 1 | TCP | 200 | 99.55 | 201995 | 1200 | 2.01 | 12.05 | 2041.13 | 1414.05 | 766.23 | 6035.36 | 155.01 | 35.28 | 1191.24 | 154.91 | 14.32 | 3148.99 |
| tp = 2 | TCP | 200 | 98.97 | 201995 | 1200 | 2.02 | 12.12 | 2053.03 | 333.74 | 251.32 | 954.63 | 28.74 | 15.49 | 161.24 | 28.70 | 15.35 | 393.52 |
| tp = 4 | TCP | 200 | 98.78 | 201995 | 1200 | 2.02 | 12.15 | 2056.94 | 205.37 | 162.92 | 463.70 | 21.54 | 16.51 | 94.04 | 21.51 | 16.56 | 170.54 |

Varying qps (input length = 1024, tp = 4, output length = 6)

| Setting | num_rdma_nic | Successful Requests | Duration (s) | Total Input Tokens | Total Generated Tokens | Req Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qps = 2 | 2 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.33 | 200.64 | 156.62 | 478.22 | 22.63 | 17.35 | 99.61 | 22.60 | 17.08 | 186.25 |
| qps = 4 | 2 | 200 | 49.75 | 201995 | 1200 | 4.02 | 24.12 | 4084.03 | 341.88 | 240.68 | 1430.54 | 38.36 | 18.39 | 313.45 | 38.31 | 17.17 | 588.80 |
| qps = 6 | 2 | 200 | 33.44 | 201995 | 1200 | 5.98 | 35.88 | 6075.54 | 851.15 | 501.59 | 3239.89 | 102.51 | 47.67 | 606.77 | 102.34 | 18.35 | 1704.79 |
| qps = 8 | 2 | 200 | 27.16 | 201995 | 1200 | 7.36 | 44.19 | 7482.52 | 4835.08 | 5733.45 | 8846.27 | 1276.59 | 1150.11 | 4401.23 | 1274.43 | 48.34 | 20682.35 |
| qps = 2 | 1 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.31 | 201.77 | 161.53 | 473.44 | 22.13 | 16.52 | 96.18 | 22.11 | 16.51 | 190.40 |
| qps = 4 | 1 | 200 | 49.76 | 201995 | 1200 | 4.02 | 24.12 | 4083.83 | 337.31 | 243.38 | 1395.85 | 39.95 | 17.61 | 325.39 | 39.88 | 17.06 | 838.68 |
| qps = 6 | 1 | 200 | 33.44 | 201995 | 1200 | 5.98 | 35.88 | 6075.99 | 820.53 | 458.84 | 3169.52 | 83.92 | 30.50 | 663.07 | 83.78 | 17.85 | 1306.32 |
| qps = 8 | 1 | 200 | 27.19 | 201995 | 1200 | 7.36 | 44.14 | 7473.44 | 5291.91 | 6160.55 | 9596.56 | 1190.36 | 1040.63 | 4418.66 | 1188.33 | 47.61 | 20815.23 |
| qps = 2 | TCP | 200 | 98.76 | 201995 | 1200 | 2.03 | 12.15 | 2057.42 | 207.22 | 160.81 | 511.01 | 22.17 | 16.59 | 94.96 | 22.15 | 16.59 | 181.82 |
| qps = 4 | TCP | 200 | 49.79 | 201995 | 1200 | 4.02 | 24.10 | 4081.06 | 355.43 | 252.63 | 1554.91 | 40.15 | 16.92 | 314.28 | 40.09 | 16.66 | 708.50 |
| qps = 6 | TCP | 200 | 33.49 | 201995 | 1200 | 5.97 | 35.83 | 6067.71 | 907.74 | 514.85 | 3253.93 | 122.75 | 45.51 | 648.40 | 122.56 | 18.09 | 2282.92 |
| qps = 8 | TCP | 200 | 28.39 | 201995 | 1200 | 7.04 | 42.26 | 7156.09 | 6714.57 | 7885.09 | 11787.51 | 1116.06 | 408.32 | 4645.25 | 1114.29 | 46.87 | 21898.03 |

Varying input length (tp = 4, qps = 2, output length = 6)

| Setting | num_rdma_nic | Successful Requests | Duration (s) | Total Input Tokens | Total Generated Tokens | Req Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1024 | 2 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.32 | 195.47 | 151.55 | 482.84 | 22.83 | 19.27 | 96.55 | 22.81 | 18.12 | 158.16 |
| 2048 | 2 | 200 | 99.22 | 406707 | 1200 | 2.02 | 12.09 | 4110.95 | 723.76 | 488.67 | 2941.96 | 67.25 | 18.93 | 632.73 | 67.20 | 17.49 | 1209.54 |
| 4096 | 2 | 200 | 117.42 | 818415 | 1200 | 1.70 | 10.22 | 6979.90 | 14616.48 | 18323.82 | 23191.04 | 8042.84 | 7593.16 | 19851.11 | 8040.02 | 65.43 | 93511.26 |
| 8192 | 2 | 200 | 247.77 | 1636065 | 1200 | 0.81 | 4.84 | 6608.10 | 75783.36 | 79331.60 | 147544.42 | 16961.27 | 15140.11 | 39278.98 | 16958.32 | 90.01 | 186151.61 |
| 1024 | 1 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.31 | 201.77 | 161.53 | 473.44 | 22.13 | 16.52 | 96.18 | 22.11 | 16.51 | 190.40 |
| 2048 | 1 | 200 | 99.25 | 406707 | 1200 | 2.02 | 12.09 | 4109.96 | 719.43 | 482.02 | 3208.13 | 61.92 | 17.64 | 681.26 | 61.86 | 16.83 | 978.90 |
| 4096 | 1 | 200 | 111.88 | 818415 | 1200 | 1.79 | 10.73 | 7326.16 | 20362.10 | 22807.05 | 31853.55 | 5915.16 | 4521.51 | 18739.12 | 5913.18 | 67.03 | 81600.29 |
| 8192 | 1 | 200 | 270.01 | 1636065 | 1200 | 0.74 | 4.44 | 6063.79 | 103355.40 | 106546.65 | 172025.11 | 12894.35 | 11027.66 | 35110.13 | 12892.85 | 64.84 | 151774.68 |
| 1024 | TCP | 200 | 98.81 | 201995 | 1200 | 2.02 | 12.14 | 2056.44 | 203.32 | 160.83 | 460.90 | 21.81 | 16.96 | 95.27 | 21.78 | 16.91 | 171.80 |
| 2048 | TCP | 200 | 99.27 | 406707 | 1200 | 2.01 | 12.09 | 4108.98 | 731.60 | 484.78 | 3213.69 | 68.55 | 17.88 | 639.93 | 68.49 | 17.33 | 1257.45 |
| 4096 | TCP | 200 | 118.37 | 818415 | 1200 | 1.69 | 10.14 | 6923.89 | 23735.69 | 27101.97 | 36573.47 | 6386.62 | 5102.00 | 20032.26 | 6384.71 | 69.57 | 92811.27 |
| 8192 | TCP | 200 | 278.12 | 1636065 | 1200 | 0.72 | 4.31 | 5886.95 | 106873.23 | 109941.33 | 179781.64 | 13360.87 | 12155.24 | 36022.96 | 13359.20 | 68.01 | 156716.38 |

As for best practices, I believe there is no real best practice until XpYd is ready. But if you want to test the Mooncake Transfer Engine, you can follow the integration guide above to reproduce these results.

In addition, we are coordinating resources to set up machines with more RDMA NICs and more advanced GPUs. Official benchmark results will be released in due course.
