
[Core] Support disaggregated prefill with Mooncake Transfer Engine #10884

Open · wants to merge 15 commits into base: main
Conversation

@ShangmingCai (Contributor) commented on Dec 4, 2024

We really appreciate @KuntaiDu for his remarkable work in supporting the disaggregated prefill feature in vLLM. Since PR #10502 has been merged, we have rebased and moved the Mooncake integration from PR #10728 to this PR.

This PR is related to #10727 and is a continuation of PR #10502; it uses Mooncake's Transfer Engine for KVCache transfer instead of NCCL.

Mooncake is a KVCache-centric disaggregated architecture for LLM serving. Transfer Engine is the core component of Mooncake; see the documentation for its design and API list.

Compared with NCCL, Mooncake Transfer Engine has the following features (a conceptual sketch follows the list):

  • a unified programming interface for data transfers between DRAM-to-DRAM (both local and remote), DRAM-to-GPU VRAM (both local and remote), and DRAM-to-remote NVMe devices
  • support for TCP, RDMA, and NVMe-of protocols
  • topology-aware path selection (see transfer_engine.md in our English docs), aggregating bandwidth from multiple NICs
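To make the "unified interface" point concrete, here is a purely hypothetical sketch. This is not Mooncake's actual API; every name below is illustrative. It shows the shape such an interface takes: one call path regardless of source/target medium or transport.

```python
# Purely illustrative: NOT Mooncake's actual API. One transfer call whose
# arguments describe the media and transport, so DRAM<->DRAM, DRAM<->VRAM,
# and DRAM->NVMe-oF all go through a single code path.
from dataclasses import dataclass
from enum import Enum


class Medium(Enum):
    DRAM = "dram"
    VRAM = "vram"
    NVME = "nvme"


@dataclass
class Segment:
    host: str      # "local" or a remote endpoint such as "10.0.0.2:13003"
    medium: Medium
    addr: int      # base address within a registered memory region
    length: int    # number of bytes to move


def transfer(src: Segment, dst: Segment, protocol: str = "rdma") -> None:
    """One entry point for all source/target combinations; a topology-aware
    engine would pick the NIC(s) and path, aggregating bandwidth."""
    raise NotImplementedError("illustrative sketch only")
```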

Like the current implementation in PR #10502, there are two roles: KV provider (e.g., the prefill vLLM instance) and KV consumer (e.g., the decode vLLM instance):

  • Provider side implements insert: insert a KV cache into a buffer so that it can be transferred upon request
  • Consumer side implements drop_select: select a KV cache based on tokens, transfer the selected KV, and drop it from the buffer

The two roles run on different machines (a sketch of this interface follows).
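For reference, a minimal sketch of the lookup-buffer abstraction these two roles implement, following the insert/drop_select design from PR #10502 that this PR's connector plugs into. Signatures are paraphrased from memory and may not match the code exactly.

```python
# Sketch of the lookup-buffer abstraction from PR #10502; signatures are
# paraphrased, not verbatim from vLLM.
from abc import ABC, abstractmethod
from typing import List, Optional

import torch


class KVLookupBufferBase(ABC):
    """Decouples KV production (prefill) from KV consumption (decode)."""

    @abstractmethod
    def insert(self, input_tokens: torch.Tensor, roi: torch.Tensor,
               key: torch.Tensor, value: torch.Tensor,
               hidden: torch.Tensor) -> None:
        """Provider side: stage the KV (and hidden states) for
        `input_tokens` so a consumer can fetch them on request."""
        ...

    @abstractmethod
    def drop_select(
            self, input_tokens: torch.Tensor,
            roi: torch.Tensor) -> List[Optional[torch.Tensor]]:
        """Consumer side: select the KV matching `input_tokens`, transfer
        it, and drop the entry from the buffer."""
        ...
```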

Integration guide: https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm-integration-v0.2-nightly.md
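As a taste of what the guide covers, here is a hedged sketch of a Mooncake transfer config of the shape it describes. All field names and values below are illustrative assumptions; the guide is authoritative. The guide then launches one prefill and one decode vLLM instance pointing at this file, with the KV role set to producer and consumer respectively (see the guide for the exact flags).

```python
# Illustrative only: a mooncake.json of the shape the integration guide
# describes. Field names/values here are assumptions; consult the guide.
import json

mooncake_config = {
    "prefill_url": "192.168.0.137:13003",     # KV provider (prefill) endpoint
    "decode_url": "192.168.0.139:13003",      # KV consumer (decode) endpoint
    "metadata_server": "192.168.0.139:2379",  # metadata service (e.g. etcd)
    "protocol": "rdma",                       # "rdma" or "tcp"
    "device_name": "erdma_0",                 # RDMA NIC(s); unused for TCP
}

with open("mooncake.json", "w") as f:
    json.dump(mooncake_config, f, indent=2)
```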

Benchmark results: https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm_benchmark_results.md New benchmark results will be added soon.

Test files will be added to align with the upcoming CI test pipeline from PR #10502.

CC list:
@KuntaiDu @youkaichao @alogfans @stmatengss @james0zan

github-actions (bot) commented on Dec 4, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@KuntaiDu (Collaborator) commented on Dec 6, 2024

Now working on an OSDI submission; will review after Dec 10.

@Jeffwan (Contributor) commented on Dec 8, 2024

This is a great demonstration of adopting Mooncake in the current disaggregation implementation. Could you share some benchmark data and best practices here? The Transfer Engine's primary features, like broader protocol support and topology-aware path selection, would be beneficial in larger-scale clusters. I am just curious how Mooncake performs in the simple 1P1D case or in isomorphic environments.

@ShangmingCai (Contributor, Author) commented on Dec 9, 2024

> This is a great demonstration of adopting Mooncake in the current disaggregation implementation. Could you share some benchmark data and best practices here? The Transfer Engine's primary features, like broader protocol support and topology-aware path selection, would be beneficial in larger-scale clusters. I am just curious how Mooncake performs in the simple 1P1D case or in isomorphic environments.

Here are some preview Mooncake benchmark results on A10 GPUs with up to 2 RDMA NICs. I am currently having trouble benchmarking PyNcclConnector: for reasons I have not yet identified, it crashes frequently in inter-node disaggregated scenarios. I am digging into the lookup_buffer and connector to find the root cause, but I haven't found it yet, so the results below do not include PyNcclConnector.

Varying tp (input length = 1024, qps = 2, output length = 6)

| Setting | num_rdma_nic | Successful Requests | Duration (s) | Total Input Tokens | Total Generated Tokens | Req Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tp = 1 | 2 | 200 | 99.47 | 201995 | 1200 | 2.01 | 12.06 | 2042.74 | 1056.76 | 635.00 | 4006.59 | 97.08 | 26.94 | 781.91 | 97.01 | 14.05 | 2205.51 |
| tp = 2 | 2 | 200 | 98.98 | 201995 | 1200 | 2.02 | 12.12 | 2052.95 | 314.87 | 231.20 | 949.40 | 25.65 | 15.56 | 129.60 | 25.62 | 15.48 | 288.06 |
| tp = 4 | 2 | 200 | 98.76 | 201995 | 1200 | 2.03 | 12.15 | 2057.44 | 198.10 | 160.03 | 461.61 | 23.52 | 18.93 | 94.38 | 23.50 | 18.01 | 187.79 |
| tp = 1 | 1 | 200 | 99.44 | 201995 | 1200 | 2.01 | 12.07 | 2043.39 | 1071.12 | 631.56 | 4361.02 | 83.93 | 26.93 | 794.75 | 83.86 | 14.13 | 1932.66 |
| tp = 2 | 1 | 200 | 98.96 | 201995 | 1200 | 2.02 | 12.13 | 2053.35 | 335.26 | 258.30 | 997.93 | 28.84 | 15.56 | 144.82 | 28.80 | 15.42 | 397.56 |
| tp = 4 | 1 | 200 | 98.78 | 201995 | 1200 | 2.02 | 12.15 | 2057.03 | 201.68 | 162.85 | 456.33 | 22.31 | 16.74 | 94.76 | 22.29 | 16.73 | 189.13 |
| tp = 1 | TCP | 200 | 99.55 | 201995 | 1200 | 2.01 | 12.05 | 2041.13 | 1414.05 | 766.23 | 6035.36 | 155.01 | 35.28 | 1191.24 | 154.91 | 14.32 | 3148.99 |
| tp = 2 | TCP | 200 | 98.97 | 201995 | 1200 | 2.02 | 12.12 | 2053.03 | 333.74 | 251.32 | 954.63 | 28.74 | 15.49 | 161.24 | 28.70 | 15.35 | 393.52 |
| tp = 4 | TCP | 200 | 98.78 | 201995 | 1200 | 2.02 | 12.15 | 2056.94 | 205.37 | 162.92 | 463.70 | 21.54 | 16.51 | 94.04 | 21.51 | 16.56 | 170.54 |

Varying qps (input length = 1024, tp = 4, output length = 6)

| Setting | num_rdma_nic | Successful Requests | Duration (s) | Total Input Tokens | Total Generated Tokens | Req Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qps = 2 | 2 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.33 | 200.64 | 156.62 | 478.22 | 22.63 | 17.35 | 99.61 | 22.60 | 17.08 | 186.25 |
| qps = 4 | 2 | 200 | 49.75 | 201995 | 1200 | 4.02 | 24.12 | 4084.03 | 341.88 | 240.68 | 1430.54 | 38.36 | 18.39 | 313.45 | 38.31 | 17.17 | 588.80 |
| qps = 6 | 2 | 200 | 33.44 | 201995 | 1200 | 5.98 | 35.88 | 6075.54 | 851.15 | 501.59 | 3239.89 | 102.51 | 47.67 | 606.77 | 102.34 | 18.35 | 1704.79 |
| qps = 8 | 2 | 200 | 27.16 | 201995 | 1200 | 7.36 | 44.19 | 7482.52 | 4835.08 | 5733.45 | 8846.27 | 1276.59 | 1150.11 | 4401.23 | 1274.43 | 48.34 | 20682.35 |
| qps = 2 | 1 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.31 | 201.77 | 161.53 | 473.44 | 22.13 | 16.52 | 96.18 | 22.11 | 16.51 | 190.40 |
| qps = 4 | 1 | 200 | 49.76 | 201995 | 1200 | 4.02 | 24.12 | 4083.83 | 337.31 | 243.38 | 1395.85 | 39.95 | 17.61 | 325.39 | 39.88 | 17.06 | 838.68 |
| qps = 6 | 1 | 200 | 33.44 | 201995 | 1200 | 5.98 | 35.88 | 6075.99 | 820.53 | 458.84 | 3169.52 | 83.92 | 30.50 | 663.07 | 83.78 | 17.85 | 1306.32 |
| qps = 8 | 1 | 200 | 27.19 | 201995 | 1200 | 7.36 | 44.14 | 7473.44 | 5291.91 | 6160.55 | 9596.56 | 1190.36 | 1040.63 | 4418.66 | 1188.33 | 47.61 | 20815.23 |
| qps = 2 | TCP | 200 | 98.76 | 201995 | 1200 | 2.03 | 12.15 | 2057.42 | 207.22 | 160.81 | 511.01 | 22.17 | 16.59 | 94.96 | 22.15 | 16.59 | 181.82 |
| qps = 4 | TCP | 200 | 49.79 | 201995 | 1200 | 4.02 | 24.10 | 4081.06 | 355.43 | 252.63 | 1554.91 | 40.15 | 16.92 | 314.28 | 40.09 | 16.66 | 708.50 |
| qps = 6 | TCP | 200 | 33.49 | 201995 | 1200 | 5.97 | 35.83 | 6067.71 | 907.74 | 514.85 | 3253.93 | 122.75 | 45.51 | 648.40 | 122.56 | 18.09 | 2282.92 |
| qps = 8 | TCP | 200 | 28.39 | 201995 | 1200 | 7.04 | 42.26 | 7156.09 | 6714.57 | 7885.09 | 11787.51 | 1116.06 | 408.32 | 4645.25 | 1114.29 | 46.87 | 21898.03 |

Varying input length (tp = 4, qps = 2, output length = 6)

| Setting | num_rdma_nic | Successful Requests | Duration (s) | Total Input Tokens | Total Generated Tokens | Req Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1024 | 2 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.32 | 195.47 | 151.55 | 482.84 | 22.83 | 19.27 | 96.55 | 22.81 | 18.12 | 158.16 |
| 2048 | 2 | 200 | 99.22 | 406707 | 1200 | 2.02 | 12.09 | 4110.95 | 723.76 | 488.67 | 2941.96 | 67.25 | 18.93 | 632.73 | 67.20 | 17.49 | 1209.54 |
| 4096 | 2 | 200 | 117.42 | 818415 | 1200 | 1.70 | 10.22 | 6979.90 | 14616.48 | 18323.82 | 23191.04 | 8042.84 | 7593.16 | 19851.11 | 8040.02 | 65.43 | 93511.26 |
| 8192 | 2 | 200 | 247.77 | 1636065 | 1200 | 0.81 | 4.84 | 6608.10 | 75783.36 | 79331.60 | 147544.42 | 16961.27 | 15140.11 | 39278.98 | 16958.32 | 90.01 | 186151.61 |
| 1024 | 1 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.31 | 201.77 | 161.53 | 473.44 | 22.13 | 16.52 | 96.18 | 22.11 | 16.51 | 190.40 |
| 2048 | 1 | 200 | 99.25 | 406707 | 1200 | 2.02 | 12.09 | 4109.96 | 719.43 | 482.02 | 3208.13 | 61.92 | 17.64 | 681.26 | 61.86 | 16.83 | 978.90 |
| 4096 | 1 | 200 | 111.88 | 818415 | 1200 | 1.79 | 10.73 | 7326.16 | 20362.10 | 22807.05 | 31853.55 | 5915.16 | 4521.51 | 18739.12 | 5913.18 | 67.03 | 81600.29 |
| 8192 | 1 | 200 | 270.01 | 1636065 | 1200 | 0.74 | 4.44 | 6063.79 | 103355.40 | 106546.65 | 172025.11 | 12894.35 | 11027.66 | 35110.13 | 12892.85 | 64.84 | 151774.68 |
| 1024 | TCP | 200 | 98.81 | 201995 | 1200 | 2.02 | 12.14 | 2056.44 | 203.32 | 160.83 | 460.90 | 21.81 | 16.96 | 95.27 | 21.78 | 16.91 | 171.80 |
| 2048 | TCP | 200 | 99.27 | 406707 | 1200 | 2.01 | 12.09 | 4108.98 | 731.60 | 484.78 | 3213.69 | 68.55 | 17.88 | 639.93 | 68.49 | 17.33 | 1257.45 |
| 4096 | TCP | 200 | 118.37 | 818415 | 1200 | 1.69 | 10.14 | 6923.89 | 23735.69 | 27101.97 | 36573.47 | 6386.62 | 5102.00 | 20032.26 | 6384.71 | 69.57 | 92811.27 |
| 8192 | TCP | 200 | 278.12 | 1636065 | 1200 | 0.72 | 4.31 | 5886.95 | 106873.23 | 109941.33 | 179781.64 | 13360.87 | 12155.24 | 36022.96 | 13359.20 | 68.01 | 156716.38 |

As for best practices, I believe there is no real best practice until XpYd is ready. But if you want to test the Mooncake Transfer Engine, you can follow the integration guide above to reproduce these results.

In addition, we are coordinating resources to set up machines with more RDMA NICs and more advanced GPUs. Official benchmark results will be released in due course.
