[Core] Support disaggregated prefill with Mooncake Transfer Engine #10884
Conversation
Now working on an OSDI submission, will review after Dec 10.
This is a great demonstration of adopting Mooncake in the current disaggregation implementation. Could you share some benchmark data and best practices here? Transfer Engine's primary features, such as support for more protocols and topology-aware path selection, would be beneficial in larger-scale clusters. I am just curious how Mooncake performs in the simple 1P1D case or in isomorphic environments.
Here are some preview Mooncake benchmark results on A10 GPUs with up to 2 RDMA NICs. I am currently having some trouble benchmarking
[benchmark chart] Varying tp (input length = 1024, qps = 2, output length = 6)
[benchmark chart] Varying qps (input length = 1024, tp = 4, output length = 6)
[benchmark chart] Varying input length (tp = 4, qps = 2, output length = 6)
As for best practices, I believe there is no best practice before XpYd is ready. But if you want to test the Mooncake Transfer Engine, you can follow the guidance doc to reproduce the results. In addition, we are coordinating resources to set up machines with more RDMA NICs and more advanced GPUs; the official benchmark results will be released in due time.
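For anyone who wants to try the reproduction steps before the official numbers land, here is a minimal sketch of what the Transfer Engine configuration step might look like. The file name `mooncake.json`, the field names, and the example values are assumptions based on my reading of the integration guide linked below, not the definitive format; please treat the guide as authoritative.

```python
import json

# Hypothetical sketch: write a Mooncake Transfer Engine config file that both the
# prefill (KV provider) and decode (KV consumer) vLLM instances can point at.
# All field names and values below are illustrative assumptions.
mooncake_config = {
    "prefill_url": "192.168.0.1:13003",     # address of the KV provider (prefill) instance
    "decode_url": "192.168.0.2:13003",      # address of the KV consumer (decode) instance
    "metadata_server": "192.168.0.2:2379",  # metadata service used for handshaking
    "protocol": "rdma",                     # transfer protocol, e.g. "rdma" or "tcp"
    "device_name": "mlx5_0",                # RDMA NIC to use when protocol is "rdma"
}

with open("mooncake.json", "w") as f:
    json.dump(mooncake_config, f, indent=2)
```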
We really appreciate @KuntaiDu's remarkable work on supporting the disaggregated prefill feature in vLLM. Since PR #10502 has been merged, after rebasing we have moved the Mooncake integration from PR #10728 to this PR.
This PR is related to #10727 and is a continuation of PR #10502; it uses Mooncake's Transfer Engine instead of NCCL for KVCache transfer.
Mooncake is a KVCache-centric disaggregated architecture for LLM serving. Transfer Engine is the core component of Mooncake; see the documentation for its design and API list.
Compared with NCCL, Mooncake Transfer Engine provides additional capabilities such as support for more transfer protocols and topology-aware path selection.
Like the current implementation of PR #10502, there are two roles: KV provider (e.g. prefill vLLM instance) and KV consumer (e.g. decode vLLM instance), which communicate through a lookup buffer with two primitives (see the sketch after this list):
- `insert`: insert a KV cache into the buffer so that it can be transferred upon request.
- `drop_select`: select a KV cache based on tokens, transfer the selected KV, and drop it from the buffer.
Both roles run on different machines.
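To make the two primitives concrete, here is a minimal, hypothetical sketch of the buffer interface they describe; the class name and method bodies are illustrative only and are not the actual implementation in this PR.

```python
from typing import Any, List, Optional, Tuple


class LookupBufferSketch:
    """Illustrative KV lookup buffer with the insert/drop_select semantics above."""

    def __init__(self) -> None:
        # Each entry pairs the prompt tokens with the KV cache produced for them.
        self._entries: List[Tuple[Tuple[int, ...], Any]] = []

    def insert(self, tokens: Tuple[int, ...], kv_cache: Any) -> None:
        # KV provider (prefill instance): stage a KV cache so that it can be
        # transferred to the consumer upon request.
        self._entries.append((tokens, kv_cache))

    def drop_select(self, tokens: Tuple[int, ...]) -> Optional[Any]:
        # KV consumer (decode instance): select the KV cache matching the given
        # tokens, remove it from the buffer, and return it for transfer, so each
        # entry is handed out exactly once.
        for i, (stored_tokens, kv_cache) in enumerate(self._entries):
            if stored_tokens == tokens:
                del self._entries[i]
                return kv_cache
        return None
```

In the disaggregated setup the provider and consumer sides of this buffer live in separate vLLM instances, with the Transfer Engine moving the selected KV across machines.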
Integration guide: https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm-integration-v0.2-nightly.md
Benchmark result: https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm_benchmark_results.md
New benchmark results will be added soon.
Test files will be added to align with the future test CI pipeline for PR #10502.
CC List: @KuntaiDu @youkaichao @alogfans @stmatengss @james0zan