Add prefix cache aware scheduling #768


Draft: liu-cong wants to merge 1 commit into main

Conversation

@liu-cong (Contributor) commented May 1, 2025

This is the implementation of the proposal in #602

This PR implements a Scheduler V2, gated behind an EXPERIMENTAL_USE_SCHEDULER_V2 env var to highlight its experimental status. Follow-up work is required to converge this with the V1 scheduler.
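
For context, the rough idea from #602: the request prompt is broken into fixed-size blocks, the block prefixes are hashed, an index remembers which servers have recently seen which block hashes, and candidate servers are scored by how much of the request's prefix they are likely to have cached. The snippet below is only an illustrative sketch of that idea; the types, hashing scheme, and scoring formula are stand-ins, not this PR's actual code.

// Illustrative sketch only; names, the hashing scheme, and the scoring
// formula are stand-ins for the PR's actual implementation.
package main

import (
    "crypto/sha256"
    "fmt"
)

type BlockHash [32]byte // hash identifying a prompt prefix up to a block boundary
type ServerID string    // identifies a model server pod

// hashPrompt splits the prompt into blockSize-byte blocks and chains the hashes,
// so each hash identifies the entire prefix up to and including that block.
func hashPrompt(prompt string, blockSize int) []BlockHash {
    var hashes []BlockHash
    var prev BlockHash
    for start := 0; start+blockSize <= len(prompt); start += blockSize {
        h := sha256.Sum256(append(prev[:], prompt[start:start+blockSize]...))
        hashes = append(hashes, h)
        prev = h
    }
    return hashes
}

// scoreServer returns the fraction of leading blocks the server is believed to
// have cached, according to an index of block hash -> servers that saw it.
func scoreServer(hashes []BlockHash, server ServerID, index map[BlockHash]map[ServerID]struct{}) float64 {
    if len(hashes) == 0 {
        return 0
    }
    matched := 0
    for _, h := range hashes {
        if _, ok := index[h][server]; !ok {
            break // the prefix match ends at the first miss
        }
        matched++
    }
    return float64(matched) / float64(len(hashes))
}

func main() {
    index := map[BlockHash]map[ServerID]struct{}{}
    system := "You are a helpful assistant. Answer concisely and cite sources."

    // Pretend pod-0 already served a request sharing the same system prompt.
    for _, h := range hashPrompt(system+" Question B?", 16) {
        if index[h] == nil {
            index[h] = map[ServerID]struct{}{}
        }
        index[h]["pod-0"] = struct{}{}
    }

    hashes := hashPrompt(system+" Question A?", 16)
    fmt.Printf("pod-0: %.2f, pod-1: %.2f\n",
        scoreServer(hashes, "pod-0", index), scoreServer(hashes, "pod-1", index))
}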

Initial Benchmark Results

The initial benchmark results show that prefix aware scheduling can significantly reduce TTFT. Will follow up with more detailed results, and with the cost of enabling prefix aware scheduling.

Benchmark Setup

Model server: vLLM 0.8.3 with --enable-prefix-caching, base model meta-llama/Llama-2-7b-hf, on 4 H100 80GB.

EPP baseline image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v20250329-79fedb5

Benchmark tool: SGLang benchmark tool, using the 'generated-shared-prefix' dataset.

High Prefix Cache Hit Ratio (system-prompt-len=3000, question-len=128)

python3 sglang/bench_serving.py --host=${IP} --port=${PORT} \
--dataset-name='generated-shared-prefix' --model=meta-llama/Llama-2-7b-hf --tokenizer=meta-llama/Llama-2-7b-hf \
--request-rate=20 --backend=vllm  \
--gsp-num-groups=64 \
--gsp-prompts-per-group=32 \
--gsp-system-prompt-len=3000 \
--gsp-question-len=128 \
--gsp-output-len=256 \
--max-concurrency=200 \
--output-file=prefix.json

Generated shared prefix dataset statistics:
Number of groups: 64
Prompts per group: 32
Total prompts: 2048
Total input tokens: 6702793
Total output tokens: 524288
Average system prompt length: 3136.7 tokens
Average question length: 135.1 tokens
  • Baseline results
============ Serving Benchmark Result ============
Backend:                                 vllm      
Traffic request rate:                    20.0      
Max request concurrency:                 200       
Successful requests:                     2048      
Benchmark duration (s):                  126.99    
Total input tokens:                      6702793   
Total generated tokens:                  524288    
Total generated tokens (retokenized):    520128    
Request throughput (req/s):              16.13     
Input token throughput (tok/s):          52783.51  
Output token throughput (tok/s):         4128.69   
Total token throughput (tok/s):          56912.20  
Concurrency:                             183.71    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   11390.76  
Median E2E Latency (ms):                 11026.09  
---------------Time to First Token----------------
Mean TTFT (ms):                          1248.07   
Median TTFT (ms):                        592.66    
P99 TTFT (ms):                           7846.83   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           51.34     
Median ITL (ms):                         35.09     
P95 ITL (ms):                            115.64    
P99 ITL (ms):                            231.14    
Max ITL (ms):                            5477.88   
==================================================
  • Prefix cache aware scheduling results
============ Serving Benchmark Result ============
Backend:                                 vllm      
Traffic request rate:                    20.0      
Max request concurrency:                 200       
Successful requests:                     2047      
Benchmark duration (s):                  111.24    
Total input tokens:                      6699507   
Total generated tokens:                  524032    
Total generated tokens (retokenized):    519614    
Request throughput (req/s):              18.40     
Input token throughput (tok/s):          60225.31  
Output token throughput (tok/s):         4710.79   
Total token throughput (tok/s):          64936.10  
Concurrency:                             139.03    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   7555.45   
Median E2E Latency (ms):                 7023.81   
---------------Time to First Token----------------
Mean TTFT (ms):                          313.49    
Median TTFT (ms):                        250.95    
P99 TTFT (ms):                           1260.57   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           36.65     
Median ITL (ms):                         29.03     
P95 ITL (ms):                            85.20     
P99 ITL (ms):                            124.13    
Max ITL (ms):                            20427.40  
==================================================

Low Prefix Cache Hit Ratio (system-prompt-len=128, question-len=3000)

python3 sglang/bench_serving.py --host=${IP} --port=${PORT} \
--dataset-name='generated-shared-prefix' --model=meta-llama/Llama-2-7b-hf --tokenizer=meta-llama/Llama-2-7b-hf \
--request-rate=20 --backend=vllm  \
--gsp-num-groups=64 \
--gsp-prompts-per-group=32 \
--gsp-system-prompt-len=128 \
--gsp-question-len=3000 \
--gsp-output-len=256 \
--max-concurrency=200 \
--output-file=prefix.json

Generated shared prefix dataset statistics:
Number of groups: 64
Prompts per group: 32
Total prompts: 2048
Total input tokens: 6708932
Total output tokens: 524288
Average system prompt length: 135.5 tokens
Average question length: 3139.2 tokens
  • Baseline results
============ Serving Benchmark Result ============
Backend:                                 vllm      
Traffic request rate:                    20.0      
Max request concurrency:                 200       
Successful requests:                     2048      
Benchmark duration (s):                  153.06    
Total input tokens:                      6708932   
Total generated tokens:                  524288    
Total generated tokens (retokenized):    518178    
Request throughput (req/s):              13.38     
Input token throughput (tok/s):          43831.71  
Output token throughput (tok/s):         3425.35   
Total token throughput (tok/s):          47257.06  
Concurrency:                             185.88    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   13892.45  
Median E2E Latency (ms):                 14001.94  
---------------Time to First Token----------------
Mean TTFT (ms):                          4442.82   
Median TTFT (ms):                        4404.11   
P99 TTFT (ms):                           8979.85   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           44.82     
Median ITL (ms):                         28.51     
P95 ITL (ms):                            129.06    
P99 ITL (ms):                            229.93    
Max ITL (ms):                            5414.36   
==================================================
  • Prefix aware scheduling results
============ Serving Benchmark Result ============
Backend:                                 vllm      
Traffic request rate:                    20.0      
Max request concurrency:                 200       
Successful requests:                     1467      
Benchmark duration (s):                  111.71    
Total input tokens:                      4806132   
Total generated tokens:                  375552    
Total generated tokens (retokenized):    371345    
Request throughput (req/s):              13.13     
Input token throughput (tok/s):          43024.53  
Output token throughput (tok/s):         3361.94   
Total token throughput (tok/s):          46386.48  
Concurrency:                             122.12    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   9298.94   
Median E2E Latency (ms):                 9072.91   
---------------Time to First Token----------------
Mean TTFT (ms):                          899.32    
Median TTFT (ms):                        621.78    
P99 TTFT (ms):                           6232.01   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           39.96     
Median ITL (ms):                         27.09     
P95 ITL (ms):                            95.30     
P99 ITL (ms):                            224.41    
Max ITL (ms):                            5133.16   
==================================================

@k8s-ci-robot (Contributor)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 1, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: liu-cong
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 1, 2025

netlify bot commented May 1, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit 415a624
🔍 Latest deploy log https://app.netlify.com/sites/gateway-api-inference-extension/deploys/68139ade3d65f8000776191b
😎 Deploy Preview https://deploy-preview-768--gateway-api-inference-extension.netlify.app

@@ -171,6 +190,10 @@ func run() error {
    datastore := datastore.NewDatastore(ctx, pmf)

    scheduler := scheduling.NewScheduler(datastore)
    if schedulerV2 == "true" {
        setupLog.Info("Creating scheduler with prefixCache plugin", "prefix cache config", prefixCacheConfig)
        scheduler = scheduling.NewSchedulerV2(datastore, prefixCacheConfig)
Contributor:

I know it’s WIP, but you may consider using func NewSchedulerWithConfig to reduce the noise (no need for NewSchedulerV2).
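
For illustration only, the suggestion might look roughly like this in run(); the exact signature of NewSchedulerWithConfig and the config value passed to it are assumptions, not the project's actual API:

scheduler := scheduling.NewScheduler(datastore)
if schedulerV2 == "true" {
    setupLog.Info("Creating scheduler with prefixCache plugin", "prefix cache config", prefixCacheConfig)
    // Hypothetical: reuse the existing config-based constructor instead of adding a
    // separate NewSchedulerV2; configWithPrefixPlugin stands in for whatever config
    // registers the prefix-cache scorer and its weight.
    scheduler = scheduling.NewSchedulerWithConfig(datastore, configWithPrefixPlugin)
}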

// If a request was routed to a server, record it in the cache:
func (m *plugin) PostSchedule(ctx *types.SchedulingContext, res *types.Result) {
    targetPod := res.TargetPod.GetPod()
    m.indexer.Add(ctx.PrefixHashes, types.ServerID(targetPod.NamespacedName))
Contributor:

This code assumes the request was sent successfully? PostSchedule runs after the target pod is selected but before the request is sent to that pod. The request might fail to be sent to the target pod for various reasons, and there might be changes (like using a fallback pod). I think this should happen post-response instead of post-schedule, with the actual pod that received the request.
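
Roughly what that alternative could look like; the PostResponse hook name, its arguments, and how the serving pod is determined are assumptions for illustration, not an existing API in this repo:

// Hypothetical: record the prefix -> server mapping only after a response came back,
// using the pod that actually served the request rather than the scheduled target.
func (m *plugin) PostResponse(ctx *types.SchedulingContext, servingPod types.ServerID) {
    m.indexer.Add(ctx.PrefixHashes, servingPod)
}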

Contributor Author:

In practice I don't see how much difference it makes, given that the likelihood of a request failing is low. Using PostSchedule also has the advantage of avoiding "head of line blocking" when EPP just starts or receives a burst of requests, since it doesn't need to wait for a response.

Contributor Author:

Even if we guessed wrong (e.g., the request failed or a fallback endpoint is picked), the cost is that the next request is sent to a server which may not have the prefix cached. However, after that the prefix should be cached, and following requests should still get the cache hit.

@@ -58,6 +59,10 @@ type SchedulingContext struct {
    Logger       logr.Logger
    Req          *LLMRequest
    PodsSnapshot []Pod
    // PrefixHashes is a list of prefix hashes of the request prompt broken into blocks.
    PrefixHashes []BlockHash
Contributor:

Not sure this is the right place for these fields. It looks very prefix-specific to be stored in a general-purpose scheduling context. What if I don't want to use the prefix plugin?

Contributor Author:

These are "caches" in the life of a scheduling request. We use them at various points of the prefix plugins (Score and PostSchedule). IMP SchedulingContext is the place to share contextual info for the life of a scheduling request.

We can make this more structured maybe? Such as a map of plugin name to plugin specific contextual data.
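
A rough illustration of that more structured option, mirroring the struct above; the PluginState field name and shape are hypothetical, not part of this PR:

type SchedulingContext struct {
    Logger       logr.Logger
    Req          *LLMRequest
    PodsSnapshot []Pod
    // PluginState holds opaque, per-request data owned by individual plugins, keyed
    // by plugin name (e.g. the prefix plugin's block hashes), instead of adding
    // plugin-specific fields such as PrefixHashes directly to the context.
    PluginState map[string]any
}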

Contributor:

Why not save the request data in the plugin itself? This data is relevant only to the prefix scorer. In PreSchedule you can initialize the prefix state, and in PostSchedule you can clean it up.

Contributor Author:

RE: storing the state per plugin

I can see this as a plausible solution. As you mentioned, the plugin then needs to manage the lifecycle of this "cache", which adds complexity. Concretely this will likely be a map of request ID to its state plus a mutex to synchronize access. It's additional complexity, but not crazy.
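
A rough sketch of that plugin-owned state; type and field names are illustrative, assuming sync.Mutex and the BlockHash type from this PR:

// Hypothetical: the prefix plugin keeps its own per-request state, keyed by request
// ID and guarded by a mutex, rather than storing it on the SchedulingContext.
type prefixPluginState struct {
    mu sync.Mutex
    // request ID -> prefix hashes computed for that request during scoring.
    hashesByRequest map[string][]BlockHash
}

func (s *prefixPluginState) save(requestID string, hashes []BlockHash) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.hashesByRequest[requestID] = hashes
}

func (s *prefixPluginState) clear(requestID string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    delete(s.hashesByRequest, requestID)
}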

Perhaps we need to align on a principle here. How about this?

  1. Plugins maintain their own state if the state isn't shared outside of the plugin.
  2. Shared state is passed via the SchedulingContext.

Contributor:

Yup, this principle sounds good to me.
