Add prefix cache aware scheduling #768


Draft: liu-cong wants to merge 1 commit into main

Conversation

@liu-cong (Contributor) commented May 1, 2025

This is the implementation of the proposal in #602

This PR implements a Scheduler V2, gated behind an EXPERIMENTAL_USE_SCHEDULER_V2 env var to highlight its experimental status. Follow-up work is required to converge this with the V1 scheduler.
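
For context, the rough idea from #602: the request prompt is broken into fixed-size blocks, the block prefixes are hashed, an index remembers which servers have recently seen which block hashes, and candidate servers are scored by how much of the request's prefix they are likely to have cached. The snippet below is only an illustrative sketch of that idea; the types, hashing scheme, and scoring formula are stand-ins, not this PR's actual code.

// Illustrative sketch only; names, the hashing scheme, and the scoring
// formula are stand-ins for the PR's actual implementation.
package main

import (
    "crypto/sha256"
    "fmt"
)

type BlockHash [32]byte // hash identifying a prompt prefix up to a block boundary
type ServerID string    // identifies a model server pod

// hashPrompt splits the prompt into blockSize-byte blocks and chains the hashes,
// so each hash identifies the entire prefix up to and including that block.
func hashPrompt(prompt string, blockSize int) []BlockHash {
    var hashes []BlockHash
    var prev BlockHash
    for start := 0; start+blockSize <= len(prompt); start += blockSize {
        h := sha256.Sum256(append(prev[:], prompt[start:start+blockSize]...))
        hashes = append(hashes, h)
        prev = h
    }
    return hashes
}

// scoreServer returns the fraction of leading blocks the server is believed to
// have cached, according to an index of block hash -> servers that saw it.
func scoreServer(hashes []BlockHash, server ServerID, index map[BlockHash]map[ServerID]struct{}) float64 {
    if len(hashes) == 0 {
        return 0
    }
    matched := 0
    for _, h := range hashes {
        if _, ok := index[h][server]; !ok {
            break // the prefix match ends at the first miss
        }
        matched++
    }
    return float64(matched) / float64(len(hashes))
}

func main() {
    index := map[BlockHash]map[ServerID]struct{}{}
    system := "You are a helpful assistant. Answer concisely and cite sources."

    // Pretend pod-0 already served a request sharing the same system prompt.
    for _, h := range hashPrompt(system+" Question B?", 16) {
        if index[h] == nil {
            index[h] = map[ServerID]struct{}{}
        }
        index[h]["pod-0"] = struct{}{}
    }

    hashes := hashPrompt(system+" Question A?", 16)
    fmt.Printf("pod-0: %.2f, pod-1: %.2f\n",
        scoreServer(hashes, "pod-0", index), scoreServer(hashes, "pod-1", index))
}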

Initial Benchmark Results

The initial benchmark results show that prefix aware scheduling can significantly reduce TTFT. Will follow up with more detailed results, and with the cost of enabling prefix aware scheduling.

Benchmark Setup

Model server: vLLM 0.8.3 with --enable-prefix-caching, base model meta-llama/Llama-2-7b-hf, on 4 H100 80GB.

EPP baseline image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v20250329-79fedb5

Benchmark tool: SGLang benchmark tool, using the 'generated-shared-prefix' dataset.

High Prefix Cache Hit Ratio (system-prompt-len=3000, question-len=128)

python3 sglang/bench_serving.py --host=${IP} --port=${PORT} \
--dataset-name='generated-shared-prefix' --model=meta-llama/Llama-2-7b-hf --tokenizer=meta-llama/Llama-2-7b-hf \
--request-rate=20 --backend=vllm  \
--gsp-num-groups=64 \
--gsp-prompts-per-group=32 \
--gsp-system-prompt-len=3000 \
--gsp-question-len=128 \
--gsp-output-len=256 \
--max-concurrency=200 \
--output-file=prefix.json

Generated shared prefix dataset statistics:
Number of groups: 64
Prompts per group: 32
Total prompts: 2048
Total input tokens: 6702793
Total output tokens: 524288
Average system prompt length: 3136.7 tokens
Average question length: 135.1 tokens
  • Baseline results
============ Serving Benchmark Result ============
Backend:                                 vllm      
Traffic request rate:                    20.0      
Max request concurrency:                 200       
Successful requests:                     2048      
Benchmark duration (s):                  126.99    
Total input tokens:                      6702793   
Total generated tokens:                  524288    
Total generated tokens (retokenized):    520128    
Request throughput (req/s):              16.13     
Input token throughput (tok/s):          52783.51  
Output token throughput (tok/s):         4128.69   
Total token throughput (tok/s):          56912.20  
Concurrency:                             183.71    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   11390.76  
Median E2E Latency (ms):                 11026.09  
---------------Time to First Token----------------
Mean TTFT (ms):                          1248.07   
Median TTFT (ms):                        592.66    
P99 TTFT (ms):                           7846.83   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           51.34     
Median ITL (ms):                         35.09     
P95 ITL (ms):                            115.64    
P99 ITL (ms):                            231.14    
Max ITL (ms):                            5477.88   
==================================================
  • Prefix cache aware scheduling results
============ Serving Benchmark Result ============
Backend:                                 vllm      
Traffic request rate:                    20.0      
Max request concurrency:                 200       
Successful requests:                     2047      
Benchmark duration (s):                  111.24    
Total input tokens:                      6699507   
Total generated tokens:                  524032    
Total generated tokens (retokenized):    519614    
Request throughput (req/s):              18.40     
Input token throughput (tok/s):          60225.31  
Output token throughput (tok/s):         4710.79   
Total token throughput (tok/s):          64936.10  
Concurrency:                             139.03    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   7555.45   
Median E2E Latency (ms):                 7023.81   
---------------Time to First Token----------------
Mean TTFT (ms):                          313.49    
Median TTFT (ms):                        250.95    
P99 TTFT (ms):                           1260.57   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           36.65     
Median ITL (ms):                         29.03     
P95 ITL (ms):                            85.20     
P99 ITL (ms):                            124.13    
Max ITL (ms):                            20427.40  
==================================================

Low Prefix Cache Hit Ratio (system-prompt-len=128, question-len=3000)

python3 sglang/bench_serving.py --host=${IP} --port=${PORT} \
--dataset-name='generated-shared-prefix' --model=meta-llama/Llama-2-7b-hf --tokenizer=meta-llama/Llama-2-7b-hf \
--request-rate=20 --backend=vllm  \
--gsp-num-groups=64 \
--gsp-prompts-per-group=32 \
--gsp-system-prompt-len=128 \
--gsp-question-len=3000 \
--gsp-output-len=256 \
--max-concurrency=200 \
--output-file=prefix.json

Generated shared prefix dataset statistics:
Number of groups: 64
Prompts per group: 32
Total prompts: 2048
Total input tokens: 6708932
Total output tokens: 524288
Average system prompt length: 135.5 tokens
Average question length: 3139.2 tokens
  • Baseline results
============ Serving Benchmark Result ============
Backend:                                 vllm      
Traffic request rate:                    20.0      
Max request concurrency:                 200       
Successful requests:                     2048      
Benchmark duration (s):                  153.06    
Total input tokens:                      6708932   
Total generated tokens:                  524288    
Total generated tokens (retokenized):    518178    
Request throughput (req/s):              13.38     
Input token throughput (tok/s):          43831.71  
Output token throughput (tok/s):         3425.35   
Total token throughput (tok/s):          47257.06  
Concurrency:                             185.88    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   13892.45  
Median E2E Latency (ms):                 14001.94  
---------------Time to First Token----------------
Mean TTFT (ms):                          4442.82   
Median TTFT (ms):                        4404.11   
P99 TTFT (ms):                           8979.85   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           44.82     
Median ITL (ms):                         28.51     
P95 ITL (ms):                            129.06    
P99 ITL (ms):                            229.93    
Max ITL (ms):                            5414.36   
==================================================
  • Prefix aware scheduling results
============ Serving Benchmark Result ============
Backend:                                 vllm      
Traffic request rate:                    20.0      
Max request concurrency:                 200       
Successful requests:                     1467      
Benchmark duration (s):                  111.71    
Total input tokens:                      4806132   
Total generated tokens:                  375552    
Total generated tokens (retokenized):    371345    
Request throughput (req/s):              13.13     
Input token throughput (tok/s):          43024.53  
Output token throughput (tok/s):         3361.94   
Total token throughput (tok/s):          46386.48  
Concurrency:                             122.12    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   9298.94   
Median E2E Latency (ms):                 9072.91   
---------------Time to First Token----------------
Mean TTFT (ms):                          899.32    
Median TTFT (ms):                        621.78    
P99 TTFT (ms):                           6232.01   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           39.96     
Median ITL (ms):                         27.09     
P95 ITL (ms):                            95.30     
P99 ITL (ms):                            224.41    
Max ITL (ms):                            5133.16   
==================================================

@k8s-ci-robot (Contributor)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 1, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: liu-cong
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 1, 2025

netlify bot commented May 1, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit 415a624
🔍 Latest deploy log https://app.netlify.com/sites/gateway-api-inference-extension/deploys/68139ade3d65f8000776191b
😎 Deploy Preview https://deploy-preview-768--gateway-api-inference-extension.netlify.app

@@ -171,6 +190,10 @@ func run() error {
    datastore := datastore.NewDatastore(ctx, pmf)

    scheduler := scheduling.NewScheduler(datastore)
    if schedulerV2 == "true" {
        setupLog.Info("Creating scheduler with prefixCache plugin", "prefix cache config", prefixCacheConfig)
        scheduler = scheduling.NewSchedulerV2(datastore, prefixCacheConfig)
Contributor:

I know it’s WIP, but you may consider using func NewSchedulerWithConfig to reduce the noise (no need for NewSchedulerV2).
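
For illustration only, the suggestion might look roughly like this in run(); the exact signature of NewSchedulerWithConfig and the config value passed to it are assumptions, not the project's actual API:

scheduler := scheduling.NewScheduler(datastore)
if schedulerV2 == "true" {
    setupLog.Info("Creating scheduler with prefixCache plugin", "prefix cache config", prefixCacheConfig)
    // Hypothetical: reuse the existing config-based constructor instead of adding a
    // separate NewSchedulerV2; configWithPrefixPlugin stands in for whatever config
    // registers the prefix-cache scorer and its weight.
    scheduler = scheduling.NewSchedulerWithConfig(datastore, configWithPrefixPlugin)
}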

// If a request was routed to a server, record it in the cache:
func (m *plugin) PostSchedule(ctx *types.SchedulingContext, res *types.Result) {
    targetPod := res.TargetPod.GetPod()
    m.indexer.Add(ctx.PrefixHashes, types.ServerID(targetPod.NamespacedName))
Contributor:

This code assumes the request was sent successfully? PostSchedule runs after the target pod is selected but before the request is sent to that pod. The request might fail to be sent to the target pod for various reasons, and there might be changes (like using a fallback pod). I think this should happen post-response instead of post-schedule, with the actual pod that received the request.
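
Roughly what that alternative could look like; the PostResponse hook name, its arguments, and how the serving pod is determined are assumptions for illustration, not an existing API in this repo:

// Hypothetical: record the prefix -> server mapping only after a response came back,
// using the pod that actually served the request rather than the scheduled target.
func (m *plugin) PostResponse(ctx *types.SchedulingContext, servingPod types.ServerID) {
    m.indexer.Add(ctx.PrefixHashes, servingPod)
}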

Contributor Author:

In practice I don't see how much difference it makes, given that the likelihood of a request failing is low. Using PostSchedule also has the advantage of avoiding "head of line blocking" when EPP just starts or receives a burst of requests, since it doesn't need to wait for a response.

Contributor Author:

Even if we guessed wrong (e.g., the request failed or a fallback endpoint is picked), the cost is that the next request is sent to a server which may not have the prefix cached. However, after that the prefix should be cached, and following requests should still get the cache hit.

@@ -58,6 +59,10 @@ type SchedulingContext struct {
    Logger       logr.Logger
    Req          *LLMRequest
    PodsSnapshot []Pod
    // PrefixHashes is a list of prefix hashes of the request prompt broken into blocks.
    PrefixHashes []BlockHash
Contributor:

Not sure this is the right place for these fields. It looks very prefix-specific to be stored in a general-purpose scheduling context. What if I don't want to use the prefix plugin?

Contributor Author:

These are "caches" in the life of a scheduling request. We use them at various points of the prefix plugins (Score and PostSchedule). IMP SchedulingContext is the place to share contextual info for the life of a scheduling request.

We can make this more structured maybe? Such as a map of plugin name to plugin specific contextual data.
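
A rough illustration of that more structured option, mirroring the struct above; the PluginState field name and shape are hypothetical, not part of this PR:

type SchedulingContext struct {
    Logger       logr.Logger
    Req          *LLMRequest
    PodsSnapshot []Pod
    // PluginState holds opaque, per-request data owned by individual plugins, keyed
    // by plugin name (e.g. the prefix plugin's block hashes), instead of adding
    // plugin-specific fields such as PrefixHashes directly to the context.
    PluginState map[string]any
}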

Contributor:

Why not save the request data in the plugin itself? This data is relevant only to the prefix scorer. In PreSchedule you can initialize the prefix state, and in PostSchedule you can clean it up.

Contributor Author:

RE: storing the state per plugin

I can see this as a plausible solution. As you mentioned, the plugin then needs to manage the lifecycle of this "cache", which adds complexity. Concretely this will likely be a map of request ID to its state plus a mutex to synchronize access. It's additional complexity, but not crazy.
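
A rough sketch of that plugin-owned state; type and field names are illustrative, assuming sync.Mutex and the BlockHash type from this PR:

// Hypothetical: the prefix plugin keeps its own per-request state, keyed by request
// ID and guarded by a mutex, rather than storing it on the SchedulingContext.
type prefixPluginState struct {
    mu sync.Mutex
    // request ID -> prefix hashes computed for that request during scoring.
    hashesByRequest map[string][]BlockHash
}

func (s *prefixPluginState) save(requestID string, hashes []BlockHash) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.hashesByRequest[requestID] = hashes
}

func (s *prefixPluginState) clear(requestID string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    delete(s.hashesByRequest, requestID)
}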

Perhaps we need to align on a principle here. How about this?

  1. Plugins maintain their own state if the state isn't shared outside of the plugin.
  2. Shared state is passed via the SchedulingContext.

Contributor:

Yup, this principle sounds good to me.
