Add prefix cache aware scheduling #768
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: liu-cong.
@@ -171,6 +190,10 @@ func run() error {
	datastore := datastore.NewDatastore(ctx, pmf)

	scheduler := scheduling.NewScheduler(datastore)
	if schedulerV2 == "true" {
		setupLog.Info("Creating scheduler with prefixCache plugin", "prefix cache config", prefixCacheConfig)
		scheduler = scheduling.NewSchedulerV2(datastore, prefixCacheConfig)
I know it's WIP, but you may consider using func NewSchedulerWithConfig to reduce the noise (no need for NewSchedulerV2).
// If a request was routed to a server, record it in the cache:
func (m *plugin) PostSchedule(ctx *types.SchedulingContext, res *types.Result) {
	targetPod := res.TargetPod.GetPod()
	m.indexer.Add(ctx.PrefixHashes, types.ServerID(targetPod.NamespacedName))
This code assumes the request was sent successfully? PostSchedule runs after the target pod is selected but before the request is sent to that pod. Sending to the target pod might fail for various reasons, and things might change (like falling back to another pod). I think this should be a post-response hook instead of post-schedule, recording the actual pod that received the request.
In practice I don't expect much difference, given that the likelihood of a request failing is low. Using PostSchedule also has the advantage of avoiding "head of line blocking" when the EPP just starts or during a burst of requests, since it doesn't need to wait for responses.
Even if we guessed wrong (e.g., the request failed or a fallback endpoint was picked), the cost is that the next request is sent to a server which may not have the cache. However, after that the prefix should be cached, and following requests should still get cache hits.
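For readers following along, a minimal sketch of what an indexer like the one in the diff above might look like. The `BlockHash`, `ServerID`, and `Add` names follow the diff; the internal map layout and the `MatchLen` lookup are assumptions, not the project's actual implementation:

```go
package main

import "fmt"

// BlockHash identifies one fixed-size block of the prompt prefix.
type BlockHash uint64

// ServerID identifies a model server pod.
type ServerID string

// indexer records which servers are likely to hold the KV cache
// for a given prefix block.
type indexer struct {
	table map[BlockHash]map[ServerID]struct{}
}

func newIndexer() *indexer {
	return &indexer{table: make(map[BlockHash]map[ServerID]struct{})}
}

// Add records that server holds (or soon will hold) the given blocks.
func (i *indexer) Add(hashes []BlockHash, server ServerID) {
	for _, h := range hashes {
		if i.table[h] == nil {
			i.table[h] = make(map[ServerID]struct{})
		}
		i.table[h][server] = struct{}{}
	}
}

// MatchLen returns how many leading blocks of the request are
// believed to be cached on server.
func (i *indexer) MatchLen(hashes []BlockHash, server ServerID) int {
	n := 0
	for _, h := range hashes {
		if _, ok := i.table[h][server]; !ok {
			break
		}
		n++
	}
	return n
}

func main() {
	idx := newIndexer()
	idx.Add([]BlockHash{1, 2, 3}, "pod-a")
	fmt.Println(idx.MatchLen([]BlockHash{1, 2, 3, 4}, "pod-a")) // prints: 3
	fmt.Println(idx.MatchLen([]BlockHash{1, 2}, "pod-b"))       // prints: 0
}
```

This also illustrates the trade-off discussed above: `Add` is optimistic, so a failed send only pollutes the table for one extra request.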
@@ -58,6 +59,10 @@ type SchedulingContext struct {
	Logger       logr.Logger
	Req          *LLMRequest
	PodsSnapshot []Pod
	// PrefixHashes is a list of prefix hashes of the request prompt broken into blocks.
	PrefixHashes []BlockHash
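As background on what "prefix hashes of the request prompt broken into blocks" could mean, here is a hedged sketch: the prompt is split into fixed-size character blocks, and each block's hash is chained with the previous one so a single `BlockHash` identifies the whole prefix up to that block. The chaining scheme, block size, and FNV hash are assumptions for illustration, not necessarily what this PR does:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// BlockHash identifies a prefix of the prompt up to a block boundary.
type BlockHash uint64

// hashPrompt splits the prompt into fixed-size blocks and returns one
// hash per complete block. Each hash folds in the previous hash, so it
// identifies the entire prefix, not just that block's own bytes.
func hashPrompt(prompt string, blockSize int) []BlockHash {
	var hashes []BlockHash
	var prev BlockHash
	for start := 0; start+blockSize <= len(prompt); start += blockSize {
		h := fnv.New64a()
		var buf [8]byte
		binary.LittleEndian.PutUint64(buf[:], uint64(prev)) // chain previous hash
		h.Write(buf[:])
		h.Write([]byte(prompt[start : start+blockSize]))
		prev = BlockHash(h.Sum64())
		hashes = append(hashes, prev)
	}
	return hashes
}

func main() {
	a := hashPrompt("you are a helpful assistant. hello", 8)
	b := hashPrompt("you are a helpful assistant. howdy", 8)
	// Two prompts sharing a system-prompt prefix produce identical
	// leading hashes, which is what makes cache-aware scoring possible.
	fmt.Println(len(a), a[0] == b[0]) // prints: 4 true
}
```

Chaining is what lets a scorer compare only the leading hashes instead of re-hashing the full prefix for every candidate length.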
Not sure this is the right place for these fields.
They look too specific to the prefix plugin to be stored in a general-purpose scheduling context. What if I don't want to use the prefix plugin?
These are "caches" for the life of a scheduling request. We use them at various points in the prefix plugin (Score and PostSchedule). IMO SchedulingContext is the place to share contextual info for the life of a scheduling request.
We can make this more structured maybe? Such as a map of plugin name to plugin-specific contextual data.
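One hedged way the "map of plugin name to plugin-specific contextual data" idea could look. All names here are illustrative, not the project's actual API:

```go
package main

import "fmt"

// SchedulingContext carries per-request state through one scheduling
// cycle. Instead of plugin-specific fields like PrefixHashes, it holds
// an opaque bag keyed by plugin name (illustrative sketch only).
type SchedulingContext struct {
	pluginState map[string]any
}

// SetPluginState stores a plugin's private state for this request.
func (c *SchedulingContext) SetPluginState(plugin string, state any) {
	if c.pluginState == nil {
		c.pluginState = make(map[string]any)
	}
	c.pluginState[plugin] = state
}

// PluginState retrieves a plugin's state, reporting whether it was set.
func (c *SchedulingContext) PluginState(plugin string) (any, bool) {
	s, ok := c.pluginState[plugin]
	return s, ok
}

// prefixState is what the prefix plugin might stash for one request.
type prefixState struct {
	hashes []uint64
}

func main() {
	ctx := &SchedulingContext{}
	ctx.SetPluginState("prefix-cache", &prefixState{hashes: []uint64{1, 2}})
	if s, ok := ctx.PluginState("prefix-cache"); ok {
		fmt.Println(len(s.(*prefixState).hashes)) // prints: 2
	}
}
```

This keeps SchedulingContext generic: plugins that aren't enabled simply never write into the map, addressing the "what if I don't want to use the prefix plugin" concern.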
Why not save the request data in the plugin itself? This data is relevant only to the prefix scorer. In PreSchedule you can initialize the prefix state, and in PostSchedule clean it up.
RE: storing the state per plugin
I can see this as a plausible solution. As you mentioned, the plugin then needs to manage the lifecycle of this "cache", which adds complexity. Concretely, this will likely be a map of request ID to its state, plus a mutex to synchronize access. It's additional complexity, but not crazy.
Perhaps we need to align on a principle here, how about this?
- Plugin should maintain their own state if the state isn't shared outside of the plugin
- Share state via the SchedulingContext
yup, this principle sounds good to me.
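A sketch of the alternative agreed on above: the plugin owns its per-request state behind a mutex, initialized in PreSchedule and cleaned up in PostSchedule. The struct and method names are illustrative, not the real plugin interface:

```go
package main

import (
	"fmt"
	"sync"
)

// prefixPlugin keeps its own per-request state, following the principle
// that state not shared outside a plugin stays inside the plugin.
// (Illustrative sketch; names do not match the real plugin.)
type prefixPlugin struct {
	mu    sync.Mutex
	state map[string][]uint64 // request ID -> prefix block hashes
}

// PreSchedule initializes the prefix state for a request.
func (p *prefixPlugin) PreSchedule(reqID string, hashes []uint64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.state == nil {
		p.state = make(map[string][]uint64)
	}
	p.state[reqID] = hashes
}

// PostSchedule consumes and deletes the state for the request, so
// entries don't leak after the scheduling cycle ends.
func (p *prefixPlugin) PostSchedule(reqID string) []uint64 {
	p.mu.Lock()
	defer p.mu.Unlock()
	hashes := p.state[reqID]
	delete(p.state, reqID)
	return hashes
}

func main() {
	p := &prefixPlugin{}
	p.PreSchedule("req-1", []uint64{10, 11})
	fmt.Println(len(p.PostSchedule("req-1"))) // prints: 2
	fmt.Println(len(p.PostSchedule("req-1"))) // prints: 0 (state cleaned up)
}
```

The mutex and the explicit delete are exactly the lifecycle bookkeeping the thread calls "additional complexity, but not crazy".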
This is the implementation of the proposal in #602.
This PR implements a Scheduler V2 controlled by an
EXPERIMENTAL_USE_SCHEDULER_V2
env var to highlight its experimental status. Follow-up work is required to converge this with the V1 scheduler.
Initial Benchmark Results
The initial benchmark results show that prefix cache aware scheduling can significantly reduce TTFT. Will follow up with more detailed results, and with the cost of enabling prefix cache aware scheduling.
Benchmark Setup
Model server: vLLM 0.8.3 with --enable-prefix-caching, base model meta-llama/Llama-2-7b-hf, on 4x H100 80GB.
EPP baseline image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v20250329-79fedb5
Benchmark tool: SGLang benchmark tool, using the 'generated-shared-prefix' dataset.
High Prefix Cache Hit Ratio (system-prompt-len=3000, question-len=128)
Low Prefix Cache Hit Ratio (system-prompt-len=128, question-len=3000)