Prefix cache and load aware routing policy #677

Open · 6 tasks

gangmuk opened this issue Feb 14, 2025 · 0 comments
Labels: area/gateway, area/kv-cache, area/performance, area/scheduling, kind/enhancement, kind/feature
Milestone: v0.3.0

Comments

gangmuk (Collaborator) commented Feb 14, 2025

🚀 Feature Description and Motivation

Currently, AiBrix supports a simple prefix-aware routing policy. From a data-structure perspective, it uses a hash table over fixed-size blocks, where a block represents a fixed number of consecutive tokens.
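For context, here is a minimal sketch of what such a fixed-size-block, hash-table prefix index can look like. It is written in Go to match the codebase, but every name (`blockHash`, `prefixIndex`, `matchedBlocks`) and the block size are illustrative assumptions, not AiBrix's actual implementation:

```go
// Illustrative sketch of a fixed-size-block prefix index; names, layout,
// and block size are hypothetical, not AiBrix's actual implementation.
package main

import (
	"fmt"
	"hash/fnv"
)

const blockSize = 4 // tokens per block (assumed value)

// blockHash hashes one block of tokens together with the hash of the
// previous block, so equal hashes imply an equal token prefix so far.
func blockHash(prev uint64, block []int) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d:", prev)
	for _, t := range block {
		fmt.Fprintf(h, "%d,", t)
	}
	return h.Sum64()
}

// prefixIndex maps a block hash to the set of pods caching that prefix.
type prefixIndex map[uint64]map[string]struct{}

// matchedBlocks returns how many leading blocks of tokens are cached on pod.
func (idx prefixIndex) matchedBlocks(tokens []int, pod string) int {
	matched, prev := 0, uint64(0)
	for i := 0; i+blockSize <= len(tokens); i += blockSize {
		prev = blockHash(prev, tokens[i:i+blockSize])
		pods, ok := idx[prev]
		if !ok {
			break
		}
		if _, ok := pods[pod]; !ok {
			break
		}
		matched++
	}
	return matched
}

func main() {
	idx := prefixIndex{}
	toks := []int{1, 2, 3, 4, 5, 6, 7, 8}
	h := blockHash(0, toks[:blockSize])
	idx[h] = map[string]struct{}{"pod-a": {}}
	fmt.Println(idx.matchedBlocks(toks, "pod-a")) // 1: first block cached on pod-a
}
```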

More sophisticated prefix-aware routing would be useful; examples include Preble, SGLang, and D^2LPM.

As an initial prototype, I plan to implement Preble-style scheduling in AiBrix quickly, on a best-effort basis.

Work items

  • Implement a radix-tree-based cache (to be implemented in aibrix/pkg/plugins/gateway/prefixcacheindexer); see the sketch after this list
  • Implement a Preble-like routing policy that considers load and prefix together more carefully (the routing logic needs to be implemented in aibrix/pkg/plugins/gateway/algorithms); also sketched below
  • Benchmark the new routing policy against the current prefix routing policy and a load-only-aware policy
    • latency metrics
    • cache hit ratio
    • GPU memory utilization (i.e., KV cache utilization in memory, which is different from KV cache hit ratio)
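The sketch below illustrates the first two work items: a token-level radix tree node for the prefix-cache indexer, and a score that trades prefix reuse against load in the spirit of Preble. All names, fields, and the linear weighting are assumptions for illustration, not the final design:

```go
// Illustrative sketches for the first two work items; everything here
// (names, fields, weighting) is an assumption, not the final design.
package main

import "fmt"

// radixNode is a minimal token-level radix (compressed trie) node for a
// prefix-cache indexer: each edge is labeled with a run of token IDs,
// and each node records which pods hold the KV cache for that prefix.
type radixNode struct {
	edge     []int               // token run on the edge into this node
	children map[int]*radixNode  // keyed by the first token of the child edge
	pods     map[string]struct{} // pods caching this prefix
}

// matchedTokens walks the tree and returns how many leading tokens of
// `tokens` are cached on `pod`. Edge splitting on insert is omitted.
func (n *radixNode) matchedTokens(tokens []int, pod string) int {
	matched := 0
	cur := n
	for len(tokens) > 0 {
		child, ok := cur.children[tokens[0]]
		if !ok || len(tokens) < len(child.edge) {
			break
		}
		// The whole edge label must match the next tokens.
		for i, t := range child.edge {
			if tokens[i] != t {
				return matched
			}
		}
		if _, ok := child.pods[pod]; !ok {
			break
		}
		matched += len(child.edge)
		tokens = tokens[len(child.edge):]
		cur = child
	}
	return matched
}

// score combines prefix reuse (higher is better) with load (lower is
// better), in the spirit of Preble; alpha is a tunable weight.
func score(matched, prompt, running int, alpha float64) float64 {
	hit := 0.0
	if prompt > 0 {
		hit = float64(matched) / float64(prompt)
	}
	return hit - alpha*float64(running)
}

func main() {
	root := &radixNode{children: map[int]*radixNode{}}
	root.children[1] = &radixNode{
		edge: []int{1, 2, 3},
		pods: map[string]struct{}{"pod-a": {}},
	}
	m := root.matchedTokens([]int{1, 2, 3, 4}, "pod-a") // 3 tokens matched
	fmt.Println(m, score(m, 4, 2, 0.1))                 // 3 0.55
}
```

A real policy would also handle edge splitting on insert, eviction, and a more principled load signal; this only shows the shape of the data structure and the score.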

One fragile aspect of the Preble scheduling logic is that it relies on magic numbers to estimate:

  • prefill cost for a specific LLM model on a certain GPU
  • decode cost for a specific LLM model on a certain GPU

Preble fits these with linear regression, and the coefficient and intercept are hardcoded in the Preble code.
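For concreteness, a hedged sketch of that kind of linear cost model follows; the coefficients and intercepts below are placeholders, not the values hardcoded in Preble:

```go
// Hypothetical linear cost model in the style of Preble's hardcoded
// regression: latency ~= coef * tokens + intercept. The numbers below
// are placeholders, not values from the Preble codebase.
package main

import "fmt"

type linearCost struct {
	coef, intercept float64 // fit offline per (model, GPU) pair
}

// Placeholder coefficients for one hypothetical (model, GPU) pair.
var (
	prefillCost = linearCost{coef: 0.25, intercept: 5.0} // ms, per prompt token
	decodeCost  = linearCost{coef: 30.0, intercept: 2.0} // ms, per output token
)

func (c linearCost) estimate(tokens int) float64 {
	return c.coef*float64(tokens) + c.intercept
}

// estimateRequestMs predicts total request latency as the sum of the
// prefill and decode estimates.
func estimateRequestMs(promptTokens, outputTokens int) float64 {
	return prefillCost.estimate(promptTokens) + decodeCost.estimate(outputTokens)
}

func main() {
	fmt.Printf("estimated latency: %.1f ms\n", estimateRequestMs(512, 128))
}
```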

Use Case

Better request routing for better serving performance (lower latency, higher KV cache hit ratio).

Proposed Solution

No response

gangmuk added the kind/enhancement, area/gateway, kind/feature, area/performance, area/scheduling, area/kv-cache labels on Feb 14, 2025
gangmuk changed the title from "Routing policy that considers both cached prefix and load together (e.g., Preble)" to "Prefix cache and load aware routing policy (e.g., Preble)" on Feb 15, 2025
Jeffwan added this to the v0.3.0 milestone on Feb 15, 2025
gangmuk changed the title from "Prefix cache and load aware routing policy (e.g., Preble)" to "Prefix cache and load aware routing policy" on Feb 19, 2025