Prefix cache and load aware routing policy #677

Open · 6 tasks

gangmuk opened this issue Feb 14, 2025 · 0 comments
Labels: area/gateway, area/kv-cache, area/performance, area/scheduling, kind/enhancement, kind/feature
Milestone: v0.3.0

Comments

gangmuk (Collaborator) commented Feb 14, 2025

🚀 Feature Description and Motivation

Currently, AiBrix supports a simple prefix-aware routing policy. From a data-structure perspective, it uses a hash table over fixed-size blocks, where a block represents a fixed number of consecutive tokens.
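For context, here is a minimal sketch of what such a fixed-size-block, hash-table prefix index can look like. It is written in Go to match the codebase, but every name (`blockHash`, `prefixIndex`, `matchedBlocks`) and the block size are illustrative assumptions, not AiBrix's actual implementation:

```go
// Illustrative sketch of a fixed-size-block prefix index; names, layout,
// and block size are hypothetical, not AiBrix's actual implementation.
package main

import (
	"fmt"
	"hash/fnv"
)

const blockSize = 4 // tokens per block (assumed value)

// blockHash hashes one block of tokens together with the hash of the
// previous block, so equal hashes imply an equal token prefix so far.
func blockHash(prev uint64, block []int) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d:", prev)
	for _, t := range block {
		fmt.Fprintf(h, "%d,", t)
	}
	return h.Sum64()
}

// prefixIndex maps a block hash to the set of pods caching that prefix.
type prefixIndex map[uint64]map[string]struct{}

// matchedBlocks returns how many leading blocks of tokens are cached on pod.
func (idx prefixIndex) matchedBlocks(tokens []int, pod string) int {
	matched, prev := 0, uint64(0)
	for i := 0; i+blockSize <= len(tokens); i += blockSize {
		prev = blockHash(prev, tokens[i:i+blockSize])
		pods, ok := idx[prev]
		if !ok {
			break
		}
		if _, ok := pods[pod]; !ok {
			break
		}
		matched++
	}
	return matched
}

func main() {
	idx := prefixIndex{}
	toks := []int{1, 2, 3, 4, 5, 6, 7, 8}
	h := blockHash(0, toks[:blockSize])
	idx[h] = map[string]struct{}{"pod-a": {}}
	fmt.Println(idx.matchedBlocks(toks, "pod-a")) // 1: first block cached on pod-a
}
```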

More sophisticated prefix-aware routing would be useful; examples include Preble, SGLang, and D^2LPM.

As an initial prototype, I plan to implement Preble-style scheduling in AiBrix quickly, on a best-effort basis.

Work items

  • Implement a radix-tree-based cache (to be implemented in aibrix/pkg/plugins/gateway/prefixcacheindexer); see the sketch after this list
  • Implement a Preble-like routing policy that considers load and prefix together more carefully (the routing logic needs to be implemented in aibrix/pkg/plugins/gateway/algorithms); also sketched below
  • Benchmark the new routing policy against the current prefix routing policy and a load-only-aware policy
    • latency metrics
    • cache hit ratio
    • GPU memory utilization (i.e., KV cache utilization in memory, which is different from KV cache hit ratio)
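The sketch below illustrates the first two work items: a token-level radix tree node for the prefix-cache indexer, and a score that trades prefix reuse against load in the spirit of Preble. All names, fields, and the linear weighting are assumptions for illustration, not the final design:

```go
// Illustrative sketches for the first two work items; everything here
// (names, fields, weighting) is an assumption, not the final design.
package main

import "fmt"

// radixNode is a minimal token-level radix (compressed trie) node for a
// prefix-cache indexer: each edge is labeled with a run of token IDs,
// and each node records which pods hold the KV cache for that prefix.
type radixNode struct {
	edge     []int               // token run on the edge into this node
	children map[int]*radixNode  // keyed by the first token of the child edge
	pods     map[string]struct{} // pods caching this prefix
}

// matchedTokens walks the tree and returns how many leading tokens of
// `tokens` are cached on `pod`. Edge splitting on insert is omitted.
func (n *radixNode) matchedTokens(tokens []int, pod string) int {
	matched := 0
	cur := n
	for len(tokens) > 0 {
		child, ok := cur.children[tokens[0]]
		if !ok || len(tokens) < len(child.edge) {
			break
		}
		// The whole edge label must match the next tokens.
		for i, t := range child.edge {
			if tokens[i] != t {
				return matched
			}
		}
		if _, ok := child.pods[pod]; !ok {
			break
		}
		matched += len(child.edge)
		tokens = tokens[len(child.edge):]
		cur = child
	}
	return matched
}

// score combines prefix reuse (higher is better) with load (lower is
// better), in the spirit of Preble; alpha is a tunable weight.
func score(matched, prompt, running int, alpha float64) float64 {
	hit := 0.0
	if prompt > 0 {
		hit = float64(matched) / float64(prompt)
	}
	return hit - alpha*float64(running)
}

func main() {
	root := &radixNode{children: map[int]*radixNode{}}
	root.children[1] = &radixNode{
		edge: []int{1, 2, 3},
		pods: map[string]struct{}{"pod-a": {}},
	}
	m := root.matchedTokens([]int{1, 2, 3, 4}, "pod-a") // 3 tokens matched
	fmt.Println(m, score(m, 4, 2, 0.1))                 // 3 0.55
}
```

A real policy would also handle edge splitting on insert, eviction, and a more principled load signal; this only shows the shape of the data structure and the score.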

One fragile aspect of the Preble scheduling logic is that it relies on magic numbers to estimate:

  • prefill cost for a specific LLM model on a certain GPU
  • decode cost for a specific LLM model on a certain GPU

Preble fits these with linear regression, and the coefficient and intercept are hardcoded in the Preble code.
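For concreteness, a hedged sketch of that kind of linear cost model follows; the coefficients and intercepts below are placeholders, not the values hardcoded in Preble:

```go
// Hypothetical linear cost model in the style of Preble's hardcoded
// regression: latency ~= coef * tokens + intercept. The numbers below
// are placeholders, not values from the Preble codebase.
package main

import "fmt"

type linearCost struct {
	coef, intercept float64 // fit offline per (model, GPU) pair
}

// Placeholder coefficients for one hypothetical (model, GPU) pair.
var (
	prefillCost = linearCost{coef: 0.25, intercept: 5.0} // ms, per prompt token
	decodeCost  = linearCost{coef: 30.0, intercept: 2.0} // ms, per output token
)

func (c linearCost) estimate(tokens int) float64 {
	return c.coef*float64(tokens) + c.intercept
}

// estimateRequestMs predicts total request latency as the sum of the
// prefill and decode estimates.
func estimateRequestMs(promptTokens, outputTokens int) float64 {
	return prefillCost.estimate(promptTokens) + decodeCost.estimate(outputTokens)
}

func main() {
	fmt.Printf("estimated latency: %.1f ms\n", estimateRequestMs(512, 128))
}
```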

Use Case

Better request routing for better serving performance (lower latency, higher KV cache hit ratio).

Proposed Solution

No response

gangmuk added the kind/enhancement, area/gateway, kind/feature, area/performance, area/scheduling, area/kv-cache labels on Feb 14, 2025
gangmuk changed the title from "Routing policy that considers both cached prefix and load together (e.g., Preble)" to "Prefix cache and load aware routing policy (e.g., Preble)" on Feb 15, 2025
Jeffwan added this to the v0.3.0 milestone on Feb 15, 2025
gangmuk changed the title from "Prefix cache and load aware routing policy (e.g., Preble)" to "Prefix cache and load aware routing policy" on Feb 19, 2025