2 changes: 1 addition & 1 deletion crd-ref-docs.yaml
@@ -4,7 +4,7 @@

processor:
ignoreTypes:
-    - "(InferencePool|InferenceObjective|InferencePoolImport)List$"
+    - "(InferencePool|InferenceObjective|InferencePoolImport|InferenceModelRewrite)List$"
# RE2 regular expressions describing type fields that should be excluded from the generated documentation.
ignoreFields:
- "TypeMeta$"
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -98,6 +98,7 @@ nav:
- InferencePool: api-types/inferencepool.md
- InferenceObjective: api-types/inferenceobjective.md
- InferencePoolImport: api-types/inferencepoolimport.md
+ - InferenceModelRewrite: api-types/inferencemodelrewrite.md
- Enhancements:
- Overview: enhancements/overview.md
- Contributing:
95 changes: 95 additions & 0 deletions site-src/api-types/inferencemodelrewrite.md
@@ -0,0 +1,95 @@
# Inference Model Rewrite

??? example "Alpha since v1.2.1"

    The `InferenceModelRewrite` resource is alpha and may have breaking changes in
    future releases of the API.

## Background

The **InferenceModelRewrite** resource allows platform administrators and model owners to control how inference requests are routed to specific models within an Inference Pool.
This capability is essential for managing model lifecycles without disrupting client applications.

## Use Cases

* **Model Aliasing**: Map a model name in the request body (e.g., `food-review`) to a specific version (e.g., `food-review-v1`).
* **Generic Fallbacks**: Redirect unknown model requests to a default model.
* **Traffic Splitting**: Gradually roll out new model versions (Canary deployment) by splitting traffic between two models based on percentage weights.

## Spec

The full spec of the `InferenceModelRewrite` resource is defined [here](/reference/x-v1a2-spec/#inferencemodelrewrite).

## Usage Examples

### Model Aliasing

Map a virtual model name (e.g., `food-review`) to a specific backend model version (e.g., `food-review-v1`).

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModelRewrite
metadata:
  name: food-review-alias
spec:
  poolRef:
    group: inference.networking.k8s.io
    name: vllm-llama3-8b-instruct
  rules:
  - matches:
    - model:
        type: Exact
        value: food-review
    targets:
    - modelRewrite: "food-review-v1"
```
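Conceptually, the gateway rewrites the `model` field of the request body before the request reaches the pool. A minimal sketch of that effect on an OpenAI-style body (the `rewrites` dictionary and field names here are illustrative, not the actual implementation):

```python
# Illustrative only: the effect of the aliasing rule above on an
# OpenAI-compatible request body.
import json

request_body = {"model": "food-review", "prompt": "Write a review"}

# The gateway rewrites the `model` field before forwarding to the pool;
# unmatched names pass through unchanged.
rewrites = {"food-review": "food-review-v1"}
request_body["model"] = rewrites.get(request_body["model"], request_body["model"])

print(json.dumps(request_body))
```

The client keeps sending `food-review`; only the backend sees the versioned name.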

### Generic (Wildcard) Rewrites

Redirect any request with an unrecognized or unspecified model name to a default safe model. An empty `matches` list implies that the rule applies to **all** requests not matched by previous rules.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModelRewrite
metadata:
  name: generic-fallback
spec:
  poolRef:
    group: inference.networking.k8s.io
    name: vllm-llama3-8b-instruct
  rules:
  - matches: [] # Empty means this rule matches everything
    targets:
    - modelRewrite: "meta-llama/Llama-3.1-8B-Instruct"
```
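Rules are evaluated in order, and an empty `matches` list acts as a catch-all for anything earlier rules did not match. A rough sketch of that first-match-wins evaluation, assuming `Exact` match semantics (illustrative Python, not the controller's code):

```python
# Hedged sketch: first-match-wins rule evaluation where an empty
# `matches` list applies to every request.
def resolve_model(requested: str, rules: list) -> str:
    """Return the rewritten model name, or the original if no rule matches."""
    for rule in rules:
        matches = rule.get("matches", [])
        if not matches or any(
            m["model"]["type"] == "Exact" and m["model"]["value"] == requested
            for m in matches
        ):
            return rule["targets"][0]["modelRewrite"]
    return requested

rules = [
    {"matches": [{"model": {"type": "Exact", "value": "food-review"}}],
     "targets": [{"modelRewrite": "food-review-v1"}]},
    {"matches": [],  # catch-all fallback, as in the example above
     "targets": [{"modelRewrite": "meta-llama/Llama-3.1-8B-Instruct"}]},
]

print(resolve_model("food-review", rules))    # food-review-v1
print(resolve_model("unknown-model", rules))  # meta-llama/Llama-3.1-8B-Instruct
```

Because evaluation stops at the first match, the catch-all rule should come last.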

### Traffic Splitting (Canary Rollout)

Divide incoming traffic for a single model name across multiple backend models. This is useful for A/B testing or gradual rollouts.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModelRewrite
metadata:
  name: food-review-canary
spec:
  poolRef:
    group: inference.networking.k8s.io
    name: vllm-llama3-8b-instruct
  rules:
  - matches:
    - model:
        type: Exact
        value: food-review
    targets:
    - modelRewrite: "food-review-v1"
      weight: 90
    - modelRewrite: "food-review-v2"
      weight: 10
```

## Limitations

1. **Status Reporting**: `InferenceModelRewrite` is currently a configuration-only resource: it does not report status conditions (e.g., `Valid` or `Ready`) in its `status` field.
2. **Scheduler Assumptions**: Traffic splitting occurs before the scheduling algorithm. The system assumes that all model servers within the referenced `InferencePool` are capable of serving the target models. If a model is missing from a specific server in the pool, requests routed to it may fail.
3. **Splitting Algorithm**: The current traffic split is weighted-random: each request is assigned independently to a target with probability proportional to its weight.
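The weighted-random behavior can be sketched as a single weighted draw per request, assuming weights are treated as relative probabilities (illustrative Python, not the gateway's implementation):

```python
# Hedged sketch of weighted-random target selection for the 90/10
# canary example above.
import random

targets = [
    {"modelRewrite": "food-review-v1", "weight": 90},
    {"modelRewrite": "food-review-v2", "weight": 10},
]

def pick_target(targets):
    # random.choices performs one draw weighted by the target weights.
    return random.choices(
        [t["modelRewrite"] for t in targets],
        weights=[t["weight"] for t in targets],
        k=1,
    )[0]

counts = {"food-review-v1": 0, "food-review-v2": 0}
for _ in range(10_000):
    counts[pick_target(targets)] += 1
print(counts)  # roughly a 90/10 split, e.g. ~9000 vs ~1000
```

Because each draw is independent, there is no per-client stickiness: the same client may hit different model versions on successive requests.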