[Feature]: Expose per-request scheduling priority in the trtllm-serve OpenAI frontend

The feature, motivation and pitch

The LLM API already supports per-request scheduling priority: `LLM.generate_async(..., priority=...)` (range `[0, 1]`, default `DEFAULT_REQUEST_PRIORITY = 0.5`), honored by the PyTorch backend when the engine runs with `scheduler_config.waiting_queue_policy=priority`.

However, `trtllm-serve`'s OpenAI-compatible frontend does not expose this — there is no way for an HTTP client to request elevated (or reduced) scheduling priority on `/v1/chat/completions` or `/v1/completions`.

We (DeepInfra) run `trtllm-serve` in production and want to offer tiered scheduling: latency-sensitive traffic gets higher priority, batch/background traffic gets lower priority, on the same deployment. This only needs a small frontend change: accept an optional `priority: float` extra param on `ChatCompletionRequest` / `CompletionRequest` and forward it to `generate_async`.

We validated the underlying scheduler behavior on a 4xGB300 deployment at 160-way saturation: requests sent with priority 1.0 saw p50 TTFT of 0.947s vs 2.857s for priority 0.0, while an equal-priority control run showed no gap.

I have a small PR ready (~25 lines, two files in `tensorrt_llm/serve/`) and will link it here.

### Alternatives

- Running separate deployments per traffic tier — wastes capacity and loses KV-cache sharing.
- Mapping the OpenAI `service_tier` field to priority — rejected because `service_tier` has OpenAI-specific billing semantics; a dedicated numeric extra param matches the existing LLM API surface (`priority` in `[0,1]`) exactly. Other inference servers (e.g. vLLM) expose the same concept as a `priority` extra param in their OpenAI frontends.

### Additional context

Priority is only honored when the engine is launched with `waiting_queue_policy: priority`; under the default FCFS policy the field is accepted and ignored, so the change is backward-compatible.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature]: Expose per-request scheduling priority in the trtllm-serve OpenAI frontend #15327

Alternatives

Additional context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Feature]: Expose per-request scheduling priority in the trtllm-serve OpenAI frontend #15327

Description

Alternatives

Additional context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions