The feature, motivation and pitch
The LLM API already supports per-request scheduling priority: LLM.generate_async(..., priority=...) (range [0, 1], default DEFAULT_REQUEST_PRIORITY = 0.5), honored by the PyTorch backend when the engine runs with scheduler_config.waiting_queue_policy=priority.
However, trtllm-serve's OpenAI-compatible frontend does not expose this — there is no way for an HTTP client to request elevated (or reduced) scheduling priority on /v1/chat/completions or /v1/completions.
We (DeepInfra) run trtllm-serve in production and want to offer tiered scheduling: latency-sensitive traffic gets higher priority, batch/background traffic gets lower priority, on the same deployment. This only needs a small frontend change: accept an optional priority: float extra param on ChatCompletionRequest / CompletionRequest and forward it to generate_async.
We validated the underlying scheduler behavior on a 4xGB300 deployment at 160-way saturation: requests sent with priority 1.0 saw p50 TTFT of 0.947s vs 2.857s for priority 0.0, while an equal-priority control run showed no gap.
I have a small PR ready (~25 lines, two files in tensorrt_llm/serve/) and will link it here.
Alternatives
- Running separate deployments per traffic tier — wastes capacity and loses KV-cache sharing.
- Mapping the OpenAI
service_tier field to priority — rejected because service_tier has OpenAI-specific billing semantics; a dedicated numeric extra param matches the existing LLM API surface (priority in [0,1]) exactly. Other inference servers (e.g. vLLM) expose the same concept as a priority extra param in their OpenAI frontends.
Additional context
Priority is only honored when the engine is launched with waiting_queue_policy: priority; under the default FCFS policy the field is accepted and ignored, so the change is backward-compatible.
The feature, motivation and pitch
The LLM API already supports per-request scheduling priority:
LLM.generate_async(..., priority=...)(range[0, 1], defaultDEFAULT_REQUEST_PRIORITY = 0.5), honored by the PyTorch backend when the engine runs withscheduler_config.waiting_queue_policy=priority.However,
trtllm-serve's OpenAI-compatible frontend does not expose this — there is no way for an HTTP client to request elevated (or reduced) scheduling priority on/v1/chat/completionsor/v1/completions.We (DeepInfra) run
trtllm-servein production and want to offer tiered scheduling: latency-sensitive traffic gets higher priority, batch/background traffic gets lower priority, on the same deployment. This only needs a small frontend change: accept an optionalpriority: floatextra param onChatCompletionRequest/CompletionRequestand forward it togenerate_async.We validated the underlying scheduler behavior on a 4xGB300 deployment at 160-way saturation: requests sent with priority 1.0 saw p50 TTFT of 0.947s vs 2.857s for priority 0.0, while an equal-priority control run showed no gap.
I have a small PR ready (~25 lines, two files in
tensorrt_llm/serve/) and will link it here.Alternatives
service_tierfield to priority — rejected becauseservice_tierhas OpenAI-specific billing semantics; a dedicated numeric extra param matches the existing LLM API surface (priorityin[0,1]) exactly. Other inference servers (e.g. vLLM) expose the same concept as apriorityextra param in their OpenAI frontends.Additional context
Priority is only honored when the engine is launched with
waiting_queue_policy: priority; under the default FCFS policy the field is accepted and ignored, so the change is backward-compatible.