Skip to content

[Feature]: Expose per-request scheduling priority in the trtllm-serve OpenAI frontend #15327

Description

@sopwg612

The feature, motivation and pitch

The LLM API already supports per-request scheduling priority: LLM.generate_async(..., priority=...) (range [0, 1], default DEFAULT_REQUEST_PRIORITY = 0.5), honored by the PyTorch backend when the engine runs with scheduler_config.waiting_queue_policy=priority.

However, trtllm-serve's OpenAI-compatible frontend does not expose this — there is no way for an HTTP client to request elevated (or reduced) scheduling priority on /v1/chat/completions or /v1/completions.

We (DeepInfra) run trtllm-serve in production and want to offer tiered scheduling: latency-sensitive traffic gets higher priority, batch/background traffic gets lower priority, on the same deployment. This only needs a small frontend change: accept an optional priority: float extra param on ChatCompletionRequest / CompletionRequest and forward it to generate_async.

We validated the underlying scheduler behavior on a 4xGB300 deployment at 160-way saturation: requests sent with priority 1.0 saw p50 TTFT of 0.947s vs 2.857s for priority 0.0, while an equal-priority control run showed no gap.

I have a small PR ready (~25 lines, two files in tensorrt_llm/serve/) and will link it here.

Alternatives

  • Running separate deployments per traffic tier — wastes capacity and loses KV-cache sharing.
  • Mapping the OpenAI service_tier field to priority — rejected because service_tier has OpenAI-specific billing semantics; a dedicated numeric extra param matches the existing LLM API surface (priority in [0,1]) exactly. Other inference servers (e.g. vLLM) expose the same concept as a priority extra param in their OpenAI frontends.

Additional context

Priority is only honored when the engine is launched with waiting_queue_policy: priority; under the default FCFS policy the field is accepted and ignored, so the change is backward-compatible.

Metadata

Metadata

Assignees

No one assigned

    Labels

    LLM API<NV>High-level LLM Python API & tools (e.g., trtllm-llmapi-launch) for TRTLLM inference/workflows.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions