[Bug]: chunked-prefill fallback does not synchronize llm_args.enable_chunked_prefill #15463

Description

@DhineshPonnarasan

System Info

  • CPU architecture: N/A (code inspection issue)
  • GPU: N/A (issue identified through source analysis)
  • TensorRT-LLM branch: main
  • TensorRT-LLM commit: current main branch at time of investigation
  • OS: N/A

Additional information:

  • This issue was identified through source-code inspection and call-chain analysis.
  • No specific hardware is required to observe the behavior.
  • The report concerns state synchronization in the PyTorch executor initialization path.

Reproduction

Summary

While reviewing tensorrt_llm/_torch/pyexecutor/py_executor_creator.py, I noticed that chunked prefill can be disabled through fallback logic, but llm_args.enable_chunked_prefill is never updated to reflect the effective runtime state.

The relevant flow is:

enable_chunked_context = llm_args.enable_chunked_prefill

Later, chunked prefill may be disabled through:

  • FLASHINFER_STAR_ATTENTION fallback
  • MLA unsupported-SM fallback

For example:

enable_chunked_context = False

and:

model_engine.attn_runtime_features.chunked_prefill = False

However:

llm_args.enable_chunked_prefill

is never updated.

As a result, the runtime state and user-facing configuration can diverge.
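
The divergence described above can be illustrated with a small self-contained sketch. It does not import TensorRT-LLM: `FakeLlmArgs`, `FakeRuntimeFeatures`, `create_executor_sketch`, and the `mla_supported` flag are hypothetical stand-ins that only mimic the reported pattern (a local copy is flipped, the configuration object is not).

```python
from dataclasses import dataclass


@dataclass
class FakeLlmArgs:
    """Hypothetical stand-in for the real llm_args object."""
    enable_chunked_prefill: bool = True


@dataclass
class FakeRuntimeFeatures:
    """Hypothetical stand-in for model_engine.attn_runtime_features."""
    chunked_prefill: bool = True


def create_executor_sketch(llm_args, mla_supported):
    """Mimic the reported flow: the fallback flips a local and the runtime flag only."""
    enable_chunked_context = llm_args.enable_chunked_prefill
    features = FakeRuntimeFeatures(chunked_prefill=enable_chunked_context)

    if enable_chunked_context and not mla_supported:
        # Fallback path: the local variable and the runtime features change ...
        enable_chunked_context = False
        features.chunked_prefill = False
        # ... but llm_args.enable_chunked_prefill is left untouched.

    return features


args = FakeLlmArgs(enable_chunked_prefill=True)
features = create_executor_sketch(args, mla_supported=False)
print(features.chunked_prefill, args.enable_chunked_prefill)  # prints "False True"
```

The two printed values differ, which is the mismatch this report is about.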

Steps to reproduce the behavior

  1. Enable chunked prefill.

  2. Trigger a fallback path that disables chunked prefill at runtime (for example, an unsupported MLA SM configuration or FLASHINFER_STAR_ATTENTION).

  3. Observe that:

    • runtime chunked prefill is disabled
    • llm_args.enable_chunked_prefill remains True

Minimal example

Relevant pattern:

enable_chunked_context = llm_args.enable_chunked_prefill

...

enable_chunked_context = False
model_engine.attn_runtime_features.chunked_prefill = False

# llm_args.enable_chunked_prefill remains unchanged

Expected behavior

When chunked prefill is disabled through fallback logic, the effective runtime state and llm_args.enable_chunked_prefill should remain synchronized.

After initialization:

llm_args.enable_chunked_prefill

should accurately reflect whether chunked prefill is actually enabled.
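
One possible shape for the expected behavior is to write the effective value back to the configuration once the fallback decision has been made. The sketch below is an assumption about how that could look, not the actual fix: `FakeLlmArgs`, `FakeRuntimeFeatures`, `apply_chunked_prefill_fallback`, and the `fallback_triggered` flag are hypothetical names, and whether the real `llm_args` object permits attribute assignment at this point would need to be checked against the source.

```python
from dataclasses import dataclass


@dataclass
class FakeLlmArgs:
    """Hypothetical stand-in for the real llm_args object."""
    enable_chunked_prefill: bool = True


@dataclass
class FakeRuntimeFeatures:
    """Hypothetical stand-in for model_engine.attn_runtime_features."""
    chunked_prefill: bool = True


def apply_chunked_prefill_fallback(llm_args, features, fallback_triggered):
    """Decide the effective setting once, then propagate it to every holder."""
    enable_chunked_context = llm_args.enable_chunked_prefill

    if enable_chunked_context and fallback_triggered:
        # Stands in for the FLASHINFER_STAR_ATTENTION / unsupported-SM MLA paths.
        enable_chunked_context = False

    # Propagate the effective value to the runtime and back to the configuration.
    features.chunked_prefill = enable_chunked_context
    llm_args.enable_chunked_prefill = enable_chunked_context
    return enable_chunked_context


args = FakeLlmArgs(enable_chunked_prefill=True)
features = FakeRuntimeFeatures()
apply_chunked_prefill_fallback(args, features, fallback_triggered=True)
print(features.chunked_prefill, args.enable_chunked_prefill)  # prints "False False"
```

Deciding the value in one place and assigning it to both objects avoids having to remember the write-back in each individual fallback branch.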

Actual behavior

The runtime disables chunked prefill through fallback logic, but:

llm_args.enable_chunked_prefill

remains True.

This creates a state mismatch where:

  • runtime chunked prefill is disabled
  • user-facing configuration still reports chunked prefill as enabled

Downstream validation and feature-status reporting may therefore observe stale state.
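
A regression test for this mismatch could compare the two flags directly after initialization. The helper below is a hypothetical sketch: `chunked_prefill_in_sync` is not an existing TensorRT-LLM function, and the `SimpleNamespace` objects only imitate the attribute names quoted in this report.

```python
from types import SimpleNamespace


def chunked_prefill_in_sync(llm_args, model_engine):
    """Return True when the configuration and the runtime agree on chunked prefill."""
    configured = bool(llm_args.enable_chunked_prefill)
    effective = bool(model_engine.attn_runtime_features.chunked_prefill)
    return configured == effective


# State as described in this report: runtime disabled, configuration still enabled.
llm_args = SimpleNamespace(enable_chunked_prefill=True)
model_engine = SimpleNamespace(
    attn_runtime_features=SimpleNamespace(chunked_prefill=False)
)
print(chunked_prefill_in_sync(llm_args, model_engine))  # prints "False"
```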

Additional notes

This appears to be a synchronization issue rather than an intentional design choice.

Notably, create_py_executor() is already expected to mutate portions of llm_args, and the nearby kv_cache_config.enable_block_reuse logic keeps runtime state and configuration synchronized.

I was unable to find an existing issue or PR tracking this specific enable_chunked_prefill synchronization problem.

Potential areas to investigate:

  • FLASHINFER_STAR_ATTENTION fallback path
  • MLA unsupported-SM fallback path
  • downstream validation that reads llm_args.enable_chunked_prefill
  • feature-status reporting based on llm_args

Labels: Pytorch<NV> (Pytorch backend related issues), bug (Something isn't working)
