System Info
- CPU architecture: N/A (code inspection issue)
- GPU: N/A (issue identified through source analysis)
- TensorRT-LLM branch: main
- TensorRT-LLM commit: current main branch at time of investigation
- OS: N/A
Additional information:
- This issue was identified through source-code inspection and call-chain analysis.
- No specific hardware is required to observe the behavior.
- The report concerns state synchronization in the PyTorch executor initialization path.
Who can help?
No response
Reproduction
Summary
While reviewing tensorrt_llm/_torch/pyexecutor/py_executor_creator.py, I noticed that chunked prefill can be disabled through fallback logic, but llm_args.enable_chunked_prefill is never updated to reflect the effective runtime state.
The relevant flow is:
enable_chunked_context = llm_args.enable_chunked_prefill
Later, chunked prefill may be disabled through:
- FLASHINFER_STAR_ATTENTION fallback
- MLA unsupported-SM fallback
For example:
enable_chunked_context = False
and:
model_engine.attn_runtime_features.chunked_prefill = False
However:
llm_args.enable_chunked_prefill
is never updated.
As a result, the runtime state and user-facing configuration can diverge.
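To make the divergence concrete, here is a minimal, self-contained sketch of the pattern. LlmArgs, AttnRuntimeFeatures, and init_with_fallback are hypothetical stand-ins written for this report; they are not the actual TensorRT-LLM classes or the real create_py_executor() code:

```python
from dataclasses import dataclass


@dataclass
class LlmArgs:
    # Hypothetical stand-in for the user-facing arguments object.
    enable_chunked_prefill: bool = False


@dataclass
class AttnRuntimeFeatures:
    # Hypothetical stand-in for model_engine.attn_runtime_features.
    chunked_prefill: bool = False


def init_with_fallback(llm_args, features, fallback_triggered):
    """Mimic the reported flow: copy the flag into a local, then maybe disable it."""
    enable_chunked_context = llm_args.enable_chunked_prefill
    features.chunked_prefill = enable_chunked_context
    if fallback_triggered:
        # Only the local variable and the runtime feature are updated.
        enable_chunked_context = False
        features.chunked_prefill = False
        # llm_args.enable_chunked_prefill is left untouched.
    return enable_chunked_context


args = LlmArgs(enable_chunked_prefill=True)
features = AttnRuntimeFeatures()
effective = init_with_fallback(args, features, fallback_triggered=True)

# The runtime view and the user-facing view now disagree.
print(effective, features.chunked_prefill, args.enable_chunked_prefill)
```

In this sketch the local and the runtime feature end up False while the args object still holds True, which is the mismatch described above.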
Steps to reproduce the behavior
- Enable chunked prefill.
- Trigger a fallback path that disables chunked prefill at runtime (for example, an unsupported MLA SM configuration or FLASHINFER_STAR_ATTENTION).
- Observe that:
  - runtime chunked prefill is disabled
  - llm_args.enable_chunked_prefill remains True
Minimal example
Relevant pattern:
enable_chunked_context = llm_args.enable_chunked_prefill
...
enable_chunked_context = False
model_engine.attn_runtime_features.chunked_prefill = False
# llm_args.enable_chunked_prefill remains unchanged
Expected behavior
When chunked prefill is disabled through fallback logic, the effective runtime state and llm_args.enable_chunked_prefill should remain synchronized.
After initialization:
llm_args.enable_chunked_prefill
should accurately reflect whether chunked prefill is actually enabled.
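One possible way to get there is sketched below, assuming llm_args permits attribute assignment at this point in initialization (I have not verified this for every fallback path). sync_chunked_prefill is a hypothetical helper name, and the SimpleNamespace objects only stand in for the real llm_args and model_engine:

```python
from types import SimpleNamespace


def sync_chunked_prefill(llm_args, model_engine, enable_chunked_context):
    """Write the effective value to both the runtime features and llm_args."""
    model_engine.attn_runtime_features.chunked_prefill = enable_chunked_context
    llm_args.enable_chunked_prefill = enable_chunked_context
    return enable_chunked_context


# Hypothetical stand-ins for the real objects.
llm_args = SimpleNamespace(enable_chunked_prefill=True)
model_engine = SimpleNamespace(
    attn_runtime_features=SimpleNamespace(chunked_prefill=True))

# A fallback path would call the helper instead of setting the flags one by one.
enable_chunked_context = sync_chunked_prefill(llm_args, model_engine, False)
```

Routing both fallback paths through a single write point like this would make it harder for one of the two flags to be missed.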
Actual behavior
The runtime disables chunked prefill through fallback logic, but:
llm_args.enable_chunked_prefill
remains True.
This creates a state mismatch where:
- runtime chunked prefill is disabled
- user-facing configuration still reports chunked prefill as enabled
Downstream validation and feature-status reporting may therefore observe stale state.
Additional notes
This appears to be a synchronization issue rather than an intentional design choice.
Notably, create_py_executor() is already expected to mutate portions of llm_args, and the nearby kv_cache_config.enable_block_reuse logic keeps runtime state and configuration synchronized.
I was unable to find an existing issue or PR tracking this specific enable_chunked_prefill synchronization problem.
Potential areas to investigate:
- FLASHINFER_STAR_ATTENTION fallback path
- MLA unsupported-SM fallback path
- downstream validation that reads llm_args.enable_chunked_prefill
- feature-status reporting based on llm_args
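While investigating these areas, a small consistency check like the following could help confirm whether the two flags agree after initialization. This is only a sketch: chunked_prefill_mismatch is a hypothetical helper, and the SimpleNamespace objects are stand-ins that reproduce the reported post-fallback state rather than real TensorRT-LLM objects:

```python
from types import SimpleNamespace


def chunked_prefill_mismatch(llm_args, model_engine):
    """Return a description of the mismatch, or None if the two flags agree."""
    configured = bool(llm_args.enable_chunked_prefill)
    effective = bool(model_engine.attn_runtime_features.chunked_prefill)
    if configured == effective:
        return None
    return (f"llm_args.enable_chunked_prefill={configured} but "
            f"attn_runtime_features.chunked_prefill={effective}")


# Hypothetical stand-ins reproducing the state after a fallback.
llm_args = SimpleNamespace(enable_chunked_prefill=True)
model_engine = SimpleNamespace(
    attn_runtime_features=SimpleNamespace(chunked_prefill=False))

mismatch = chunked_prefill_mismatch(llm_args, model_engine)
print(mismatch)
```

A check of this shape could be used in a unit test for each fallback path, asserting that it returns None once initialization has finished.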