
[None][perf] enable paged context for skip-softmax fp8 kv #15442

Draft
bobboli wants to merge 1 commit into NVIDIA:main from bobboli:lbo/skip-softmax-fp8-paged-context



@bobboli (Collaborator) commented Jun 17, 2026

Summary

  • Force paged-context FMHA whenever skip-softmax attention is used with an FP8 KV cache.
  • This keeps context-phase QKV on the paged-context preprocessing path, so FP8 KV-cache skip-softmax dispatch no longer falls back to the BF16 packed-context path.
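A minimal sketch of the dispatch rule described above, assuming a simplified configuration object. `AttentionConfig`, `resolve_paged_context_fmha`, and all field names are hypothetical illustrations, not the actual identifiers used in `tensorrt_llm/_torch/attention_backend/trtllm.py`.

```python
from dataclasses import dataclass


@dataclass
class AttentionConfig:
    """Hypothetical stand-in for the settings the attention backend inspects."""
    use_skip_softmax: bool        # skip-softmax attention enabled
    kv_cache_dtype: str           # e.g. "fp8" or "bf16" (assumed string encoding)
    use_paged_context_fmha: bool  # paged-context FMHA as otherwise configured


def resolve_paged_context_fmha(cfg: AttentionConfig) -> bool:
    """Decide whether the context phase should use paged-context FMHA.

    Sketch only: when skip-softmax attention is combined with an FP8 KV
    cache, paged-context FMHA is forced on so that context QKV goes through
    the paged-context preprocessing path rather than the BF16 packed-context
    path. In every other case the configured value is kept unchanged.
    """
    if cfg.use_skip_softmax and cfg.kv_cache_dtype == "fp8":
        return True
    return cfg.use_paged_context_fmha


if __name__ == "__main__":
    # Skip-softmax + FP8 KV cache: forced on even though it was disabled.
    cfg = AttentionConfig(use_skip_softmax=True, kv_cache_dtype="fp8",
                          use_paged_context_fmha=False)
    print(resolve_paged_context_fmha(cfg))  # True
```

Overriding the flag only for the skip-softmax plus FP8 KV-cache combination leaves every other configuration's behavior unchanged.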

Testing

  • python3 -m py_compile tensorrt_llm/_torch/attention_backend/trtllm.py
  • git diff --check HEAD^ HEAD

Not run

  • pytest/runtime verification in tekit4: the local Python environment is missing pluggy and torch, and TensorRT-LLM was not built in this checkout.
  • Full-file ruff on trtllm.py reports pre-existing style violations unrelated to this change.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
@bobboli force-pushed the lbo/skip-softmax-fp8-paged-context branch from 2d81eae to b93a812 on June 17, 2026 08:41
