[Feature]: Add TRTLLM-gen FMHA Dense paged GQA generation cubins for P16 #15339

Description

@elwhyjay

🚀 The feature, motivation and pitch

FlashInfer is adding non-causal TRTLLM-gen paged GQA decode (flashinfer-ai/flashinfer#3629). That path selects the TRTLLM-gen FMHA generation kernel with the PagedKv layout and the Dense mask type. For page size 16, TensorRT-LLM does not currently ship matching precompiled FMHA cubins or the corresponding Dense metadata rows.

Requested coverage:

  • qkvLayout = PagedKv
  • maskType = Dense
  • kernelType = Generation
  • numTokensPerPage = 16
  • headDimQk = headDimV in {64, 128, 256}
  • tileSizeQ in {8, 16}
  • tileSizeKv = 128
  • Blackwell SM100/SM103, matching the existing TRTLLM-gen FMHA cubin set
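
The requested coverage is a Cartesian product of the values listed above. A minimal sketch that enumerates it, assuming a hypothetical `FmhaKernelKey` record whose fields mirror the list rather than TensorRT-LLM's real metadata schema. Data types are omitted because the request does not specify them, so the count covers head dimensions, Q tile sizes, and architectures only:

```python
from dataclasses import dataclass
from itertools import product


# Hypothetical key type: field names mirror the issue text, not TensorRT-LLM's
# actual metadata structure. Data types are intentionally left out.
@dataclass(frozen=True)
class FmhaKernelKey:
    qkv_layout: str
    mask_type: str
    kernel_type: str
    num_tokens_per_page: int
    head_dim_qk: int
    head_dim_v: int
    tile_size_q: int
    tile_size_kv: int
    sm: int


def requested_p16_dense_keys() -> list[FmhaKernelKey]:
    """Enumerate every combination listed under 'Requested coverage'."""
    keys = []
    for sm, head_dim, tile_q in product((100, 103), (64, 128, 256), (8, 16)):
        keys.append(
            FmhaKernelKey(
                qkv_layout="PagedKv",
                mask_type="Dense",
                kernel_type="Generation",
                num_tokens_per_page=16,
                head_dim_qk=head_dim,
                head_dim_v=head_dim,  # headDimQk == headDimV
                tile_size_q=tile_q,
                tile_size_kv=128,
                sm=sm,
            )
        )
    return keys


keys = requested_p16_dense_keys()
print(len(keys))  # 12 = 2 architectures x 3 head dims x 2 Q tile sizes
```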

Why P16 matters:

  • numTokensPerPage is selected from the physical paged KV cache layout.
  • Changing the downstream test to page size 32 selects a different FMHA metadata key and does not validate the P16 runtime path.
  • FlashInfer supports page-size-16 paged KV caches in its public decode path. Non-causal GQA should not require changing the cache layout just to match the currently shipped cubin set.
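
Because numTokensPerPage is part of the lookup key, a P16 cache cannot fall back to a P32 entry. A minimal sketch of that behavior, assuming a hypothetical, simplified `MetaKey` and a hypothetical `SHIPPED` table that contains a correctly indexed P32 Dense row but no P16 row; the kernel name is illustrative, not a real cubin function name:

```python
from typing import NamedTuple, Optional


# Hypothetical, simplified metadata key: only the fields relevant to this issue.
class MetaKey(NamedTuple):
    qkv_layout: str
    mask_type: str
    num_tokens_per_page: int
    head_dim: int


# Hypothetical table standing in for shipped metadata: one P32 Dense row
# (assumed correctly indexed as Dense), and no P16 Dense row.
SHIPPED = {
    MetaKey("PagedKv", "Dense", 32, 128): "illustrative_p32_dense_kernel",
}


def select_kernel(page_size: int, head_dim: int) -> Optional[str]:
    # numTokensPerPage is taken from the physical KV cache page size, so the
    # caller cannot choose another value without changing the cache layout.
    key = MetaKey("PagedKv", "Dense", page_size, head_dim)
    return SHIPPED.get(key)


print(select_kernel(16, 128))  # None: no P16 Dense row exists
print(select_kernel(32, 128))  # a hit, but it exercises the P32 path only
```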

There is also a metadata issue for existing Dense-named P32 generation entries: the function names contain PagedKvDense, but the rows are indexed with maskType=Causal. Please index those rows as Dense when refreshing the metadata.
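
The mislabeling can be detected mechanically when the metadata is refreshed. A minimal sketch, assuming a hypothetical `MetaRow` record with only a function name and a mask type; the names below are illustrative and are not actual cubin function names:

```python
from dataclasses import dataclass


# Hypothetical metadata row; real TensorRT-LLM rows carry many more fields.
@dataclass
class MetaRow:
    function_name: str
    mask_type: str


def mislabeled_dense_rows(rows: list[MetaRow]) -> list[MetaRow]:
    """Return rows whose function name says Dense but whose maskType does not."""
    return [
        r for r in rows
        if "PagedKvDense" in r.function_name and r.mask_type != "Dense"
    ]


rows = [
    # Illustrative names only.
    MetaRow("fmhaKernel_PagedKvDense_P32_example", "Causal"),  # mislabeled
    MetaRow("fmhaKernel_PagedKvCausal_P32_example", "Causal"),  # consistent
    MetaRow("fmhaKernel_PagedKvDense_P32_fixed_example", "Dense"),  # consistent
]
for row in mislabeled_dense_rows(rows):
    print(row.function_name)
```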

Alternatives

FlashInfer can keep the new non-causal GQA tests as xfail, but it cannot remove the xfail or validate the Dense runtime path until TensorRT-LLM ships the P16 Dense cubins and metadata.

Changing the downstream tests to page size 32 would only test a different runtime shape and would not validate the P16 path used by the reported case.

Additional context

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

  • Customized kernels<NV>: Specialized/modified CUDA kernels in TRTLLM for LLM ops, beyond standard TRT. Dev & perf.
  • feature request: New feature or request. This includes new model, dtype, functionality support.
