[Feature]: Add TRTLLM-gen FMHA Dense paged GQA generation cubins for P16

### 🚀 The feature, motivation and pitch

FlashInfer is adding non-causal TRTLLM-gen paged GQA decode (flashinfer-ai/flashinfer#3629). That path selects TRTLLM-gen FMHA generation kernels with the PagedKv layout and the Dense mask. For page size 16, TensorRT-LLM does not currently ship matching precompiled FMHA cubins or Dense metadata rows.

Requested coverage:
- qkvLayout = PagedKv
- maskType = Dense
- kernelType = Generation
- numTokensPerPage = 16
- headDimQk = headDimV in {64, 128, 256}
- tileSizeQ in {8, 16}
- tileSizeKv = 128
- Blackwell SM100/SM103, matching the existing TRTLLM-gen FMHA cubin set
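
For reference, the requested matrix above can be enumerated with a short sketch. The `FmhaKey` class and its field names are hypothetical and only mirror the parameter names in the list; they are not the actual TensorRT-LLM metadata structure. Data types and the SM100/SM103 split are not enumerated here.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class FmhaKey:
    """Hypothetical selection key mirroring the parameters listed above."""
    qkv_layout: str
    mask_type: str
    kernel_type: str
    num_tokens_per_page: int
    head_dim_qk: int
    head_dim_v: int
    tile_size_q: int
    tile_size_kv: int


def requested_keys():
    """Enumerate the requested P16 Dense generation combinations."""
    return [
        # headDimQk and headDimV are always equal in the requested set.
        FmhaKey("PagedKv", "Dense", "Generation", 16, d, d, tq, 128)
        for d, tq in product((64, 128, 256), (8, 16))
    ]


keys = requested_keys()
for key in keys:
    print(key)
```

Under these assumptions the list yields six combinations: three head dimensions times two Q tile sizes.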

Why P16 matters:
- numTokensPerPage is selected from the physical paged KV cache layout.
- Changing the downstream test to page size 32 selects a different FMHA metadata key and does not validate the P16 runtime path.
- FlashInfer supports page size 16 paged KV caches in its public decode path. Non-causal GQA should not require changing the cache layout only to match the currently shipped cubin set.
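
A minimal sketch of why a P32 test does not cover P16, assuming the kernel lookup is keyed on `numTokensPerPage` as described above. The table contents and the `has_kernel` helper are illustrative only and do not reflect the real metadata or any TensorRT-LLM API.

```python
# Illustrative table holding a single page-size-32 entry.
# Key layout (hypothetical): (qkvLayout, maskType, kernelType, numTokensPerPage)
available = {("PagedKv", "Dense", "Generation", 32)}


def has_kernel(num_tokens_per_page: int) -> bool:
    """Return True if the illustrative table holds an entry for this page size."""
    key = ("PagedKv", "Dense", "Generation", num_tokens_per_page)
    return key in available


print(has_kernel(32))  # the P32 key is found
print(has_kernel(16))  # the P16 key is a different entry and is missing
```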

There is also a metadata issue for existing Dense-named P32 generation entries: the function names contain PagedKvDense, but the rows are indexed with maskType=Causal. Please index those rows as Dense when refreshing the metadata.
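
The mismatch could be caught with a simple consistency check along these lines. This is a sketch under assumptions: the row format, the `function_name` and `mask_type` field names, and the example function names are made up for illustration; only the `PagedKvDense` substring comes from the observation above.

```python
def mismatched_rows(rows):
    """Return rows whose name says PagedKvDense but whose maskType is not Dense."""
    return [
        row for row in rows
        if "PagedKvDense" in row["function_name"] and row["mask_type"] != "Dense"
    ]


# Hypothetical rows; the real metadata format and names will differ.
rows = [
    {"function_name": "example_PagedKvDense_P32", "mask_type": "Causal"},
    {"function_name": "example_PagedKvCausal_P32", "mask_type": "Causal"},
    {"function_name": "example_PagedKvDense_P16", "mask_type": "Dense"},
]

bad = mismatched_rows(rows)
for row in bad:
    print(row["function_name"], "is indexed as", row["mask_type"])
```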

### Alternatives

FlashInfer can keep the new non-causal GQA tests as xfail, but it cannot remove the xfail or validate the Dense runtime path until TensorRT-LLM ships the P16 Dense cubins and metadata.

Changing the downstream tests to page size 32 would only test a different runtime shape and would not validate the P16 path used by the reported case.

### Additional context




### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and checked the [documentation](https://nvidia.github.io/TensorRT-LLM/) and [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for answers to frequently asked questions.

[Feature]: Add TRTLLM-gen FMHA Dense paged GQA generation cubins for P16 #15339
