[Bug]: XQA multi_block_mode crashes with CUDA_ERROR_INVALID_VALUE under concurrent inference (v1.0.0) #15537

### System Info

- Architecture: x86_64
- GPU: 2 × NVIDIA GeForce RTX 5090
- TensorRT-LLM version: 1.0.0 (Engine version 1.0.0 confirmed in log)
- TensorRT: 10.x (libnvinfer.so.10)
- OS: Ubuntu 24.04
- Python: 3.12

### Who can help?

_No response_

### Information

- [x] The official example scripts
- [x] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)

### Reproduction

**Engine build command:**
```
trtllm-build --checkpoint_dir $trtllm_checkpoint_dir \
             --output_dir $trtllm_engine_dir \
             --max_prompt_embedding_table_size 12288 \
             --max_batch_size 16 \
             --max_beam_width 4 \
             --gemm_plugin fp8 \
             --gather_generation_logits \
             --logits_dtype float32
```

**Runtime configuration:**

- Paged KV cache: enabled (default)
- multi_block_mode: not specified → auto set to true
- batching_strategy: inflight_fused_batching
- max_sequence_len: 32768
- max_num_tokens: 8192
- GPU device ids: 0;1 (DP=2)
- KV cache: 7.45 GiB allocated (139424 tokens, 32 tokens/block)
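To put the numbers above in context, here is a rough back-of-the-envelope sketch of the KV cache budget. It uses only the values listed in the runtime configuration; the function name is made up for illustration, and the calculation deliberately ignores beam search (`max_beam_width 4`) and any block reuse, so treat it as an approximation rather than what the runtime actually schedules.

```python
def kv_cache_budget(cache_tokens, tokens_per_block, max_sequence_len, max_batch_size):
    """Rough KV cache arithmetic; ignores beams and block reuse (simplifying assumption)."""
    total_blocks = cache_tokens // tokens_per_block
    # Ceiling division: blocks needed by one sequence at the maximum length.
    blocks_per_max_seq = -(-max_sequence_len // tokens_per_block)
    return {
        "total_blocks": total_blocks,
        "blocks_per_max_len_sequence": blocks_per_max_seq,
        # How many sequences could sit at max_sequence_len at the same time.
        "max_len_sequences_that_fit": total_blocks // blocks_per_max_seq,
        # Average tokens available per sequence when the batch is full.
        "avg_tokens_per_sequence_at_full_batch": cache_tokens // max_batch_size,
    }


# Values copied from the runtime configuration in this report.
budget = kv_cache_budget(
    cache_tokens=139424,
    tokens_per_block=32,
    max_sequence_len=32768,
    max_batch_size=16,
)
for name, value in budget.items():
    print(f"{name}: {value}")
```

Under these assumptions the cache holds 4357 blocks, so only about four sequences could reach `max_sequence_len` simultaneously, and a full batch of 16 averages roughly 8.7k tokens per sequence.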

**Serving:**

`tritonserver --model-repository=$model_repo/ --http-port 45000 --grpc-port 45001`

**Trigger condition:** Send multiple concurrent inference requests. The crash was observed after roughly 5,500 lines of successful inference logs, while several requests were being processed simultaneously under in-flight batching.
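A minimal load-generator sketch for the trigger condition, using only the Python standard library. Several details are assumptions that are not stated in this report: the model name `ensemble`, the use of Triton's HTTP generate endpoint, and the `text_input` / `max_tokens` field names. Adjust them to match the deployed model repository. Only the port (45000) comes from the serving command above.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# ASSUMPTION: model name "ensemble" and the generate endpoint; only the port
# is taken from the tritonserver command in this report.
URL = "http://localhost:45000/v2/models/ensemble/generate"


def build_payload(i):
    # ASSUMPTION: "text_input" and "max_tokens" are the model's input names.
    return {"text_input": f"Request {i}: write a short story.", "max_tokens": 256}


def send_http(payload, url=URL, timeout=300):
    """POST one JSON request and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def run_load(num_requests, concurrency, send=send_http):
    """Issue num_requests payloads with up to `concurrency` in flight at once."""
    payloads = [build_payload(i) for i in range(num_requests)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map preserves input order and re-raises the first failure.
        return list(pool.map(send, payloads))
```

Calling `run_load(num_requests=64, concurrency=16)` against a running server keeps up to 16 requests in flight, matching `max_batch_size`. The `send` parameter is injectable so the scheduling logic can be exercised without a server.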

### Expected behavior

The model should handle concurrent requests up to `max_batch_size=16` without crashing, as documented.

### actual behavior

The server crashes with `CUDA_ERROR_INVALID_VALUE` inside the XQA JIT kernel launch path when `multi_block_mode=true` (default) under concurrent requests.

**Error message:**

```
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] CUDA driver error in mDriver->cuLaunchKernelEx(&cfg, kernel(), kernelParams, nullptr): CUDA_ERROR_INVALID_VALUE: invalid argument. (../tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/cubinObj.cpp:143)
```
**Full stack trace:**

```
1  0x7fd5cfaddd54 libtensorrt_llm.so(+0x1d5fd54)
2  0x7fd5cff5c71c DecoderXQAImplJIT::runImpl<__half, KVBlockArray>(...) + 2668
3  0x7fd5d01a7779 XqaDispatcher::runImpl<__half, KVBlockArray>(...) + 2201
4  0x7fd5d01a4855 XqaDispatcher::run(...) + 101
5  0x7fd5cfedfb84 AttentionOp::enqueueGeneration<__half, KVBlockArray>(...) + 3988
6  0x7fd59dabf3c0 libnvinfer_plugin_tensorrt_llm.so(+0x3b03c0)
7  0x7fd59dac1443 libnvinfer_plugin_tensorrt_llm.so(+0x3b2443)
8  0x7fd59daa1836 GPTAttentionPlugin::enqueue(...) + 694
9  0x7fe29d9507b5 libnvinfer.so.10(+0x1a877b5)
10 0x7fe29d8a9a58 libnvinfer.so.10(+0x19e0a58)
11 0x7fe29d8ab92a libnvinfer.so.10(+0x19e292a)
12 0x7fd5d0156da0 TllmRuntime::executeContext(...) + 64
13 0x7fd5d0a90179 TrtGptModelInflightBatching::executeContext(...) + 121
14 0x7fd5d0a93fb4 TrtGptModelInflightBatching::executeStep(...) + 1748
15 0x7fd5d0a9477e TrtGptModelInflightBatching::executeBatch(...) + 222
16 0x7fd5d0aa35c7 TrtGptModelInflightBatching::forwardAsync(...) + 2135
17 0x7fd5d0bc143c Executor::Impl::forwardAsync(...) + 444
18 0x7fd5d0bc879e Executor::Impl::executionLoop() + 1550
Signal (6) received.
```
**Workaround:**

Setting `multi_block_mode=false` in the Triton config prevents the crash, at the cost of the expected throughput degradation for decode-heavy workloads.
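For reference, the sketch below renders the workaround in the generic `parameters` form used by Triton `config.pbtxt` files. The `render_parameter` helper is hypothetical, and where exactly the stanza belongs (for example, in the TensorRT-LLM backend model's `config.pbtxt`) is an assumption; check it against the model repository in use.

```python
def render_parameter(key, value):
    """Render one Triton config.pbtxt `parameters` entry as text.

    Hypothetical helper for illustration; it only formats a string.
    """
    return (
        "parameters: {\n"
        f'  key: "{key}"\n'
        "  value: {\n"
        f'    string_value: "{value}"\n'
        "  }\n"
        "}\n"
    )


# The key name is taken from this report; the value disables multi-block mode.
stanza = render_parameter("multi_block_mode", "false")
print(stanza)
```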

### additional notes

- This is similar to #1256 (XQA crash under concurrency in v0.8.0), but with a different root cause:
  - #1256 was a prepare/dispatch ordering race condition in `decoderXQARunner.h:177` (fixed).
  - This bug is in the JIT kernel launch path (`decoderXQAImplJIT/cubinObj.cpp:143`), which appears to be new code introduced after v0.8.0.
- The crash occurs specifically in `DecoderXQAImplJIT::runImpl<__half, KVBlockArray>`, which suggests that invalid kernel parameters are being passed to `cuLaunchKernelEx` when paged KV cache (`KVBlockArray`) is used with multi-block mode under concurrent in-flight batched requests.
- The crash is non-deterministic: it depends on the timing of concurrent request scheduling.
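This report does not establish which launch parameter is invalid. As an illustration of the kind of condition that makes a kernel launch fail with an invalid-argument error, the sketch below checks a launch configuration against the documented grid and block dimension limits for recent compute capabilities. The `LaunchConfig` type and the example values are hypothetical and are not taken from the TensorRT-LLM source.

```python
from dataclasses import dataclass

# Documented per-dimension limits for recent compute capabilities.
MAX_GRID = (2**31 - 1, 65535, 65535)
MAX_BLOCK = (1024, 1024, 64)
MAX_THREADS_PER_BLOCK = 1024


@dataclass
class LaunchConfig:
    """Hypothetical stand-in for the grid/block part of a launch config."""
    grid: tuple
    block: tuple


def launch_config_problems(cfg):
    """Return human-readable reasons why cfg would be rejected at launch."""
    problems = []
    for name, dims, limits in (("grid", cfg.grid, MAX_GRID), ("block", cfg.block, MAX_BLOCK)):
        for axis, dim, limit in zip("xyz", dims, limits):
            if dim < 1:
                problems.append(f"{name}.{axis} is {dim}; every dimension must be >= 1")
            elif dim > limit:
                problems.append(f"{name}.{axis} is {dim}; the limit is {limit}")
    threads = cfg.block[0] * cfg.block[1] * cfg.block[2]
    if threads > MAX_THREADS_PER_BLOCK:
        problems.append(f"block has {threads} threads; the limit is {MAX_THREADS_PER_BLOCK}")
    return problems


# Example: a grid dimension that collapses to zero is rejected.
print(launch_config_problems(LaunchConfig(grid=(0, 8, 1), block=(128, 1, 1))))
```

If multi-block mode derives a grid dimension from per-request state, logging the computed dimensions just before the launch would confirm or rule out this class of failure.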

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and checked the [documentation](https://nvidia.github.io/TensorRT-LLM/) and [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for answers to frequently asked questions.
