[Bug]: XQA multi_block_mode crashes with CUDA_ERROR_INVALID_VALUE under concurrent inference (v1.0.0) #15537

Description

@xuxiongjie272

System Info

  • Architecture: x86_64
  • GPU: 5090 × 2
  • TensorRT-LLM version: 1.0.0 (Engine version 1.0.0 confirmed in log)
  • TensorRT: 10.x (libnvinfer.so.10)
  • OS: Ubuntu 24.04
  • Python: 3.12

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Engine build command:

trtllm-build --checkpoint_dir $trtllm_checkpoint_dir \
             --output_dir $trtllm_engine_dir \
             --max_prompt_embedding_table_size 12288 \
             --max_batch_size 16 \
             --max_beam_width 4 \
             --gemm_plugin fp8 \
             --gather_generation_logits \
             --logits_dtype float32

Runtime configuration:

  • Paged KV cache: enabled (default)
  • multi_block_mode: not specified → auto set to true
  • batching_strategy: inflight_fused_batching
  • max_sequence_len: 32768
  • max_num_tokens: 8192
  • GPU device ids: 0;1 (DP=2)
  • KV cache: 7.45 GiB allocated (139424 tokens, 32 tokens/block)
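
For scale, the KV cache figures above can be turned into a block count and a rough per-token footprint. This is only arithmetic on the numbers quoted in the report; the 7.45 GiB value is rounded in the log, so the per-token size is approximate, and the report does not say whether the allocation is per GPU or in total.

```python
# Figures copied from the runtime configuration listed in the report.
KV_CACHE_GIB = 7.45          # rounded in the log, so derived sizes are approximate
KV_CACHE_TOKENS = 139424
TOKENS_PER_BLOCK = 32
MAX_SEQUENCE_LEN = 32768

# Number of paged KV cache blocks backing the pool.
num_blocks = KV_CACHE_TOKENS // TOKENS_PER_BLOCK

# Approximate KV cache footprint of one token.
bytes_per_token = KV_CACHE_GIB * 1024**3 / KV_CACHE_TOKENS
kib_per_token = bytes_per_token / 1024

# How many sequences of max_sequence_len the pool could hold at once.
full_len_sequences = KV_CACHE_TOKENS / MAX_SEQUENCE_LEN

print(f"blocks: {num_blocks}")
print(f"approx. KiB per token: {kib_per_token:.1f}")
print(f"full-length sequences that fit: {full_len_sequences:.2f}")
```

The pool divides evenly into 32-token blocks, and it holds only a few sequences at the full max_sequence_len, which is context for how the scheduler has to share blocks across up to 16 in-flight requests.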

Serving:

tritonserver --model-repository=$model_repo/ --http-port 45000 --grpc-port 45001

Trigger condition: send multiple concurrent inference requests. The crash was observed after about 5,500 lines of successful inference logs, while several requests were being processed simultaneously in in-flight batching mode.
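
A minimal sketch of a client that generates this kind of concurrent load is shown here. It is not the script used in the report. It assumes the model is exposed through Triton's HTTP generate endpoint under the name `ensemble` and accepts `text_input` and `max_tokens` fields, as in the stock tensorrtllm_backend layout; the model name, field names, prompt contents, and helper names are all hypothetical and need to be adapted to the actual deployment (which, given the build flags, may also need prompt-embedding and beam-width inputs).

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:45000"  # matches --http-port in the serving command
MODEL_NAME = "ensemble"              # assumption: not stated in the report


def build_payload(prompt, max_tokens=256):
    """Build a request body; the field names are assumptions, see the lead-in."""
    return {"text_input": prompt, "max_tokens": max_tokens}


def send_http(payload, timeout=300.0):
    """POST one request to the Triton generate endpoint and return the HTTP status."""
    url = f"{BASE_URL}/v2/models/{MODEL_NAME}/generate"
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def run_concurrent(prompts, send=send_http, workers=16):
    """Send all prompts through a thread pool; results keep the prompt order."""
    def attempt(prompt):
        try:
            return ("ok", send(build_payload(prompt)))
        except Exception as exc:  # a crashed server surfaces as a connection error
            return ("error", repr(exc))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(attempt, prompts))
```

Calling `run_concurrent([f"prompt {i}" for i in range(200)])` keeps up to 16 requests in flight, matching max_batch_size. Because the crash is timing dependent, several runs with long generations may be needed before the server aborts.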

Expected behavior

The model should handle concurrent requests up to max_batch_size=16 without crashing, as documented.

Actual behavior

The server crashes with CUDA_ERROR_INVALID_VALUE inside the XQA JIT kernel launch path when multi_block_mode=true (default) under concurrent requests.

Error message:

terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] CUDA driver error in mDriver->cuLaunchKernelEx(&cfg, kernel(), kernelParams, nullptr): CUDA_ERROR_INVALID_VALUE: invalid argument. (../tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/cubinObj.cpp:143)

Full stack trace:

1  0x7fd5cfaddd54 libtensorrt_llm.so(+0x1d5fd54)
2  0x7fd5cff5c71c DecoderXQAImplJIT::runImpl<__half, KVBlockArray>(...) + 2668
3  0x7fd5d01a7779 XqaDispatcher::runImpl<__half, KVBlockArray>(...) + 2201
4  0x7fd5d01a4855 XqaDispatcher::run(...) + 101
5  0x7fd5cfedfb84 AttentionOp::enqueueGeneration<__half, KVBlockArray>(...) + 3988
6  0x7fd59dabf3c0 libnvinfer_plugin_tensorrt_llm.so(+0x3b03c0)
7  0x7fd59dac1443 libnvinfer_plugin_tensorrt_llm.so(+0x3b2443)
8  0x7fd59daa1836 GPTAttentionPlugin::enqueue(...) + 694
9  0x7fe29d9507b5 libnvinfer.so.10(+0x1a877b5)
10 0x7fe29d8a9a58 libnvinfer.so.10(+0x19e0a58)
11 0x7fe29d8ab92a libnvinfer.so.10(+0x19e292a)
12 0x7fd5d0156da0 TllmRuntime::executeContext(...) + 64
13 0x7fd5d0a90179 TrtGptModelInflightBatching::executeContext(...) + 121
14 0x7fd5d0a93fb4 TrtGptModelInflightBatching::executeStep(...) + 1748
15 0x7fd5d0a9477e TrtGptModelInflightBatching::executeBatch(...) + 222
16 0x7fd5d0aa35c7 TrtGptModelInflightBatching::forwardAsync(...) + 2135
17 0x7fd5d0bc143c Executor::Impl::forwardAsync(...) + 444
18 0x7fd5d0bc879e Executor::Impl::executionLoop() + 1550
Signal (6) received.

Workaround

Setting multi_block_mode=false in the Triton config prevents the crash, at the cost of the throughput degradation expected for decode-heavy workloads.
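
The workaround can be applied by editing the model's config.pbtxt by hand or with a small helper like the sketch below. The report does not include its config file, so the parameter layout in `SAMPLE_CONFIG` is an assumption based on the usual `parameters` block format in Triton model configs, and `set_string_parameter` is a hypothetical helper, not part of any TensorRT-LLM tooling.

```python
import re

# Assumed layout of the parameter in the tensorrt_llm model's config.pbtxt.
SAMPLE_CONFIG = '''
parameters: {
  key: "multi_block_mode"
  value: {
    string_value: "true"
  }
}
'''


def set_string_parameter(config_text, key, value):
    """Return config_text with the string_value of parameter `key` replaced."""
    pattern = re.compile(
        r'(key:\s*"' + re.escape(key) + r'"\s*value:\s*\{\s*string_value:\s*")[^"]*"'
    )
    # A function replacement avoids backslash interpretation in `value`.
    patched, count = pattern.subn(lambda m: m.group(1) + value + '"', config_text)
    if count == 0:
        raise KeyError(f"parameter {key!r} not found in config")
    return patched


patched = set_string_parameter(SAMPLE_CONFIG, "multi_block_mode", "false")
print(patched)
```

The helper raises `KeyError` when the parameter is absent rather than appending it, so a config with a different layout fails loudly instead of being silently left unchanged. The server has to be restarted for the change to take effect.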

Additional notes

  • This is similar to #1256, "Code Llama 70b triton crashes with XQA" (an XQA crash under concurrency in v0.8.0), but with a different root cause: #1256 was a prepare/dispatch ordering race condition in decoderXQARunner.h:177 (fixed).
  • This bug is in the JIT kernel launch path (decoderXQAImplJIT/cubinObj.cpp:143), which appears to be new code introduced after v0.8.0
  • The crash occurs specifically in DecoderXQAImplJIT::runImpl<__half, KVBlockArray>, suggesting invalid kernel parameters are being passed to cuLaunchKernelEx when paged KV cache (KVBlockArray) is used with multi-block mode under concurrent in-flight batched requests.
  • The crash is non-deterministic: it depends on the timing of concurrent request scheduling.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

  • Customized kernels<NV>: Specialized/modified CUDA kernels in TRTLLM for LLM ops, beyond standard TRT. Dev & perf.
  • Inference runtime<NV>: General operational aspects of TRTLLM execution not in other categories.
  • bug: Something isn't working
