Trigger condition: Send multiple concurrent inference requests (observed crash after ~5500 lines of successful inference logs, when multiple requests are being processed simultaneously in in-flight batching mode).
Expected behavior
The model should handle concurrent requests up to max_batch_size=16 without crashing, as documented.
actual behavior
The server crashes with CUDA_ERROR_INVALID_VALUE inside the XQA JIT kernel launch path when multi_block_mode=true (default) under concurrent requests.
Error message:
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] CUDA driver error in mDriver->cuLaunchKernelEx(&cfg, kernel(), kernelParams, nullptr): CUDA_ERROR_INVALID_VALUE: invalid argument. (../tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/cubinObj.cpp:143)
Workaround
Setting multi_block_mode=false in the Triton config prevents the crash, but with expected throughput degradation for decode-heavy workloads.
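For anyone applying the workaround by script, here is a minimal sketch of flipping the setting in a Triton `config.pbtxt`. It assumes the option is stored as a `parameters` block with a `string_value`, as in the `SAMPLE` text below. The helper name `set_string_parameter` and the sample layout are illustrative and were not taken from the report, so check them against your own model repository.

```python
import re

# Assumed layout of the relevant block in config.pbtxt (illustrative only).
SAMPLE = '''parameters: {
  key: "multi_block_mode"
  value: {
    string_value: "true"
  }
}
'''


def set_string_parameter(config_text, key, value):
    """Return config_text with the string_value of parameter `key` replaced.

    Raises KeyError if no matching `key: "<key>" value: { string_value: ... }`
    block is found, so a silent no-op cannot be mistaken for success.
    """
    pattern = re.compile(
        r'(key:\s*"' + re.escape(key) + r'"\s*value:\s*\{\s*string_value:\s*")'
        r'[^"]*(")'
    )
    # A function replacement avoids backslash handling in `value`.
    new_text, count = pattern.subn(
        lambda m: m.group(1) + value + m.group(2), config_text
    )
    if count == 0:
        raise KeyError(f"parameter {key!r} not found")
    return new_text


patched = set_string_parameter(SAMPLE, "multi_block_mode", "false")
```

The server has to be restarted after the file is rewritten so that the model is reloaded with the new value.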
This bug is in the JIT kernel launch path (decoderXQAImplJIT/cubinObj.cpp:143), which appears to be new code introduced after v0.8.0.
The crash occurs specifically in DecoderXQAImplJIT::runImpl<__half, KVBlockArray>, suggesting that invalid kernel parameters are passed to cuLaunchKernelEx when paged KV cache (KVBlockArray) is used with multi-block mode under concurrent in-flight batched requests.
The crash is non-deterministic: it depends on the timing of concurrent request scheduling.
Before submitting a new issue...
Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.
System Info
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Engine build command:
Runtime configuration:
Serving:
tritonserver --model-repository=$model_repo/ --http-port 45000 --grpc-port 45001
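To generate the kind of concurrent load that triggers the crash, something along these lines could be used against the HTTP port above. This is a rough sketch, not the client from the original run: the model name `ensemble`, the generate-endpoint payload fields (`text_input`, `max_tokens`), and the helper names are assumptions to adapt to your deployment. The default concurrency of 16 matches the max_batch_size from the report.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: the model is served as "ensemble" and accepts Triton's
# generate-extension JSON payload on the HTTP port used above.
URL = "http://localhost:45000/v2/models/ensemble/generate"


def post_json(payload, url=URL, timeout=120):
    """POST one JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_load(prompts, send=post_json, concurrency=16, max_tokens=128):
    """Send all prompts with up to `concurrency` requests in flight.

    Returns (results, errors): results[i] is the response for prompts[i],
    or None if that request failed; errors holds (index, exception) pairs.
    """
    results = [None] * len(prompts)
    errors = []
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [
            pool.submit(send, {"text_input": p, "max_tokens": max_tokens})
            for p in prompts
        ]
        for i, fut in enumerate(futures):
            try:
                results[i] = fut.result()
            except Exception as exc:  # e.g. connection reset once the server dies
                errors.append((i, exc))
    return results, errors
```

Calling `run_load(prompts)` with a few hundred prompts keeps up to 16 requests in flight. Because the crash is timing-dependent, several runs may be needed before the failure shows up, at which point the remaining requests land in `errors`.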
additional notes
Issue #1256 ("Code Llama 70b triton crashes with XQA") was a prepare/dispatch ordering race condition in decoderXQARunner.h:177 (fixed).