[None][fix] Fix encoder-decoder beam search corruption via per-slot fragmentPointerDevice#15461
[None][fix] Fix encoder-decoder beam search corruption via per-slot fragmentPointerDevice#15461achartier wants to merge 8 commits into
Conversation
📝 WalkthroughWalkthroughThree correctness fixes for encoder-decoder beam search: ChangesEncoder-Decoder Beam Search Correctness
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/tests/unit_tests/batch_manager/encDecBeamSearchTest.cpp`:
- Around line 194-198: The current test at the copyGenerationLogits call site
uses a dummy cache and the direct-copy implementation path, which does not
exercise the fragmentPointerDevice slot isolation logic needed to detect race
conditions. Modify or add a test case that exercises the cached-fragment merge
path (using real cache data rather than a dummy cache) with at least two
concurrent requests/slots, and validate that fragmentPointerDevice pointers are
not clobbered across different slot allocations to catch regressions of the
maxBatchSize x kCACHE_LENGTH change.
- Around line 63-70: The test CopyBlockOffsetsAllBeamsShareBeam0Blocks is too
lenient because it relies on the allocator already sharing blocks, making
equality across beams an unreliable check that could pass even with the bug
present. Modify the test to explicitly set per-beam source rows to different
values before calling copyBlockOffsets, then assert that the output rows are all
normalized to beam 0. This forces a discriminating scenario where the old buggy
behavior would fail (keeping per-beam differences) while the fixed behavior
would pass (normalizing all to beam 0). Apply the same principle to the related
test mentioned around lines 115-140.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d6e2d2c4-bf3b-49ae-964b-984a43ebb5dc
📒 Files selected for processing (7)
cpp/include/tensorrt_llm/batch_manager/runtimeBuffers.hcpp/tensorrt_llm/batch_manager/kvCacheManager.cppcpp/tensorrt_llm/batch_manager/runtimeBuffers.cppcpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cppcpp/tensorrt_llm/batch_manager/utils/inflightBatchingUtils.cppcpp/tests/unit_tests/batch_manager/CMakeLists.txtcpp/tests/unit_tests/batch_manager/encDecBeamSearchTest.cpp
|
/bot run |
|
PR_Github #54871 [ run ] triggered by Bot. Commit: |
|
PR_Github #54871 [ run ] completed with state
|
826b435 to
72c55d5
Compare
|
/bot run |
|
PR_Github #54877 [ run ] triggered by Bot. Commit: |
|
PR_Github #54877 [ run ] completed with state
|
|
/bot run |
|
PR_Github #54895 [ run ] triggered by Bot. Commit: |
|
PR_Github #54895 [ run ] completed with state
|
In KVCacheManager::copyBlockOffsets, when building the cross-attention KV cache offset table for encoder-decoder models (e.g. Whisper), each beam of a request was assigned separate physical blocks. However, the encoder context phase only writes encoder features to beam-0's blocks (contextBeamWidth=1 means only slot 0 per request is addressed). Beams 1..N-1 pointed to uninitialised GPU memory, causing degenerate repetitive output such as "happ happ happ" during beam-search decoding. Fix: when isCrossKv(), always use beam-0's source block IDs and block count for every beam. The encoder output is identical for all beams of a request, so sharing the same physical blocks is semantically correct. Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
When concurrent requests with different beam widths (e.g. beam=1 and beam=5) triggered a verifyRequests() exception in TrtGptModelInflightBatching:: forwardAsync(), the exception handler freed the KV-cache sequence slots via terminateRequest() but did not remove the affected request IDs from mInflightReqIds. The subsequent changeBeamWidth() call checks TLLM_CHECK(mInflightReqIds.empty()) and would abort the process. Fix: in the exception catch block, erase each active request from mInflightReqIds before calling terminateRequest(), then call changeBeamWidth(mOperatingBeamWidth) to reset RuntimeBuffers and DecoderState so the next batch starts from a clean state. Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Two tests in encDecBeamSearchTest.cpp: CrossKvBeamSharingTest/CopyBlockOffsetsAllBeamsShareBeam0Blocks Verifies that KVCacheManager::copyBlockOffsets for a cross-KV cache (CacheType::kCROSS, beam width > 1) writes the same physical block IDs into every beam slot. In the simple context-only setup the allocator shares blocks across beams, so this acts as a sanity check for the correct invariant. CopyGenerationLogitsTest/DirectCopyPlacesEachBeamStepAtCorrectHostOffset Verifies that copyGenerationLogits correctly places each (beam, step) logit from its GPU fragment tensor into the right slot of the host logits buffer. Making copyGenerationLogits a no-op produces 56 assertion failures. Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
…tPointerDevice
The mergeLogitsFragmentsKernel reads fragment GPU addresses from fragmentPointerDevice
([kCACHE_LENGTH]). When multiple requests in the same batch flush their logits fragments
sequentially, each overwrites that shared device buffer before the previous kernel has
committed its result on the stream — a race that caused intermittent degenerate output
("happ happ happ") with gather_generation_logits=True.
Fix: resize fragmentPointerDevice from [kCACHE_LENGTH] to [maxBatchSize, kCACHE_LENGTH],
matching the layout of fragmentPointerHost. Add getFragmentPointerDevice() to return the
current-workIdx row without advancing it, and call it before getFragmentPointerHost() so
both use the same per-slot row. Each request in the batch now has its own device-side
pointer array, eliminating cross-request clobbering.
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
…s PR Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
…ccuracy - Replace getFragmentPointerDevice()/getFragmentPointerHost() with a single getFragmentPointerSlot() that returns both rows atomically and advances workIdx once, making it impossible to call them in the wrong order (gripe 2). - Add TLLM_CHECK that cacheBlockIds is non-empty on the cross-KV path before accessing cacheBlockIds[0] (gripe 5). - Fix the Fix-3 description in encDecBeamSearchTest.cpp: this branch uses a repaired kernel (per-slot fragmentPointerDevice), not the direct-copy path described by the stale comment (gripe 3). Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Fix-1 test (CrossKvBeamSharingTest): previously the allocator happened to share blocks across beams, making equality across beam slots a tautology that passed both with and without the fix. Now the test explicitly writes distinct per-beam sentinel values into the source cacheBlockIndices tensor before calling copyBlockOffsets. The old code (srcBeamIdx = beamIdx) leaves beams 1..N-1 with their own different values; the fixed code (srcBeamIdx = 0 when isCrossKv()) normalises all slots to beam-0. Confirmed: test fails when the fix is reverted. Fix-3 test (CopyGenerationLogitsTest): the previous test used a null dummy cache and did not exercise the mergeLogitsFragmentsKernel path. Replaced with a test that allocates a real GenerationLogitsCache (transposedLogits, fragmentPointerDevice, fragmentPointerHost), creates fragments as pinned-memory slices of cache.logits with known per-(step, beam) values, and verifies both requests produce the correct host layout via the actual kernel merge path. Running two back-to-back flushes also validates that each uses a separate fragmentPointerDevice slot (slot isolation). Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
… handler In pipeline-parallel multi-micro-batch mode, requests from other micro-batches may still be tracked in mInflightReqIds after the error handler erases only the current activeRequests. changeBeamWidth asserts mInflightReqIds.empty(), so calling it unconditionally would throw (caught by the inner try-catch) and skip the intended buffer reset. Add an explicit guard so the reset is skipped when other micro-batches are still in-flight. The next successful forwardAsync iteration will perform the reset via the normal changeBeamWidth call in verifyRequests once the set is clear. This makes the skip explicit rather than relying on TLLM_CHECK as control flow. Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
72c55d5 to
b8b6670
Compare
|
/bot run |
|
PR_Github #54936 [ run ] triggered by Bot. Commit: |
|
PR_Github #54936 [ run ] completed with state
|
|
/bot run |
|
PR_Github #54948 [ run ] triggered by Bot. Commit: |
Summary by CodeRabbit
Bug Fixes
Tests
Description
Fixes three bugs in the TRT backend (
ModelRunnerCpp) that caused corrupted output when concurrent requests with different beam widths (e.g. beam=1 and beam=5) were submitted to the same executor on encoder-decoder models (Whisper) with beam search.Bug 1 — Cross-KV block sharing
KVCacheManager::copyBlockOffsetsassigned separate physical blocks to each beam of a cross-attention sequence. The TRT encoder runs withcontextBeamWidth=1so only beam-0's blocks were ever populated with encoder features. Beams 1..N-1 attended to uninitialised GPU memory, producing degenerate repetitive output.Fix: when
isCrossKv()is true, always use beam-0's source block IDs and count for all beams. The encoder output is identical for all beams of a request.Bug 2 —
mInflightReqIdsleak on mixed-beam exceptionWhen concurrent requests with different beam widths triggered the
verifyRequests()exception inTrtGptModelInflightBatching::forwardAsync, the exception handler freed the KV-cache but did not remove request IDs frommInflightReqIds. The subsequentchangeBeamWidth()call asserts this set is empty and aborted.Fix: erase affected IDs from
mInflightReqIdsand callchangeBeamWidth(mOperatingBeamWidth)in the exception catch block.Bug 3 —
gather_generation_logitscorruption via sharedfragmentPointerDevicemergeLogitsFragmentsKernelreads fragment GPU addresses fromfragmentPointerDevice(shape[kCACHE_LENGTH]). When multiple requests in the same batch flush their logits fragments sequentially, each overwrites that shared device buffer before the previous kernel has committed — a race causing intermittent degenerate output withgather_generation_logits=True.Fix: resize
fragmentPointerDevicefrom[kCACHE_LENGTH]to[maxBatchSize, kCACHE_LENGTH], matching the layout offragmentPointerHost. AddgetFragmentPointerSlot()which returns both host and device rows atomically and advancesworkIdxonce, making call-order errors structurally impossible. Each request in the batch now has its own device-side pointer array.Test Coverage
C++ unit tests (
cpp/tests/unit_tests/batch_manager/encDecBeamSearchTest.cpp):CrossKvBeamSharingTest/CopyBlockOffsetsAllBeamsShareBeam0Blocks— verifies Fix 1: all beam slots in the cross-KV offset table equal beam-0.CopyGenerationLogitsTest/DirectCopyPlacesEachBeamStepAtCorrectHostOffset— verifies Fix 3's host-layout invariant: each (beam, step) logit lands at the correct host offset and fragments are cleared afterward.Validation on Whisper large-v3 with concurrent mixed-beam execution:
gather_generation_logitsgather_generation_logits=TruePR Checklist
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.