[https://nvbugs/6329052][fix] Add `attn_backend: FLASHINFER` and `model_kwargs: {num_hidden_layers: 4}` to… by tensorrt-cicd · Pull Request #15464 · NVIDIA/TensorRT-LLM

tensorrt-cicd · 2026-06-17T21:05:53Z

Summary

Root cause: Two full-size DeepSeek-V3-Lite/bf16 worker copies (~38 GiB each) can't share a 44 GiB L40S, and the TRTLLM attn backend asserts FMHA support for DeepSeek MLA which is unavailable on SM89.
Fix: Add attn_backend: FLASHINFER and model_kwargs: {num_hidden_layers: 4} to disagg_config_cache_reuse_deepseek_v3.yaml (used only by this test); two workers fit on L40S and FLASHINFER MLA bypasses the SM90 FMHA assertion.
Automated fix generated by repair-bot

Test plan

Verify fix on the same GPU type as the original failure
Check for regressions in related tests

Links

Bug: https://nvbugs/6329052

Summary by CodeRabbit

Tests
- Updated model configuration settings for specific model variants
- Adjusted test coverage entries

…from QA cross-GPU list

The QA cross-GPU test list (tests/integration/test_lists/qa/llm_function_core.txt) carried test_workers.py::test_workers_conditional_disaggregation_deepseek_v3_lite_bf16, even though the test's only test-db entry is l0_dgx_h100.yml. When QA ran that list against the L40S pool, background_workers() collapsed both ctx and gen workers onto a single L40S (44 GiB), where two ~40 GiB DeepSeek-V3-Lite/bf16 weight copies cannot coexist - second worker OOMs in model_loader.py:init_meta_tensor.

Two ~40 GiB copies on a 44 GiB device is a hard hardware limit, not a budgeting bug: weights alone (independent of free_gpu_memory_fraction or max_num_tokens) exceed device capacity.

The fix is at the QA-list level:
- Remove the test from llm_function_core.txt so the cross-GPU QA pipeline no longer collects it on hardware that cannot satisfy its memory needs.
- Remove the now-redundant L40S waiver in waives.txt.

The DGX-H100 CI coverage is unchanged - the test remains in test_lists/test-db/l0_dgx_h100.yml.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>

…disagg conditional test

Run the workers conditional-disaggregation test for DeepSeek-V3-Lite/bf16 with attn_backend=FLASHINFER and num_hidden_layers=4 so it can pass on a single 44 GiB L40S host (and runs faster on multi-GPU hosts).

Two ~38 GiB worker copies of the full 30-layer bf16 checkpoint cannot share a 44 GiB GPU (hard hardware limit; weights alone exceed device capacity, see the OOM at model_loader.py:468 init_meta_tensor). Reducing to 4 layers shrinks per-worker weight footprint by ~7x so two workers fit.

The default TRTLLM attn backend asserts in attentionOp.cpp:3091 'Deepseek should be supported by fmha in generation part.' on SM89; FLASHINFER provides an MLA path that does not depend on the SM90 FMHA cubin set.

The test exercises disagg orchestration (router decisions, KV cache events, prefix matching, multi-round chat) -- not model accuracy -- so the smaller layer count and alternative attention backend do not change what is being verified. The YAML is consumed only by this test.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
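The memory argument in this commit message can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the approximate figures quoted above (~38 GiB per full worker, a 44 GiB device, a ~7x reduction with 4 layers), counts weights only, and ignores KV cache, activations, and runtime overhead. The constant and function names are hypothetical and do not come from the TensorRT-LLM code base.

```python
# Rough weight-only budget check, using the approximate numbers from the
# commit message. All names here are hypothetical, for illustration only.

DEVICE_GIB = 44.0            # L40S capacity as quoted in the PR
FULL_WORKER_GIB = 38.0       # approximate full bf16 weight copy per worker
SHRINK_FACTOR = 7.0          # approximate reduction quoted for 4 layers
NUM_WORKERS = 2              # one ctx worker plus one gen worker


def total_weight_gib(per_worker_gib: float, workers: int = NUM_WORKERS) -> float:
    """Total weight footprint when every worker holds its own copy."""
    return per_worker_gib * workers


def fits_on_device(per_worker_gib: float,
                   workers: int = NUM_WORKERS,
                   device_gib: float = DEVICE_GIB) -> bool:
    """True if the weights alone fit; says nothing about KV cache headroom."""
    return total_weight_gib(per_worker_gib, workers) <= device_gib


reduced_worker_gib = FULL_WORKER_GIB / SHRINK_FACTOR

if __name__ == "__main__":
    for label, size in (("full", FULL_WORKER_GIB), ("reduced", reduced_worker_gib)):
        print(f"{label}: {total_weight_gib(size):.1f} GiB total, "
              f"fits={fits_on_device(size)}")
```

With these inputs the full-size pair needs 76 GiB, well over the device capacity, while the reduced pair stays far below it, which matches the claim that no setting of free_gpu_memory_fraction could have rescued the full-size configuration.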

coderabbitai · 2026-06-17T21:09:10Z

📝 Walkthrough

Walkthrough

Adds attn_backend: FLASHINFER and a model_kwargs block with num_hidden_layers: 4 to the disaggregated cache-reuse DeepSeek-V3-Lite test config. Removes the test_workers_conditional_disaggregation_deepseek_v3_lite_bf16 entry from the QA test list and the corresponding L40S skip waiver.

Changes

DeepSeek-V3-Lite Disaggregated Test Enablement

Layer / File(s)	Summary
Config update and test list/waiver cleanup `tests/integration/defs/disaggregated/test_configs/disagg_config_cache_reuse_deepseek_v3.yaml`, `tests/integration/test_lists/qa/llm_function_core.txt`, `tests/integration/test_lists/waives.txt`	Adds `attn_backend: FLASHINFER` and `model_kwargs: {num_hidden_layers: 4}` to the disagg config; removes the `test_workers_conditional_disaggregation_deepseek_v3_lite_bf16[DeepSeek-V3-Lite-bf16]` entry from the QA run list and drops its `SKIP` waiver for `full:L40S`.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#15214: Also modifies tests/integration/test_lists/waives.txt to remove a waiver entry for a different integration test case.
NVIDIA/TensorRT-LLM#15389: Directly inverse change — adds the same full:L40S/disaggregated/test_workers.py::test_workers_conditional_disaggregation_deepseek_v3_lite_bf16 waiver entry that this PR removes.

Suggested reviewers

tburt-nv
pcastonguay

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main fix (adding attn_backend: FLASHINFER and model_kwargs configuration) to the specific configuration file, which aligns with the primary changes in the changeset.
Description check	✅ Passed	The description provides a clear summary of the root cause, the fix applied, test plan confirmation, and links to the relevant bug. It follows the template structure with appropriate sections and sufficient detail.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@tests/integration/defs/disaggregated/test_configs/disagg_config_cache_reuse_deepseek_v3.yaml`:
- Around line 5-10: The configuration changes (attn_backend: FLASHINFER and
model_kwargs.num_hidden_layers: 4) were applied to the wrong file. The test
test_disaggregated_deepseek_v3_lite_bf16_conditional actually uses
disagg_config_conditional_deepseek_v3.yaml, not
disagg_config_cache_reuse_deepseek_v3.yaml. Revert the changes made to
disagg_config_cache_reuse_deepseek_v3.yaml and instead apply the same two
modifications (adding attn_backend: FLASHINFER and setting
model_kwargs.num_hidden_layers to 4) to the
disagg_config_conditional_deepseek_v3.yaml file to fix the failing test.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 84282347-8144-4f9e-95c7-429fe86330df

📥 Commits

Reviewing files that changed from the base of the PR and between 42a3e55 and a4a8e88.

📒 Files selected for processing (3)

tests/integration/defs/disaggregated/test_configs/disagg_config_cache_reuse_deepseek_v3.yaml
tests/integration/test_lists/qa/llm_function_core.txt
tests/integration/test_lists/waives.txt

💤 Files with no reviewable changes (2)

tests/integration/test_lists/qa/llm_function_core.txt
tests/integration/test_lists/waives.txt

coderabbitai · 2026-06-17T21:09:13Z

+attn_backend: FLASHINFER
 cuda_graph_config: null
 disable_overlap_scheduler: true
 enable_autotuner: false
+model_kwargs:
+  num_hidden_layers: 4


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify which YAML the conditional test uses, and whether it references cache_reuse config.
set -euo pipefail

echo "== Mapping for deepseek_v3_lite_bf16_conditional =="
rg -n -C3 'deepseek_v3_lite_bf16_conditional|disagg_config_conditional_deepseek_v3\.yaml' \
  tests/integration/defs/disaggregated/test_disaggregated.py

echo
echo "== Does conditional config reference cache_reuse config? =="
rg -n -C3 'disagg_config_cache_reuse_deepseek_v3\.yaml|attn_backend|num_hidden_layers' \
  tests/integration/defs/disaggregated/test_configs/disagg_config_conditional_deepseek_v3.yaml \
  tests/integration/defs/disaggregated/test_configs/disagg_config_cache_reuse_deepseek_v3.yaml

Repository: NVIDIA/TensorRT-LLM

Length of output: 2961

🏁 Script executed:

cat -n tests/integration/defs/disaggregated/test_configs/disagg_config_conditional_deepseek_v3.yaml

Repository: NVIDIA/TensorRT-LLM

Length of output: 1138

Changes applied to incorrect config file; conditional test uses a different configuration.

The modified disagg_config_cache_reuse_deepseek_v3.yaml (lines 5 and 9–10) is not used by test_disaggregated_deepseek_v3_lite_bf16_conditional. That test uses disagg_config_conditional_deepseek_v3.yaml (confirmed at test_disaggregated.py:282–283), which has a completely different structure with no attn_backend or model_kwargs fields.

To fix the failing conditional test, apply the same changes (attn_backend: FLASHINFER and model_kwargs.num_hidden_layers: 4) to tests/integration/defs/disaggregated/test_configs/disagg_config_conditional_deepseek_v3.yaml instead.
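The question raised in this comment, which YAML file a given test function actually references, can also be checked programmatically rather than by reading rg output. The following is a minimal sketch, assuming the config name appears as a string literal inside the test function body. The helper name `configs_used_by` and the sample source are hypothetical and are not taken from the real test_disaggregated.py or test_workers.py.

```python
import ast
import re

# Matches file names such as "disagg_config_foo.yaml" inside source text.
YAML_NAME = re.compile(r"[\w.-]+\.yaml")


def configs_used_by(source: str, test_name: str) -> list[str]:
    """Return the YAML file names mentioned inside one test function."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == test_name:
            body = ast.get_source_segment(source, node) or ""
            return sorted(set(YAML_NAME.findall(body)))
    return []  # test function not found in this source


# Hypothetical stand-in for a test module; not the real repository content.
SAMPLE_SOURCE = '''
def test_conditional(run_test):
    run_test("disagg_config_conditional_deepseek_v3.yaml")


def test_cache_reuse(run_test):
    run_test("disagg_config_cache_reuse_deepseek_v3.yaml")
'''

if __name__ == "__main__":
    print(configs_used_by(SAMPLE_SOURCE, "test_conditional"))
```

Running the same helper against both the test named in the PR description and the test named in this review comment would show whether they resolve to the same config file. Configs that are built dynamically or passed through fixtures would not be detected by this string-literal scan.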


Source: Coding guidelines

tensorrt-cicd added 2 commits June 17, 2026 13:06

tensorrt-cicd requested review from a team as code owners June 17, 2026 21:05

tensorrt-cicd assigned Shixiaowei02 Jun 17, 2026

github-actions Bot assigned tensorrt-cicd Jun 17, 2026

coderabbitai Bot reviewed Jun 17, 2026

tensorrt-cicd wants to merge 2 commits into NVIDIA:main from tensorrt-cicd:repair-bot-bug6329052
