Skip to content

[https://nvbugs/6329052][fix] Add attn_backend: FLASHINFER and model_kwargs: {num_hidden_layers: 4} to…#15464

Open
tensorrt-cicd wants to merge 2 commits into
NVIDIA:mainfrom
tensorrt-cicd:repair-bot-bug6329052
Open

[https://nvbugs/6329052][fix] Add attn_backend: FLASHINFER and model_kwargs: {num_hidden_layers: 4} to…#15464
tensorrt-cicd wants to merge 2 commits into
NVIDIA:mainfrom
tensorrt-cicd:repair-bot-bug6329052

Conversation

@tensorrt-cicd

@tensorrt-cicd tensorrt-cicd commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Root cause: Two full-size DeepSeek-V3-Lite/bf16 worker copies (~38 GiB each) can't share a 44 GiB L40S, and the TRTLLM attn backend asserts FMHA support for DeepSeek MLA which is unavailable on SM89.
  • Fix: Add attn_backend: FLASHINFER and model_kwargs: {num_hidden_layers: 4} to disagg_config_cache_reuse_deepseek_v3.yaml (used only by this test); two workers fit on L40S and FLASHINFER MLA bypasses the SM90 FMHA assertion.
  • Automated fix generated by repair-bot

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

  • Tests
    • Updated model configuration settings for specific model variants
    • Adjusted test coverage entries

…from QA cross-GPU list

The QA cross-GPU test list (tests/integration/test_lists/qa/llm_function_core.txt)
carried test_workers.py::test_workers_conditional_disaggregation_deepseek_v3_lite_bf16,
even though the test's only test-db entry is l0_dgx_h100.yml. When QA ran that
list against the L40S pool, background_workers() collapsed both ctx and gen
workers onto a single L40S (44 GiB), where two ~40 GiB DeepSeek-V3-Lite/bf16
weight copies cannot coexist - second worker OOMs in
model_loader.py:init_meta_tensor.

Two ~40 GiB copies on a 44 GiB device is a hard hardware limit, not a
budgeting bug: weights alone (independent of free_gpu_memory_fraction or
max_num_tokens) exceed device capacity. The fix is at the QA-list level:
- Remove the test from llm_function_core.txt so the cross-GPU QA pipeline
  no longer collects it on hardware that cannot satisfy its memory needs.
- Remove the now-redundant L40S waiver in waives.txt.

The DGX-H100 CI coverage is unchanged - the test remains in
test_lists/test-db/l0_dgx_h100.yml.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
…disagg conditional test

Run the workers conditional-disaggregation test for DeepSeek-V3-Lite/bf16
with attn_backend=FLASHINFER and num_hidden_layers=4 so it can pass on a
single 44 GiB L40S host (and runs faster on multi-GPU hosts).

Two ~38 GiB worker copies of the full 30-layer bf16 checkpoint cannot
share a 44 GiB GPU (hard hardware limit; weights alone exceed device
capacity, see the OOM at model_loader.py:468 init_meta_tensor). Reducing
to 4 layers shrinks per-worker weight footprint by ~7x so two workers
fit. The default TRTLLM attn backend asserts in
attentionOp.cpp:3091 'Deepseek should be supported by fmha in generation
part.' on SM89; FLASHINFER provides an MLA path that does not depend on
the SM90 FMHA cubin set.

The test exercises disagg orchestration (router decisions, KV cache
events, prefix matching, multi-round chat) -- not model accuracy -- so
the smaller layer count and alternative attention backend do not change
what is being verified. The YAML is consumed only by this test.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds attn_backend: FLASHINFER and a model_kwargs block with num_hidden_layers: 4 to the disaggregated cache-reuse DeepSeek-V3-Lite test config. Removes the test_workers_conditional_disaggregation_deepseek_v3_lite_bf16 entry from the QA test list and the corresponding L40S skip waiver.

Changes

DeepSeek-V3-Lite Disaggregated Test Enablement

Layer / File(s) Summary
Config update and test list/waiver cleanup
tests/integration/defs/disaggregated/test_configs/disagg_config_cache_reuse_deepseek_v3.yaml, tests/integration/test_lists/qa/llm_function_core.txt, tests/integration/test_lists/waives.txt
Adds attn_backend: FLASHINFER and model_kwargs: num_hidden_layers: 4 to the disagg config; removes the test_workers_conditional_disaggregation_deepseek_v3_lite_bf16[DeepSeek-V3-Lite-bf16] entry from the QA run list and drops its SKIP waiver for full:L40S.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#15214: Also modifies tests/integration/test_lists/waives.txt to remove a waiver entry for a different integration test case.
  • NVIDIA/TensorRT-LLM#15389: Directly inverse change — adds the same full:L40S/disaggregated/test_workers.py::test_workers_conditional_disaggregation_deepseek_v3_lite_bf16 waiver entry that this PR removes.

Suggested reviewers

  • tburt-nv
  • pcastonguay
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main fix (adding attn_backend: FLASHINFER and model_kwargs configuration) to the specific configuration file, which aligns with the primary changes in the changeset.
Description check ✅ Passed The description provides a clear summary of the root cause, the fix applied, test plan confirmation, and links to the relevant bug. It follows the template structure with appropriate sections and sufficient detail.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@tests/integration/defs/disaggregated/test_configs/disagg_config_cache_reuse_deepseek_v3.yaml`:
- Around line 5-10: The configuration changes (attn_backend: FLASHINFER and
model_kwargs.num_hidden_layers: 4) were applied to the wrong file. The test
test_disaggregated_deepseek_v3_lite_bf16_conditional actually uses
disagg_config_conditional_deepseek_v3.yaml, not
disagg_config_cache_reuse_deepseek_v3.yaml. Revert the changes made to
disagg_config_cache_reuse_deepseek_v3.yaml and instead apply the same two
modifications (adding attn_backend: FLASHINFER and setting
model_kwargs.num_hidden_layers to 4) to the
disagg_config_conditional_deepseek_v3.yaml file to fix the failing test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 84282347-8144-4f9e-95c7-429fe86330df

📥 Commits

Reviewing files that changed from the base of the PR and between 42a3e55 and a4a8e88.

📒 Files selected for processing (3)
  • tests/integration/defs/disaggregated/test_configs/disagg_config_cache_reuse_deepseek_v3.yaml
  • tests/integration/test_lists/qa/llm_function_core.txt
  • tests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (2)
  • tests/integration/test_lists/qa/llm_function_core.txt
  • tests/integration/test_lists/waives.txt

Comment on lines +5 to +10
attn_backend: FLASHINFER
cuda_graph_config: null
disable_overlap_scheduler: true
enable_autotuner: false
model_kwargs:
num_hidden_layers: 4

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify which YAML the conditional test uses, and whether it references cache_reuse config.

set -euo pipefail

echo "== Mapping for deepseek_v3_lite_bf16_conditional =="
rg -n -C3 'deepseek_v3_lite_bf16_conditional|disagg_config_conditional_deepseek_v3\.yaml' \
  tests/integration/defs/disaggregated/test_disaggregated.py

echo
echo "== Does conditional config reference cache_reuse config? =="
rg -n -C3 'disagg_config_cache_reuse_deepseek_v3\.yaml|attn_backend|num_hidden_layers' \
  tests/integration/defs/disaggregated/test_configs/disagg_config_conditional_deepseek_v3.yaml \
  tests/integration/defs/disaggregated/test_configs/disagg_config_cache_reuse_deepseek_v3.yaml

Repository: NVIDIA/TensorRT-LLM

Length of output: 2961


🏁 Script executed:

cat -n tests/integration/defs/disaggregated/test_configs/disagg_config_conditional_deepseek_v3.yaml

Repository: NVIDIA/TensorRT-LLM

Length of output: 1138


Changes applied to incorrect config file; conditional test uses a different configuration.

The modified disagg_config_cache_reuse_deepseek_v3.yaml (lines 5 and 9–10) is not used by test_disaggregated_deepseek_v3_lite_bf16_conditional. That test uses disagg_config_conditional_deepseek_v3.yaml (confirmed at test_disaggregated.py:282–283), which has a completely different structure with no attn_backend or model_kwargs fields.

To fix the failing conditional test, apply the same changes (attn_backend: FLASHINFER and model_kwargs.num_hidden_layers: 4) to tests/integration/defs/disaggregated/test_configs/disagg_config_conditional_deepseek_v3.yaml instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/integration/defs/disaggregated/test_configs/disagg_config_cache_reuse_deepseek_v3.yaml`
around lines 5 - 10, The configuration changes (attn_backend: FLASHINFER and
model_kwargs.num_hidden_layers: 4) were applied to the wrong file. The test
test_disaggregated_deepseek_v3_lite_bf16_conditional actually uses
disagg_config_conditional_deepseek_v3.yaml, not
disagg_config_cache_reuse_deepseek_v3.yaml. Revert the changes made to
disagg_config_cache_reuse_deepseek_v3.yaml and instead apply the same two
modifications (adding attn_backend: FLASHINFER and setting
model_kwargs.num_hidden_layers to 4) to the
disagg_config_conditional_deepseek_v3.yaml file to fix the failing test.

Source: Coding guidelines

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants