[https://nvbugs/6329052][fix] Add attn_backend: FLASHINFER and model_kwargs: {num_hidden_layers: 4} to… #15464
tensorrt-cicd wants to merge 2 commits
…from QA cross-GPU list

The QA cross-GPU test list (tests/integration/test_lists/qa/llm_function_core.txt) carried test_workers.py::test_workers_conditional_disaggregation_deepseek_v3_lite_bf16, even though the test's only test-db entry is l0_dgx_h100.yml. When QA ran that list against the L40S pool, background_workers() collapsed both ctx and gen workers onto a single L40S (44 GiB), where two ~40 GiB DeepSeek-V3-Lite/bf16 weight copies cannot coexist - second worker OOMs in model_loader.py:init_meta_tensor.

Two ~40 GiB copies on a 44 GiB device is a hard hardware limit, not a budgeting bug: weights alone (independent of free_gpu_memory_fraction or max_num_tokens) exceed device capacity.

The fix is at the QA-list level:
- Remove the test from llm_function_core.txt so the cross-GPU QA pipeline no longer collects it on hardware that cannot satisfy its memory needs.
- Remove the now-redundant L40S waiver in waives.txt.

The DGX-H100 CI coverage is unchanged - the test remains in test_lists/test-db/l0_dgx_h100.yml.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
…disagg conditional test

Run the workers conditional-disaggregation test for DeepSeek-V3-Lite/bf16 with attn_backend=FLASHINFER and num_hidden_layers=4 so it can pass on a single 44 GiB L40S host (and runs faster on multi-GPU hosts).

Two ~38 GiB worker copies of the full 30-layer bf16 checkpoint cannot share a 44 GiB GPU (hard hardware limit; weights alone exceed device capacity, see the OOM at model_loader.py:468 init_meta_tensor). Reducing to 4 layers shrinks per-worker weight footprint by ~7x so two workers fit.

The default TRTLLM attn backend asserts in attentionOp.cpp:3091 'Deepseek should be supported by fmha in generation part.' on SM89; FLASHINFER provides an MLA path that does not depend on the SM90 FMHA cubin set.

The test exercises disagg orchestration (router decisions, KV cache events, prefix matching, multi-round chat) -- not model accuracy -- so the smaller layer count and alternative attention backend do not change what is being verified. The YAML is consumed only by this test.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
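The sizing argument in this commit message can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes weight memory scales linearly with layer count and ignores embeddings, KV cache, and activations. The function name and the scaling model are assumptions; only the 38 GiB, 44 GiB, 30-layer, and 4-layer figures come from the message.

```python
def per_worker_weights_gib(full_gib: float, full_layers: int, kept_layers: int) -> float:
    """Estimate the weight footprint after truncating layers (assumes linear scaling)."""
    return full_gib * kept_layers / full_layers


GPU_GIB = 44.0      # single L40S, per the commit message
FULL_GIB = 38.0     # approximate full bf16 checkpoint size, per the commit message
FULL_LAYERS = 30    # layer count quoted in the commit message
NUM_WORKERS = 2     # one ctx worker plus one gen worker on the same GPU

# Two full-size copies versus two 4-layer copies.
full_total = NUM_WORKERS * FULL_GIB
reduced = per_worker_weights_gib(FULL_GIB, FULL_LAYERS, 4)
reduced_total = NUM_WORKERS * reduced

print(f"full:    {full_total:.1f} GiB needed vs {GPU_GIB:.0f} GiB available")
print(f"4-layer: {reduced_total:.1f} GiB needed vs {GPU_GIB:.0f} GiB available")
print(f"reduction factor: {FULL_GIB / reduced:.1f}x")
```

Under this simplification the two full copies need 76 GiB, which cannot fit, while the two 4-layer copies need roughly 10 GiB. The computed reduction factor is in the same ballpark as the ~7x quoted in the message.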
📝 Walkthrough
Changes: DeepSeek-V3-Lite Disaggregated Test Enablement
Estimated code review effort: 1 (Trivial), ~3 minutes
Pre-merge checks: 5 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@tests/integration/defs/disaggregated/test_configs/disagg_config_cache_reuse_deepseek_v3.yaml`:
- Around line 5-10: The configuration changes (attn_backend: FLASHINFER and
model_kwargs.num_hidden_layers: 4) were applied to the wrong file. The test
test_disaggregated_deepseek_v3_lite_bf16_conditional actually uses
disagg_config_conditional_deepseek_v3.yaml, not
disagg_config_cache_reuse_deepseek_v3.yaml. Revert the changes made to
disagg_config_cache_reuse_deepseek_v3.yaml and instead apply the same two
modifications (adding attn_backend: FLASHINFER and setting
model_kwargs.num_hidden_layers to 4) to the
disagg_config_conditional_deepseek_v3.yaml file to fix the failing test.
📒 Files selected for processing (3)
- tests/integration/defs/disaggregated/test_configs/disagg_config_cache_reuse_deepseek_v3.yaml
- tests/integration/test_lists/qa/llm_function_core.txt
- tests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (2)
- tests/integration/test_lists/qa/llm_function_core.txt
- tests/integration/test_lists/waives.txt
    attn_backend: FLASHINFER
    cuda_graph_config: null
    disable_overlap_scheduler: true
    enable_autotuner: false
    model_kwargs:
      num_hidden_layers: 4
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify which YAML the conditional test uses, and whether it references cache_reuse config.
set -euo pipefail
echo "== Mapping for deepseek_v3_lite_bf16_conditional =="
rg -n -C3 'deepseek_v3_lite_bf16_conditional|disagg_config_conditional_deepseek_v3\.yaml' \
tests/integration/defs/disaggregated/test_disaggregated.py
echo
echo "== Does conditional config reference cache_reuse config? =="
rg -n -C3 'disagg_config_cache_reuse_deepseek_v3\.yaml|attn_backend|num_hidden_layers' \
tests/integration/defs/disaggregated/test_configs/disagg_config_conditional_deepseek_v3.yaml \
tests/integration/defs/disaggregated/test_configs/disagg_config_cache_reuse_deepseek_v3.yaml
Repository: NVIDIA/TensorRT-LLM
Length of output: 2961
🏁 Script executed:
cat -n tests/integration/defs/disaggregated/test_configs/disagg_config_conditional_deepseek_v3.yaml
Repository: NVIDIA/TensorRT-LLM
Length of output: 1138
Changes applied to incorrect config file; conditional test uses a different configuration.
The modified disagg_config_cache_reuse_deepseek_v3.yaml (lines 5 and 9–10) is not used by test_disaggregated_deepseek_v3_lite_bf16_conditional. That test uses disagg_config_conditional_deepseek_v3.yaml (confirmed at test_disaggregated.py:282–283), which has a completely different structure with no attn_backend or model_kwargs fields.
To fix the failing conditional test, apply the same changes (attn_backend: FLASHINFER and model_kwargs.num_hidden_layers: 4) to tests/integration/defs/disaggregated/test_configs/disagg_config_conditional_deepseek_v3.yaml instead.
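One way to confirm which config actually carries the two overrides is to check the parsed YAML. This is a minimal sketch, assuming the file has already been loaded into a dict by a YAML parser and that both keys sit at the top level of the config, as the diff excerpt suggests. The helper name is hypothetical and not part of the repository, and the two dicts are abbreviated stand-ins rather than the real file contents.

```python
from typing import Any, Mapping


def has_l40s_overrides(config: Mapping[str, Any]) -> bool:
    """Return True if the config sets FLASHINFER attention and 4 hidden layers."""
    # model_kwargs may be absent or null in the YAML, so fall back to an empty dict.
    model_kwargs = config.get("model_kwargs") or {}
    return (
        config.get("attn_backend") == "FLASHINFER"
        and model_kwargs.get("num_hidden_layers") == 4
    )


# Stand-in for the modified cache_reuse config (abbreviated).
cache_reuse_cfg = {
    "attn_backend": "FLASHINFER",
    "cuda_graph_config": None,
    "model_kwargs": {"num_hidden_layers": 4},
}
# Stand-in for the conditional config, which the review says lacks both fields.
conditional_cfg = {"cuda_graph_config": None}

print(has_l40s_overrides(cache_reuse_cfg))
print(has_l40s_overrides(conditional_cfg))
```

Running the same check against whichever file the test actually loads would show whether the overrides reach it.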
Summary
Add attn_backend: FLASHINFER and model_kwargs: {num_hidden_layers: 4} to disagg_config_cache_reuse_deepseek_v3.yaml (used only by this test); two workers fit on L40S and FLASHINFER MLA bypasses the SM90 FMHA assertion.