DO NOT REVIEW [None][chore] Test MXFP4 MegaMoE integration by longlee0622 · Pull Request #15436 · NVIDIA/TensorRT-LLM

longlee0622 · 2026-06-17T00:57:41Z

@coderabbitai summary

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Wire the MEGAMOE_DEEPGEMM backend to feed MXFP4 (E2M1) activations through the DeepGEMM fork's mega_moe_pre_dispatch kernel when DG_USE_FP4_ACTS=1:

- _dg_use_fp4_acts() gates the FP4-acts path; default off keeps the FP8-acts path byte-for-byte unchanged.
- supports_fused_prepare() returns True under FP4 acts (the SymmBuffer x/x_sf slots are FP4-sized; the trtllm mxfp8 prepare op would write mismatched data).
- run_moe fills the SymmBuffer via dg.mega_moe_pre_dispatch(BF16 -> packed E2M1) instead of the FP8 quantize+copy.
- Bump the DeepGEMM pin to longlee0622/DeepGEMM @ 7fb6e0d (adds the FP4-acts kernels + the buf_x row-stride fix that makes them correct).

DeepSeek-V4-Flash W4A8_MXFP4_MXFP8, TP4/EP4: GSM8K 90 with DG_USE_FP4_ACTS=1, on par with the FP8-acts path within noise.

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
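For readers unfamiliar with the "BF16 -> packed E2M1" step named in the commit message above, here is a minimal pure-Python sketch of MXFP4-style block quantization. It is not the DeepGEMM kernel. The block size of 32, the power-of-two shared scale, the nibble order (even element in the low nibble), and the tie-breaking rule are all assumptions made for illustration, and every function name is hypothetical.

```python
import math

# The eight non-negative magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 32  # assumed block size: one shared scale per 32 elements


def encode_e2m1(value):
    """Return a 4-bit code: sign in bit 3, index of the nearest magnitude in bits 0-2.

    Simplification: ties go to the smaller magnitude; values beyond 6 saturate.
    """
    sign = 0x8 if value < 0 else 0x0
    mag = min(abs(value), E2M1_MAGNITUDES[-1])
    idx = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - mag))
    return sign | idx


def decode_e2m1(code):
    """Inverse of encode_e2m1 for a single 4-bit code."""
    mag = E2M1_MAGNITUDES[code & 0x7]
    return -mag if code & 0x8 else mag


def quantize_block(values):
    """Quantize one block to (shared power-of-two exponent, 16 packed bytes)."""
    assert len(values) == BLOCK
    amax = max(abs(v) for v in values)
    # Align the block maximum with the top E2M1 binade (6 = 1.5 * 2**2).
    exp = math.floor(math.log2(amax)) - 2 if amax > 0 else 0
    scale = 2.0 ** exp
    codes = [encode_e2m1(v / scale) for v in values]
    # Assumed layout: element 2i in the low nibble, element 2i+1 in the high nibble.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, BLOCK, 2))
    return exp, packed


def dequantize_block(exp, packed):
    """Unpack the bytes and rescale, mirroring quantize_block's layout."""
    scale = 2.0 ** exp
    out = []
    for byte in packed:
        out.append(decode_e2m1(byte & 0xF) * scale)
        out.append(decode_e2m1(byte >> 4) * scale)
    return out
```

Packing two 4-bit codes per byte is why the FP4-sized SymmBuffer slots hold half as many bytes per row as FP8-sized ones would, which is consistent with the commit's note that an FP8 prepare op would write mismatched data into them.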

Bump the longlee0622/DeepGEMM pin from 7fb6e0d (stride-fix only, which still deadlocks the kind::mxf4 2-CTA MegaMoE on small per-expert token tiles) to 9b2f238 = user/jinshik/mxf4-dense-fp4-load-loop HEAD, which adds f558a8e (loop the 2SM dense-FP4 A/B load per swizzle-atom; fixes the block_k=256 full_barrier tx-deficit deadlock) plus the repro test + doc.

Repo unchanged; this controls both the deep_gemm_cpp_tllm extension and the Python deep_gemm imported by the MEGAMOE_DEEPGEMM MoE backend.

Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>

…DG_USE_MXF4_KIND knob

Add TestDeepSeekV4Flash::test_mxfp4_acts_4gpus_static_eplb: the same TP=4/EP=4/ADP static-EPLB GSM8K path as the MEGAMOE_DEEPGEMM variant, but with DG_USE_FP4_ACTS=1 + DG_USE_MXF4_KIND=1 so DeepGEMM runs the W4A4 (MXFP4xMXFP4) dense mxf4 kernel instead of the default MXFP8xMXFP4 path. monkeypatch sets the env on every MPI rank before LLM construction.

Add a DG_USE_MXF4_KIND parity knob to MegaMoEDeepGemm (mega_moe_deepgemm.py):

- _dg_use_mxf4_kind() mirrors _dg_use_fp4_acts() (env-read; the DG kernel selects the mainloop itself, no Python kwarg).
- Surface mxf4_kind in the SymmBuffer-alloc diagnostic log.
- Fail fast (ValueError) on DG_USE_MXF4_KIND=1 without DG_USE_FP4_ACTS=1, mirroring DG's host assert.

DG_USE_MXF4_KIND is a pure-performance knob: it does not change numerics vs kind::mxf8f6f4 (sentinel-verified rel-RMSE 0).

Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
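The env-read gates and the fail-fast check described in the commit message above could look roughly like the sketch below. The environment variable names come from the commit message; the helper names (env_flag, check_dg_knobs), the "exactly 1 means on" convention, and the error text are assumptions, not the code in mega_moe_deepgemm.py.

```python
import os


def env_flag(name, environ=None):
    """Return True when the variable is set to exactly "1" (assumed convention)."""
    environ = os.environ if environ is None else environ
    return environ.get(name, "0") == "1"


def check_dg_knobs(environ=None):
    """Read both knobs and reject the unsupported combination early.

    Hypothetical helper: returns (fp4_acts, mxf4_kind) or raises ValueError
    when the mxf4 mainloop is requested without FP4 activations.
    """
    fp4_acts = env_flag("DG_USE_FP4_ACTS", environ)
    mxf4_kind = env_flag("DG_USE_MXF4_KIND", environ)
    if mxf4_kind and not fp4_acts:
        raise ValueError("DG_USE_MXF4_KIND=1 requires DG_USE_FP4_ACTS=1")
    return fp4_acts, mxf4_kind
```

Raising at construction time, rather than letting the kernel's host assert fire later, surfaces the misconfiguration before any buffers are allocated. Passing a plain dict as environ keeps the check testable without mutating the process environment.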

longlee0622 · 2026-06-17T00:58:56Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-17T01:06:07Z

PR_Github #54697 [ run ] triggered by Bot. Commit: 0ab4c48 Link to invocation

longlee0622 · 2026-06-17T02:59:12Z

/bot kill

Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>

longlee0622 · 2026-06-17T03:03:21Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-17T03:05:53Z

PR_Github #54728 [ kill ] triggered by Bot. Commit: add8721 Link to invocation

longlee0622 · 2026-06-17T03:07:16Z

/bot kill

tensorrt-cicd · 2026-06-17T03:07:21Z

PR_Github #54697 [ run ] completed with state ABORTED. Commit: 0ab4c48

Link to invocation

tensorrt-cicd · 2026-06-17T03:07:28Z

PR_Github #54728 [ kill ] completed with state SUCCESS. Commit: add8721
Successfully killed previous jobs for commit add8721

Link to invocation

tensorrt-cicd · 2026-06-17T03:10:22Z

PR_Github #54730 [ run ] triggered by Bot. Commit: add8721 Link to invocation

tensorrt-cicd · 2026-06-17T03:14:01Z

PR_Github #54733 [ kill ] triggered by Bot. Commit: add8721 Link to invocation

tensorrt-cicd · 2026-06-17T03:18:33Z

PR_Github #54730 [ run ] completed with state ABORTED. Commit: add8721

Link to invocation

tensorrt-cicd · 2026-06-17T03:18:46Z

PR_Github #54733 [ kill ] completed with state SUCCESS. Commit: add8721
Successfully killed previous jobs for commit add8721

Link to invocation

MpiPoolSession._start_mpi_pool only forwarded TRTLLM*/TLLM* env to the mpi4py workers, so DeepGEMM knobs (DG_USE_FP4_ACTS, DG_USE_MXF4_KIND) set in the driver (e.g. via a test monkeypatch) never reached the workers -- they silently ran the default FP8-acts kernel instead of W4A4 mxf4. Forward DG_* too so the worker-side kernel/SymmBuffer selection matches the driver's intent.

Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
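The prefix-based forwarding described in the commit message above can be illustrated with a small filter. This is a sketch only: select_forwarded_env and FORWARDED_PREFIXES are hypothetical names, and how the resulting mapping is handed to the mpi4py workers is not shown because the source does not include that code.

```python
import os

# Prefixes named in the commit message; "DG_" is the newly forwarded one.
FORWARDED_PREFIXES = ("TRTLLM", "TLLM", "DG_")


def select_forwarded_env(environ=None, prefixes=FORWARDED_PREFIXES):
    """Return only the variables whose names start with one of the prefixes."""
    environ = os.environ if environ is None else environ
    # str.startswith accepts a tuple, so one call covers every prefix.
    return {key: value for key, value in environ.items() if key.startswith(prefixes)}
```

Matching on "DG_" with the underscore, rather than a bare "DG", avoids picking up unrelated variables that merely begin with those two letters.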

longlee0622 · 2026-06-17T04:34:48Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-17T04:41:52Z

PR_Github #54752 [ run ] triggered by Bot. Commit: aa24bc0 Link to invocation

tensorrt-cicd · 2026-06-17T10:21:49Z

PR_Github #54752 [ run ] completed with state SUCCESS. Commit: aa24bc0
/LLM/main/L0_MergeRequest_PR pipeline #43772 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Barry-Delaney and others added 3 commits June 15, 2026 20:26

longlee0622 requested review from a team as code owners June 17, 2026 00:57

longlee0622 requested review from mikeiovine and removed request for a team June 17, 2026 00:57

longlee0622 marked this pull request as draft June 17, 2026 00:57

github-actions Bot assigned longlee0622 Jun 17, 2026

test: add MXFP4 MegaMoE case to B300 L0

add8721

Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>



3 participants