DO NOT REVIEW [None][chore] Test MXFP4 MegaMoE integration#15436

Draft
longlee0622 wants to merge 5 commits into NVIDIA:feat/deepseek_v4 from longlee0622:user/jinshik/dsv4-mxfp4-acts-c1

@longlee0622

Collaborator

@coderabbitai summary

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Barry-Delaney and others added 3 commits June 15, 2026 20:26
Wire the MEGAMOE_DEEPGEMM backend to feed MXFP4 (E2M1) activations through the
DeepGEMM fork's mega_moe_pre_dispatch kernel when DG_USE_FP4_ACTS=1:

- _dg_use_fp4_acts() gates the FP4-acts path; default off keeps the FP8-acts
  path byte-for-byte unchanged.
- supports_fused_prepare() returns True under FP4 acts (the SymmBuffer x/x_sf
  slots are FP4-sized; the trtllm mxfp8 prepare op would write mismatched data).
- run_moe fills the SymmBuffer via dg.mega_moe_pre_dispatch(BF16 -> packed
  E2M1) instead of the FP8 quantize+copy.
- Bump the DeepGEMM pin to longlee0622/DeepGEMM @ 7fb6e0d (adds the FP4-acts
  kernels + the buf_x row-stride fix that makes them correct).

DeepSeek-V4-Flash W4A8_MXFP4_MXFP8, TP4/EP4: GSM8K 90 with DG_USE_FP4_ACTS=1,
on par with the FP8-acts path within noise.

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
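
For readers unfamiliar with the "BF16 -> packed E2M1" step named in the commit above, here is a minimal pure-Python sketch of what packing E2M1 values means. It is not the `mega_moe_pre_dispatch` kernel: the helper names `quantize_e2m1` and `pack_e2m1` are hypothetical, the nibble order (first element in the low nibble) is an assumption, ties round toward the smaller magnitude for simplicity, and the per-block shared scale that MXFP4 carries alongside the elements (the `x_sf` slot) is omitted.

```python
# Representable E2M1 magnitudes, indexed by the 3-bit code
# (2 exponent bits, 1 mantissa bit). The sign lives in bit 3.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_e2m1(value: float) -> int:
    """Return the 4-bit E2M1 code nearest to `value` (hypothetical helper)."""
    sign = 0x8 if value < 0 else 0x0
    # Saturate at the largest representable magnitude, 6.0.
    magnitude = min(abs(value), E2M1_MAGNITUDES[-1])
    # Pick the nearest magnitude; on a tie, min() keeps the lower code.
    code = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - magnitude))
    return sign | code


def pack_e2m1(values) -> bytes:
    """Pack an even-length sequence into bytes, two 4-bit codes per byte.

    Assumption: the first element of each pair goes into the low nibble.
    """
    if len(values) % 2:
        raise ValueError("need an even number of values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        out.append(quantize_e2m1(lo) | (quantize_e2m1(hi) << 4))
    return bytes(out)
```

Two values share one byte, which is why the commit describes the SymmBuffer `x` slot as FP4-sized: a buffer laid out for one byte per FP8 element would not match this packing.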
Bump the longlee0622/DeepGEMM pin from 7fb6e0d (stride-fix only, which
still deadlocks the kind::mxf4 2-CTA MegaMoE on small per-expert token
tiles) to 9b2f238 = user/jinshik/mxf4-dense-fp4-load-loop HEAD, which adds
f558a8e (loop the 2SM dense-FP4 A/B load per swizzle-atom; fixes the
block_k=256 full_barrier tx-deficit deadlock) plus the repro test + doc.

Repo unchanged; this controls both the deep_gemm_cpp_tllm extension and
the Python deep_gemm imported by the MEGAMOE_DEEPGEMM MoE backend.

Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
…DG_USE_MXF4_KIND knob

Add TestDeepSeekV4Flash::test_mxfp4_acts_4gpus_static_eplb: the same TP=4/
EP=4/ADP static-EPLB GSM8K path as the MEGAMOE_DEEPGEMM variant, but with
DG_USE_FP4_ACTS=1 + DG_USE_MXF4_KIND=1 so DeepGEMM runs the W4A4
(MXFP4xMXFP4) dense mxf4 kernel instead of the default MXFP8xMXFP4 path.
monkeypatch sets the env on every MPI rank before LLM construction.

Add a DG_USE_MXF4_KIND parity knob to MegaMoEDeepGemm (mega_moe_deepgemm.py):
- _dg_use_mxf4_kind() mirrors _dg_use_fp4_acts() (env-read; the DG kernel
  selects the mainloop itself, no Python kwarg).
- Surface mxf4_kind in the SymmBuffer-alloc diagnostic log.
- Fail fast (ValueError) on DG_USE_MXF4_KIND=1 without DG_USE_FP4_ACTS=1,
  mirroring DG's host assert. DG_USE_MXF4_KIND is a pure-performance knob:
  it does not change numerics vs kind::mxf8f6f4 (sentinel-verified rel-RMSE 0).

Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
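
The env-read gates and the fail-fast check described in the commit above could look roughly like the sketch below. This is an illustration under stated assumptions, not the code in `mega_moe_deepgemm.py`: the helper names `_env_flag` and `resolve_dg_knobs` are hypothetical, and treating only the exact string `"1"` as enabled is an assumption about how the knobs are parsed.

```python
import os


def _env_flag(name: str, env=None) -> bool:
    """Return True only when the variable is set to "1" (assumed parsing)."""
    env = os.environ if env is None else env
    return env.get(name, "0") == "1"


def resolve_dg_knobs(env=None):
    """Read both DeepGEMM knobs and reject the unsupported combination.

    Hypothetical helper: returns (fp4_acts, mxf4_kind).
    """
    fp4_acts = _env_flag("DG_USE_FP4_ACTS", env)
    mxf4_kind = _env_flag("DG_USE_MXF4_KIND", env)
    # Fail before any buffer allocation rather than inside the kernel launch.
    if mxf4_kind and not fp4_acts:
        raise ValueError("DG_USE_MXF4_KIND=1 requires DG_USE_FP4_ACTS=1")
    return fp4_acts, mxf4_kind
```

With neither variable set, `resolve_dg_knobs({})` returns `(False, False)`, which corresponds to the default FP8-acts path; setting only `DG_USE_MXF4_KIND=1` raises `ValueError`.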
@longlee0622 longlee0622 requested review from a team as code owners June 17, 2026 00:57
@longlee0622 longlee0622 requested review from mikeiovine and removed request for a team June 17, 2026 00:57
@longlee0622 longlee0622 marked this pull request as draft June 17, 2026 00:57
@longlee0622

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54697 [ run ] triggered by Bot. Commit: 0ab4c48 Link to invocation

@longlee0622

Collaborator Author

/bot kill

Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
@longlee0622

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54728 [ kill ] triggered by Bot. Commit: add8721 Link to invocation

@longlee0622

Collaborator Author

/bot kill

@tensorrt-cicd

Collaborator

PR_Github #54697 [ run ] completed with state ABORTED. Commit: 0ab4c48

Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54728 [ kill ] completed with state SUCCESS. Commit: add8721
Successfully killed previous jobs for commit add8721

Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54730 [ run ] triggered by Bot. Commit: add8721 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54733 [ kill ] triggered by Bot. Commit: add8721 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54730 [ run ] completed with state ABORTED. Commit: add8721

Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54733 [ kill ] completed with state SUCCESS. Commit: add8721
Successfully killed previous jobs for commit add8721

Link to invocation

MpiPoolSession._start_mpi_pool only forwarded TRTLLM*/TLLM* env to the
mpi4py workers, so DeepGEMM knobs (DG_USE_FP4_ACTS, DG_USE_MXF4_KIND) set
in the driver (e.g. via a test monkeypatch) never reached the workers --
they silently ran the default FP8-acts kernel instead of W4A4 mxf4.
Forward DG_* too so the worker-side kernel/SymmBuffer selection matches
the driver's intent.

Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
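
The forwarding rule described in the commit above amounts to a prefix filter over the driver's environment. A minimal sketch follows, assuming the filter is a plain prefix match; `select_worker_env` and `FORWARDED_PREFIXES` are hypothetical names, and the real `MpiPoolSession._start_mpi_pool` may build and pass the worker environment differently.

```python
import os

# Prefixes named in the commit message: TRTLLM*/TLLM* were already forwarded,
# DG_* is the addition.
FORWARDED_PREFIXES = ("TRTLLM", "TLLM", "DG_")


def select_worker_env(env=None, prefixes=FORWARDED_PREFIXES) -> dict:
    """Return the subset of `env` whose keys start with a forwarded prefix."""
    env = os.environ if env is None else env
    # str.startswith accepts a tuple and matches any of its entries.
    return {key: value for key, value in env.items() if key.startswith(prefixes)}
```

Under the old prefix set, a `DG_USE_FP4_ACTS=1` set by a test monkeypatch in the driver would be filtered out, so the workers would fall back to their defaults without any error. That matches the silent FP8-acts fallback the commit describes.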
@longlee0622

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54752 [ run ] triggered by Bot. Commit: aa24bc0 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54752 [ run ] completed with state SUCCESS. Commit: aa24bc0
/LLM/main/L0_MergeRequest_PR pipeline #43772 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation
