DO NOT REVIEW [None][chore] Test MXFP4 MegaMoE integration #15436
longlee0622 wants to merge 5 commits into
Conversation
Wire the MEGAMOE_DEEPGEMM backend to feed MXFP4 (E2M1) activations through the DeepGEMM fork's mega_moe_pre_dispatch kernel when DG_USE_FP4_ACTS=1:

- _dg_use_fp4_acts() gates the FP4-acts path; default off keeps the FP8-acts path byte-for-byte unchanged.
- supports_fused_prepare() returns True under FP4 acts (the SymmBuffer x/x_sf slots are FP4-sized; the trtllm mxfp8 prepare op would write mismatched data).
- run_moe fills the SymmBuffer via dg.mega_moe_pre_dispatch(BF16 -> packed E2M1) instead of the FP8 quantize+copy.
- Bump the DeepGEMM pin to longlee0622/DeepGEMM @ 7fb6e0d (adds the FP4-acts kernels + the buf_x row-stride fix that makes them correct).

DeepSeek-V4-Flash W4A8_MXFP4_MXFP8, TP4/EP4: GSM8K 90 with DG_USE_FP4_ACTS=1, on par with the FP8-acts path within noise.

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
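For readers unfamiliar with the format, the following is a minimal pure-Python sketch of what "BF16 -> packed E2M1" means numerically. It is not the mega_moe_pre_dispatch kernel and makes several assumptions: the shared scale is a power of two chosen as floor(log2(amax)) minus the largest E2M1 exponent (the usual MX convention), ties round toward the smaller magnitude, and two codes are packed per byte with the low nibble first. The function names are hypothetical, and the real kernel's rounding, scale layout, and nibble order may differ.

```python
import math

# Magnitudes representable by E2M1 (1 sign, 2 exponent, 1 mantissa bit),
# indexed by the 3-bit magnitude code.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(values):
    """Return (shared_exponent, codes) for one block of floats.

    In MX formats a block (typically 32 elements) shares one power-of-two scale.
    """
    amax = max(abs(v) for v in values)
    # Assumed scale rule: floor(log2(amax)) minus the max E2M1 exponent (2).
    shared_exp = math.floor(math.log2(amax)) - 2 if amax > 0 else 0
    scale = 2.0 ** shared_exp
    codes = []
    for v in values:
        mag = min(abs(v) / scale, 6.0)  # clamp to the largest E2M1 magnitude
        # Nearest representable magnitude; ties go to the smaller one here.
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
        codes.append(idx | (0x8 if v < 0 else 0))  # bit 3 is the sign
    return shared_exp, codes


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte, low nibble first (assumed order)."""
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def dequantize_block(shared_exp, codes):
    """Invert quantize_block up to rounding error."""
    scale = 2.0 ** shared_exp
    return [(-1.0 if c & 0x8 else 1.0) * E2M1_VALUES[c & 0x7] * scale for c in codes]
```

Values that are already representable round-trip exactly: quantize_block([12.0, 1.0]) picks a shared exponent of 1, and dequantize_block recovers [12.0, 1.0].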
Bump the longlee0622/DeepGEMM pin from 7fb6e0d (stride fix only, which still deadlocks the kind::mxf4 2-CTA MegaMoE on small per-expert token tiles) to 9b2f238, the HEAD of user/jinshik/mxf4-dense-fp4-load-loop. The new pin adds f558a8e (loop the 2SM dense-FP4 A/B load per swizzle-atom; fixes the block_k=256 full_barrier tx-deficit deadlock) plus the repro test and doc. The repo is unchanged; the pin controls both the deep_gemm_cpp_tllm extension and the Python deep_gemm imported by the MEGAMOE_DEEPGEMM MoE backend.

Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
…DG_USE_MXF4_KIND knob

Add TestDeepSeekV4Flash::test_mxfp4_acts_4gpus_static_eplb: the same TP=4/EP=4/ADP static-EPLB GSM8K path as the MEGAMOE_DEEPGEMM variant, but with DG_USE_FP4_ACTS=1 + DG_USE_MXF4_KIND=1 so DeepGEMM runs the W4A4 (MXFP4xMXFP4) dense mxf4 kernel instead of the default MXFP8xMXFP4 path. monkeypatch sets the env on every MPI rank before LLM construction.

Add a DG_USE_MXF4_KIND parity knob to MegaMoEDeepGemm (mega_moe_deepgemm.py):

- _dg_use_mxf4_kind() mirrors _dg_use_fp4_acts() (env-read; the DG kernel selects the mainloop itself, no Python kwarg).
- Surface mxf4_kind in the SymmBuffer-alloc diagnostic log.
- Fail fast (ValueError) on DG_USE_MXF4_KIND=1 without DG_USE_FP4_ACTS=1, mirroring DG's host assert.

DG_USE_MXF4_KIND is a pure-performance knob: it does not change numerics vs kind::mxf8f6f4 (sentinel-verified rel-RMSE 0).

Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
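The fail-fast rule described in this commit can be sketched roughly as below. This is an illustration, not the code in mega_moe_deepgemm.py: the helper names _env_flag and check_dg_knobs are hypothetical, and treating exactly "1" as enabled is an assumption about how the knobs are parsed.

```python
import os


def _env_flag(name, env=None):
    """Hypothetical helper: a knob counts as enabled only when set to "1"."""
    env = os.environ if env is None else env
    return env.get(name, "0") == "1"


def check_dg_knobs(env=None):
    """Read both DeepGEMM knobs and reject the unsupported combination.

    Returns (fp4_acts, mxf4_kind). Raises ValueError when
    DG_USE_MXF4_KIND=1 is set without DG_USE_FP4_ACTS=1.
    """
    fp4_acts = _env_flag("DG_USE_FP4_ACTS", env)
    mxf4_kind = _env_flag("DG_USE_MXF4_KIND", env)
    if mxf4_kind and not fp4_acts:
        raise ValueError("DG_USE_MXF4_KIND=1 requires DG_USE_FP4_ACTS=1")
    return fp4_acts, mxf4_kind
```

Passing a plain dict as env keeps the check easy to unit-test without touching the process environment; calling it with no argument reads os.environ.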
/bot run --disable-fail-fast
PR_Github #54697 [ run ] triggered by Bot. Commit:
/bot kill
Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
/bot run --disable-fail-fast
PR_Github #54728 [ kill ] triggered by Bot. Commit:
/bot kill
PR_Github #54697 [ run ] completed with state
PR_Github #54728 [ kill ] completed with state
PR_Github #54730 [ run ] triggered by Bot. Commit:
PR_Github #54733 [ kill ] triggered by Bot. Commit:
PR_Github #54730 [ run ] completed with state
PR_Github #54733 [ kill ] completed with state
MpiPoolSession._start_mpi_pool only forwarded TRTLLM*/TLLM* environment variables to the mpi4py workers, so DeepGEMM knobs (DG_USE_FP4_ACTS, DG_USE_MXF4_KIND) set in the driver (e.g. via a test monkeypatch) never reached the workers; the workers silently ran the default FP8-acts kernel instead of W4A4 mxf4. Forward DG_* too so the worker-side kernel/SymmBuffer selection matches the driver's intent.

Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
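The prefix-based forwarding described here might look something like the sketch below. It is not the MpiPoolSession code: select_forwarded_env and FORWARDED_PREFIXES are hypothetical names, and how the selected variables are actually handed to the mpi4py workers is left out.

```python
import os

# Prefixes taken from the commit message: TRTLLM*/TLLM* were already
# forwarded, DG_* is the addition.
FORWARDED_PREFIXES = ("TRTLLM", "TLLM", "DG_")


def select_forwarded_env(environ=None, prefixes=FORWARDED_PREFIXES):
    """Return the subset of environ whose keys start with a forwarded prefix."""
    environ = os.environ if environ is None else environ
    # str.startswith accepts a tuple of prefixes.
    return {k: v for k, v in environ.items() if k.startswith(prefixes)}
```

With only the first two prefixes, a driver-side DG_USE_FP4_ACTS=1 would be filtered out of the result, which matches the silent fallback to the FP8-acts kernel that this commit fixes.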
/bot run --disable-fail-fast
PR_Github #54752 [ run ] triggered by Bot. Commit:
PR_Github #54752 [ run ] completed with state
@coderabbitai summary
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.