@Yui-Koi Yui-Koi commented Oct 20, 2025

Add flash-muon as a git submodule for optimized Newton-Schulz iterations.
Implements config-based selection between Muon and FlashMuon optimizers.

Technical Details:

  • Flash-muon reduces NS5 matmul FLOPs by ~50% via fused CUDA kernels (reference iteration sketched below)
  • Speedup is dimension- and hardware-dependent (1.2-2× on the optimizer step)
  • Only affects 2D weight matrices (Muon params), not embeddings/norms (AdamW)
  • Expected real training speedup ~1-2% based on param distribution and profiling estimates.
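
For context, a minimal sketch of the quintic Newton-Schulz iteration whose matmuls flash-muon fuses into CUDA kernels; the coefficients follow the public Muon reference implementation, and the helper name is illustrative:

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration approximating orthogonalization of a 2D weight gradient.
    a, b, c = (3.4445, -4.7750, 2.0315)   # coefficients from the public Muon reference
    X = G.bfloat16()
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T                            # work on the wide orientation
    X = X / (X.norm() + 1e-7)              # normalize so the spectral norm is <= 1
    for _ in range(steps):
        A = X @ X.T                        # these matmuls are what the fused kernels accelerate
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X
```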

Implementation:

  • Added as a git submodule instead of a pip package for upstream tracking
  • Config flag use_flash_muon (default: True) enables runtime selection
  • Modified trainer.py only; experiments untouched for reproducibility
  • Ternary operator selects the optimizer class based on the config flag (see the sketch below)
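
A minimal sketch of that selection, assuming the class names Muon and FlashMuon described above; the import paths and constructor arguments are illustrative:

```python
# Hypothetical excerpt from trainer.py: runtime selection between Muon and FlashMuon.
from muon import Muon              # reference Muon optimizer (import path assumed)
from flash_muon import FlashMuon   # fused-kernel variant from the flash-muon submodule (import path assumed)

def build_muon_optimizer(muon_params, config):
    # use_flash_muon (default: True) picks the optimizer class at runtime.
    optimizer_cls = FlashMuon if getattr(config, "use_flash_muon", True) else Muon
    # Constructor arguments are illustrative; the real call mirrors the existing Muon setup.
    return optimizer_cls(muon_params, lr=config.muon_lr, momentum=config.muon_momentum)
```

Setting use_flash_muon=False keeps the existing Muon path untouched, so the flag doubles as the fallback mentioned under Compatibility.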

Performance Impact:
Wall-clock speedup varies by GPU and matrix dimension:
  • H800: 0.9-1.56× (overhead at small dims, gains at large)
  • H20: 1.68-2.03× (consistent improvement)
  • A100: 1.19-1.78× (solid gains)
  • 4090: 1.0-1.90× (best at large dimensions)

The optimizer step is ~3-5% of training time and Muon handles ~75.7% of the parameters (calculated from the MoE model config). Assuming a median speedup of 1.6×, the fraction of that time saved is 1 - 1/1.6 = 0.375, so the theoretical end-to-end speedup is 0.04 × 0.757 × 0.375 ≈ 1.1% faster training.
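
The same estimate as a back-of-the-envelope script; all inputs are rough assumptions taken from the figures above:

```python
# Back-of-the-envelope end-to-end speedup estimate (all inputs are assumptions).
optimizer_step_fraction = 0.04   # optimizer step is ~3-5% of training time
muon_param_fraction = 0.757      # ~75.7% of params handled by Muon (from the MoE config)
median_step_speedup = 1.6        # assumed median flash-muon speedup on the optimizer step

time_saved_fraction = 1 - 1 / median_step_speedup              # = 0.375 of the affected time
end_to_end_gain = optimizer_step_fraction * muon_param_fraction * time_saved_fraction
print(f"Estimated end-to-end speedup: {end_to_end_gain:.2%}")  # ~1.1%
```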

Compatibility:

  • Experiments unchanged to preserve ablation reproducibility
  • Config fallback: set use_flash_muon=False if issues arise
  • Fresh clones require: git submodule update --init --recursive

Refs: https://github.com/nil0x9/flash-muon
Benchmarks: https://github.com/nil0x9/flash-muon#benchmarks

Breaking Changes: None

PS: I didn't have the GPU time needed to benchmark this properly, so the gains above are estimates, but it should be a net positive improvement based on my calculations.

@Yui-Koi Yui-Koi marked this pull request as draft October 20, 2025 09:28
@Yui-Koi Yui-Koi force-pushed the feat/add-flash-muon branch from 969b9b5 to e3728c1 on October 20, 2025 16:44