feat(optimizers): integrate flash-muon with runtime selection #39
Add flash-muon as a git submodule for optimized Newton-Schulz iterations.
Implements config-based selection between the Muon and FlashMuon optimizers.
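For context, Muon orthogonalizes each 2D gradient update with a quintic Newton-Schulz iteration; those matmul-heavy steps are what flash-muon accelerates with fused kernels. A reference sketch of the iteration (coefficients are the standard ones from the Muon reference implementation, not code from this PR):

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration.

    Reference sketch only: this is the standard Muon iteration that
    flash-muon's fused kernels speed up, not this PR's exact code.
    """
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    if G.size(0) > G.size(1):
        X = X.T                   # iterate on the wide orientation
    X = X / (X.norm() + 1e-7)     # scale so the spectral norm is <= 1
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X         # X <- a*X + b*(XX^T)X + c*(XX^T)^2 X
    if G.size(0) > G.size(1):
        X = X.T
    return X
```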
Technical Details:

Implementation:
- `use_flash_muon` (default: `True`) enables runtime selection (a hedged sketch of the selection logic follows).
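A minimal sketch of what the selection could look like. Only `use_flash_muon` is a config field from this PR; the import paths, class names, and constructor arguments below are assumptions:

```python
# Hypothetical sketch of the runtime selection; module paths and the
# constructor signature are assumptions, not this PR's exact code.
try:
    from flash_muon import Muon as FlashMuon  # fused kernels (assumed path)
except ImportError:
    FlashMuon = None

from muon import Muon  # baseline implementation (assumed path)


def build_muon(params, config):
    """Return FlashMuon when enabled and importable, else fall back to Muon."""
    if getattr(config, "use_flash_muon", True) and FlashMuon is not None:
        return FlashMuon(params, lr=config.lr, momentum=config.momentum)
    return Muon(params, lr=config.lr, momentum=config.momentum)
```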
Performance Impact:

Wall-clock speedup varies by GPU and matrix dimension:
- H800: 0.9-1.56× (overhead at small dims, gains at large)
- H20: 1.68-2.03× (consistent improvement)
- A100: 1.19-1.78× (solid gains)
- 4090: 1.0-1.90× (best at large dimensions)
The optimizer step is ~3-5% of training time, and Muon handles ~75.7% of the parameters (calculated from the MoE model config). Assuming a median speedup of 1.6×, the fraction of affected time saved is 1 − 1/1.6 = 0.375, so the theoretical end-to-end gain is roughly 0.04 × 0.757 × 0.375 ≈ 1.1% faster training (arithmetic spelled out below).
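The same estimate as plain Python, using only the numbers above:

```python
optimizer_frac = 0.04    # optimizer step: ~3-5% of training time, midpoint ~4%
muon_param_frac = 0.757  # share of params Muon handles (from the MoE config)
median_speedup = 1.6     # assumed median kernel speedup

# A 1.6x speedup saves 1 - 1/1.6 = 0.375 of the time on the affected work.
time_saved_frac = 1 - 1 / median_speedup

end_to_end_gain = optimizer_frac * muon_param_frac * time_saved_frac
print(f"~{end_to_end_gain:.2%} faster training")  # ~1.14%
```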
Compatibility:
- Breaking Changes: None

Refs:
- https://github.com/nil0x9/flash-muon
- Benchmarks: https://github.com/nil0x9/flash-muon#benchmarks
PS: I didn't have the GPU time needed to benchmark this properly, so the gains above are estimates, but it should be a net positive improvement based on my calculations.