2 changes: 1 addition & 1 deletion llm_moe_pretraining/nemo/README.md
@@ -2,7 +2,7 @@

Large Language Model pretraining - DeepSeek V3 671B (Mixture of Experts).

- To closely follow the training recipe from the [DeepSeek-v3 Technical report](https://arxiv.org/html/2412.19437v2) which employs a batch size scheduling strategy, where the batch size is gradually increased from 3072 to 15360 in the training of the first 469B tokens, and then keeps 15360 in the remaining training of 4.8T tokens, the benchmark enforces all submissions to use a GBS of atleast 15360. This ensures that the benchmark is representative of DeepSeek-v3 pretraining while maintaining fairness because the task force experiments showed that convergence in the benchmarking region can be much faster with GBS << 15360.
+ To closely follow the training recipe from the [DeepSeek-v3 Technical report](https://arxiv.org/html/2412.19437v2) which employs a batch size scheduling strategy, where the batch size is gradually increased from 3072 to 15360 in the training of the first 469B tokens, and then keeps 15360 in the remaining training of 14.3T tokens, the benchmark enforces all submissions to use a GBS of atleast 15360. This ensures that the benchmark is representative of DeepSeek-v3 pretraining while maintaining fairness because the task force experiments showed that convergence in the benchmarking region can be much faster with GBS << 15360.

* **GBS<15360 - not allowed**
* GBS>=15360: `opt_base_learning_rate = 0.000024 * sqrt(GBS / 16384)`, rounded to the **9th** decimal place.
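
The learning-rate rule in the list above can be sketched in a few lines of Python. This is only an illustration of the formula as written in the README, not the benchmark's reference implementation: the function name `opt_base_learning_rate` mirrors the hyperparameter name, while `MIN_GBS` and the choice to raise `ValueError` for disallowed batch sizes are assumptions made for this example.

```python
import math

# Assumption for this sketch: the minimum allowed global batch size,
# taken from the "GBS<15360 - not allowed" rule in the README.
MIN_GBS = 15360


def opt_base_learning_rate(gbs: int) -> float:
    """Return the base learning rate for a given global batch size (GBS).

    Applies 0.000024 * sqrt(GBS / 16384) and rounds the result to the
    9th decimal place. Rejecting GBS < 15360 with an exception is a
    choice made for this sketch.
    """
    if gbs < MIN_GBS:
        raise ValueError(f"GBS must be at least {MIN_GBS}, got {gbs}")
    return round(0.000024 * math.sqrt(gbs / 16384), 9)


if __name__ == "__main__":
    for gbs in (15360, 16384, 32768):
        print(gbs, opt_base_learning_rate(gbs))
```

At GBS = 16384 the square-root factor is exactly 1, so the rule returns the unscaled base value of 0.000024; larger batch sizes scale it up by the square root of the ratio.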