From f7adbc8d833d46217ebe3ab624a3fdfac89893d1 Mon Sep 17 00:00:00 2001
From: Sahebjot singh
Date: Wed, 17 Jun 2026 17:53:27 +0530
Subject: [PATCH] deepseek: update README with corrected training token count

14.8T total tokens, after 469B we have ~14.3T tokens
---
 llm_moe_pretraining/nemo/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/llm_moe_pretraining/nemo/README.md b/llm_moe_pretraining/nemo/README.md
index ba5f44efc..01d815e80 100644
--- a/llm_moe_pretraining/nemo/README.md
+++ b/llm_moe_pretraining/nemo/README.md
@@ -2,7 +2,7 @@

 Large Language Model pretraining - DeepSeek V3 671B (Mixture of Experts).

-To closely follow the training recipe from the [DeepSeek-v3 Technical report](https://arxiv.org/html/2412.19437v2) which employs a batch size scheduling strategy, where the batch size is gradually increased from 3072 to 15360 in the training of the first 469B tokens, and then keeps 15360 in the remaining training of 4.8T tokens, the benchmark enforces all submissions to use a GBS of atleast 15360. This ensures that the benchmark is representative of DeepSeek-v3 pretraining while maintaining fairness because the task force experiments showed that convergence in the benchmarking region can be much faster with GBS << 15360.
+To closely follow the training recipe from the [DeepSeek-v3 Technical report](https://arxiv.org/html/2412.19437v2) which employs a batch size scheduling strategy, where the batch size is gradually increased from 3072 to 15360 in the training of the first 469B tokens, and then keeps 15360 in the remaining training of 14.3T tokens, the benchmark enforces all submissions to use a GBS of atleast 15360. This ensures that the benchmark is representative of DeepSeek-v3 pretraining while maintaining fairness because the task force experiments showed that convergence in the benchmarking region can be much faster with GBS << 15360.

 * **GBS<15360 - not allowed**
 * GBS>=15360: `opt_base_learning_rate = 0.000024 * sqrt(GBS / 16384)`, rounded to the **9th** decimal place.
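
The learning-rate rule in the list above, and the token arithmetic from the commit message, can be sanity-checked with a short sketch. This is only an illustration, not code from the benchmark: the function names `opt_base_learning_rate` and `remaining_tokens` and the constant names are hypothetical, and using Python's built-in `round` for "rounded to the 9th decimal place" is an assumption about the intended rounding mode.

```python
import math

# Values taken from the README text; the constant names are hypothetical.
MIN_GBS = 15360
REFERENCE_GBS = 16384
REFERENCE_LR = 0.000024


def opt_base_learning_rate(gbs: int) -> float:
    """Return 0.000024 * sqrt(GBS / 16384), rounded to 9 decimal places."""
    if gbs < MIN_GBS:
        # The README states that GBS < 15360 is not allowed.
        raise ValueError(f"GBS must be at least {MIN_GBS}, got {gbs}")
    # Assumption: built-in round() matches the intended rounding rule.
    return round(REFERENCE_LR * math.sqrt(gbs / REFERENCE_GBS), 9)


def remaining_tokens(total_tokens: int, ramp_tokens: int) -> int:
    """Tokens left to train after the batch-size ramp-up phase."""
    return total_tokens - ramp_tokens


if __name__ == "__main__":
    for gbs in (15360, 16384, 32768):
        print(gbs, opt_base_learning_rate(gbs))
    # 14.8T total minus the first 469B, as stated in the commit message.
    print(remaining_tokens(14_800_000_000_000, 469_000_000_000))
```

At the minimum allowed GBS of 15360 the rule gives roughly 0.000023238, and 14.8T minus 469B leaves 14.331T tokens, which matches the "~14.3T" figure the patch introduces.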