Congratulations on your ICLR oral presentation. Could you share the training cost details for the main experiments?
Thank you for your interest in our work! Training BD3-LMs consists of a pretraining step and a finetuning step, which together use 1M gradient updates. While pretraining is not strictly necessary, we find it helpful when training multiple BD3-LMs with varying block sizes L', since the forward pass of a BD3-LM with block size L' < L is more expensive (70-85% slower training).
Pretraining. We pre-train a base BD3-LM (with block size L' = L, corresponding to full diffusion) for 850K gradient updates. On LM1B, this takes ~3 days using 8xA5000 GPUs. On OWT, this takes ~9 days using 8xA100s (batch size 32).
Finetuning. We fine-tune BD3-LMs for any desired block size L' for 150K gradient updates. On LM1B, this takes ~1 day using 8xA5000 GPUs. On OWT, this takes ~3 days using 8xA100s (batch size 16).
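For a rough sense of total compute, here is a back-of-the-envelope conversion of the wall-clock figures above into GPU-hours; the day counts are approximate, so treat the totals as order-of-magnitude estimates:

```python
# Back-of-the-envelope GPU-hours derived from the approximate wall-clock times above;
# the day counts are rough, so treat these as order-of-magnitude figures.
def gpu_hours(days: float, num_gpus: int) -> float:
    return days * 24 * num_gpus

owt_pretrain  = gpu_hours(days=9, num_gpus=8)  # ~1728 A100-hours (850K updates)
owt_finetune  = gpu_hours(days=3, num_gpus=8)  # ~576 A100-hours per block size (150K updates)
lm1b_pretrain = gpu_hours(days=3, num_gpus=8)  # ~576 A5000-hours
lm1b_finetune = gpu_hours(days=1, num_gpus=8)  # ~192 A5000-hours per block size

print(f"OWT:  ~{owt_pretrain + owt_finetune:.0f} A100-hours per BD3-LM")      # ~2304
print(f"LM1B: ~{lm1b_pretrain + lm1b_finetune:.0f} A5000-hours per BD3-LM")   # ~768
```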
The wall-clock estimates above assume that you're using FlexAttention as the attention backend, following our default scripts. We recommend using a PyTorch version (>2.5) that is compatible with FlexAttention.
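For context, FlexAttention lets a block-causal attention pattern of the kind BD3-LMs rely on be expressed as a simple mask function. Below is a minimal, illustrative sketch, not the repo's actual code: SEQ_LEN, BLOCK_SIZE, and the tensor shapes are assumptions, and the real mask over noised and clean tokens is more involved.

```python
# Minimal sketch of a block-causal mask with FlexAttention (requires PyTorch >= 2.5).
# SEQ_LEN, BLOCK_SIZE, and the tensor shapes are illustrative assumptions.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

SEQ_LEN = 1024    # L: sequence length (assumed)
BLOCK_SIZE = 16   # L': block size of the BD3-LM (assumed)

def block_causal(b, h, q_idx, kv_idx):
    # Each token attends to its own block and to all earlier blocks.
    return (kv_idx // BLOCK_SIZE) <= (q_idx // BLOCK_SIZE)

block_mask = create_block_mask(block_causal, B=None, H=None,
                               Q_LEN=SEQ_LEN, KV_LEN=SEQ_LEN, device="cuda")

q = torch.randn(1, 8, SEQ_LEN, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = flex_attention(q, k, v, block_mask=block_mask)  # wrap in torch.compile for speed
```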