Congratulations on your ICLR oral presentation. Could you share the training cost details for the main experiments?
Thank you for your interest in our work! Training BD3-LMs consists of a pretraining step and a finetuning step, which together use 1M gradient updates. While pretraining is not strictly necessary, we find it helpful when training multiple BD3-LMs with varying block sizes L', since the forward pass of a BD3-LM with block size L' < L is more expensive (70-85% slower training).
Pretraining. We pre-train a base BD3-LM (with block size L' = L, corresponding to full diffusion) for 850K gradient updates. On LM1B, this takes ~3 days using 8xA5000 GPUs. On OWT, this takes ~9 days using 8xA100s (batch size 32).
Finetuning. We fine-tune BD3-LMs for any desired block size L' for 150K gradient updates. On LM1B, this takes ~1 day using 8xA5000 GPUs. On OWT, this takes ~3 days using 8xA100s (batch size 16).
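For a rough sense of total compute, here is a back-of-the-envelope conversion of the wall-clock figures above into GPU-hours; the day counts are approximate, so treat the totals as order-of-magnitude estimates:

```python
# Back-of-the-envelope GPU-hours derived from the approximate wall-clock times above;
# the day counts are rough, so treat these as order-of-magnitude figures.
def gpu_hours(days: float, num_gpus: int) -> float:
    return days * 24 * num_gpus

owt_pretrain  = gpu_hours(days=9, num_gpus=8)  # ~1728 A100-hours (850K updates)
owt_finetune  = gpu_hours(days=3, num_gpus=8)  # ~576 A100-hours per block size (150K updates)
lm1b_pretrain = gpu_hours(days=3, num_gpus=8)  # ~576 A5000-hours
lm1b_finetune = gpu_hours(days=1, num_gpus=8)  # ~192 A5000-hours per block size

print(f"OWT:  ~{owt_pretrain + owt_finetune:.0f} A100-hours per BD3-LM")      # ~2304
print(f"LM1B: ~{lm1b_pretrain + lm1b_finetune:.0f} A5000-hours per BD3-LM")   # ~768
```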
The wall-clock estimates above assume that you're using FlexAttention as the attention backend, following our default scripts. We recommend using a PyTorch version (>2.5) that is compatible with FlexAttention.
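For context, FlexAttention lets a block-causal attention pattern of the kind BD3-LMs rely on be expressed as a simple mask function. Below is a minimal, illustrative sketch, not the repo's actual code: SEQ_LEN, BLOCK_SIZE, and the tensor shapes are assumptions, and the real mask over noised and clean tokens is more involved.

```python
# Minimal sketch of a block-causal mask with FlexAttention (requires PyTorch >= 2.5).
# SEQ_LEN, BLOCK_SIZE, and the tensor shapes are illustrative assumptions.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

SEQ_LEN = 1024    # L: sequence length (assumed)
BLOCK_SIZE = 16   # L': block size of the BD3-LM (assumed)

def block_causal(b, h, q_idx, kv_idx):
    # Each token attends to its own block and to all earlier blocks.
    return (kv_idx // BLOCK_SIZE) <= (q_idx // BLOCK_SIZE)

block_mask = create_block_mask(block_causal, B=None, H=None,
                               Q_LEN=SEQ_LEN, KV_LEN=SEQ_LEN, device="cuda")

q = torch.randn(1, 8, SEQ_LEN, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = flex_attention(q, k, v, block_mask=block_mask)  # wrap in torch.compile for speed
```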