
Training Cost about bd3-lms. #5

Closed
Wiselnn570 opened this issue Mar 17, 2025 · 1 comment

Comments

@Wiselnn570

Congratulations on your ICLR oral presentation. Could you share the training cost details for the main experiments?

@mariannearr
Collaborator

mariannearr commented Mar 18, 2025

Thank you for your interest in our work! Training BD3-LMs consists of a pretraining step and a finetuning step that together use 1M gradient updates. While pretraining is not strictly necessary, we find it helpful when training multiple BD3-LMs with varying block sizes L', since the BD3-LM forward pass (for block size L' < L) is more expensive (70-85% slower training).

  • Pretraining. We pre-train a base BD3-LM (where the block size L'=L, corresponding to full diffusion) for 850K gradient updates. On LM1B, this takes ~3 days using 8xA5000 GPUs. On OWT, this takes ~9 days using 8xA100s (batch size 32).
  • Finetuning. We fine-tune BD3-LMs for any desired block size L' for 150K gradient updates. On LM1B, this takes ~1 day using 8xA5000 GPUs. On OWT, this takes ~3 days using 8xA100s (batch size 16).
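To make the arithmetic above concrete, here is a back-of-the-envelope sketch (the function names are mine; the figures are the approximate ones quoted in this thread, and one pretrained base model is shared across all finetuned block sizes):

```python
# Approximate BD3-LM training budget, per the numbers quoted above.
PRETRAIN_UPDATES = 850_000   # base BD3-LM, block size L' = L
FINETUNE_UPDATES = 150_000   # per target block size L'

def total_updates(num_block_sizes: int) -> int:
    """Gradient updates needed to train BD3-LMs for `num_block_sizes`
    different block sizes, reusing a single pretrained base model."""
    return PRETRAIN_UPDATES + num_block_sizes * FINETUNE_UPDATES

def owt_days(num_block_sizes: int) -> float:
    """Rough wall-clock days on 8xA100s for OWT:
    ~9 days of pretraining plus ~3 days per finetuned block size."""
    return 9 + 3 * num_block_sizes

print(total_updates(1))  # 1_000_000 -- matches the 1M figure above
print(owt_days(3))       # ~18 days for three block sizes
```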

The BD3-LM estimates assume that you're using FlexAttention as the backend, following our default scripts. We recommend using a PyTorch version (>2.5) that is compatible with FlexAttention.
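For context on why FlexAttention fits here: it builds an attention mask from a boolean predicate over query/key positions. Below is a simplified, pure-Python sketch of a block-causal predicate (full attention within a block, causal across blocks). This is illustrative only: FlexAttention's actual `mask_mod` also takes batch and head indices, and the real BD3-LM mask in the repo's scripts is more involved, so treat this as a conceptual sketch rather than the project's implementation.

```python
def block_causal_mask(q_idx: int, kv_idx: int, block_size: int) -> bool:
    """True if query token q_idx may attend to key token kv_idx under a
    block-causal pattern: tokens see their own block and all earlier blocks."""
    return kv_idx // block_size <= q_idx // block_size

# With block_size=4, token 5 (block 1) sees blocks 0 and 1 but not block 2:
print([block_causal_mask(5, k, 4) for k in range(12)])
# -> True for k in 0..7, False for k in 8..11
```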
