New Features
Load Checkpoint Callback (#1570)
We added support for Composer's LoadCheckpoint callback, which loads a checkpoint at a specified event. This enables use cases like loading model base weights with peft.
callbacks:
  load_checkpoint:
    load_path: /path/to/your/weights
Breaking Changes
Accumulate over tokens in a Batch for Training Loss (#1618, #1610, #1595)
We added a new flag, accumulate_train_batch_on_tokens, which specifies whether training loss is accumulated over the number of tokens in a batch rather than the number of samples. It defaults to True. This will slightly change loss curves for models trained with padding. To recover the old behavior, explicitly set the flag to False.
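To illustrate why loss curves shift when batches contain padding, here is a minimal pure-Python sketch of the two weighting schemes. It is not Composer's implementation: the function name accumulate_loss and the dictionary keys are hypothetical, and the numbers are made up for the example. It only shows how weighting each microbatch by its non-padding token count, rather than its sample count, changes the combined loss.

```python
# Illustrative sketch only; names and numbers are hypothetical,
# and this does not depend on Composer or PyTorch.

def accumulate_loss(microbatches, on_tokens):
    """Combine per-microbatch mean losses into a single batch loss.

    Each microbatch is a dict with:
      'loss'    : mean loss over the microbatch's non-padding tokens
      'samples' : number of samples in the microbatch
      'tokens'  : number of non-padding (loss-generating) tokens
    """
    # Pick the quantity used to weight each microbatch.
    key = 'tokens' if on_tokens else 'samples'
    total = sum(mb[key] for mb in microbatches)
    return sum(mb['loss'] * mb[key] / total for mb in microbatches)


microbatches = [
    {'loss': 2.0, 'samples': 2, 'tokens': 100},  # long sequences, little padding
    {'loss': 4.0, 'samples': 2, 'tokens': 20},   # short sequences, mostly padding
]

by_samples = accumulate_loss(microbatches, on_tokens=False)  # equal weight per sample
by_tokens = accumulate_loss(microbatches, on_tokens=True)    # equal weight per token

print(f"sample-weighted: {by_samples:.4f}")
print(f"token-weighted:  {by_tokens:.4f}")
```

In this toy case both microbatches hold the same number of samples, so sample weighting averages the two losses equally, while token weighting lets the microbatch with more non-padding tokens dominate. When every microbatch has the same token count, the two schemes agree.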
Default Run Name (#1611)
If no run name is provided, we now default to Composer's randomly generated run names. (Previously, the default run name was "llm".)
What's Changed
- Update mcli examples to use 0.13.0 by @irenedea in #1594
- Pass accumulate_train_batch_on_tokens through to composer by @dakinggg in #1595
- Loosen MegaBlocks version pin by @mvpatel2000 in #1597
- Add configurability for hf checkpointer register timeout by @dakinggg in #1599
- Loosen MegaBlocks to <1.0 by @mvpatel2000 in #1598
- Finetuning dataloader validation tweaks by @mvpatel2000 in #1600
- Bump onnx from 1.16.2 to 1.17.0 by @dependabot in #1604
- Remove TE from dockerfile and instead add as optional dependency by @snarayan21 in #1605
- Data prep on multiple GPUs by @eitanturok in #1576
- Add env var for configuring the maximum number of processes to use for dataset processing by @irenedea in #1606
- Updated error message for cluster check by @nancyhung in #1602
- Use fun default composer run names by @irenedea in #1611
- Ensure log messages are properly formatted again by @snarayan21 in #1614
- Add UC not enabled error for delta to json conversion by @irenedea in #1613
- Use a temporary directory for downloading finetuning dataset files by @irenedea in #1608
- Bump composer version to 0.26.0 by @irenedea in #1616
- Add loss generating token counts by @dakinggg in #1610
- Change accumulate_train_batch_on_tokens default to True by @dakinggg in #1618
- Bump version to 0.15.0.dev0 by @irenedea in #1621
- Add load checkpoint callback by @irenedea in #1570
Full Changelog: v0.13.0...v0.14.0