Add ddp/fsdp2 train scripts, tests, configs and utilities to llama3_native_te recipe #1327
+2,526
−0
Description
Adds comprehensive training infrastructure and distributed testing for the llama3_native_te recipe, following esm2 native te as an example. This PR adds training scripts, distributed checkpointing, and multi-GPU tests to the new llama3_native_te directory structure.
Key additions:
Tests included:
Usage
Run training with DDP:
cd bionemo-recipes/recipes/llama3_native_te
torchrun --nproc_per_node=2 train_ddp.py --config-name L0_sanity
Run training with FSDP2:
cd bionemo-recipes/recipes/llama3_native_te
torchrun --nproc_per_node=2 train_fsdp2.py --config-name L0_sanity
Run tests:
cd bionemo-recipes/recipes/llama3_native_te
pytest tests/ -v
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
ciflow:skip - Skip all CI tests for this PR
ciflow:notebooks - Run Jupyter notebooks execution tests for bionemo2
ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
ciflow:all - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running tests for all bionemo2.
ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.
Note: Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.
For more details, see CONTRIBUTING
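As a rough illustration of the label rules above, the sketch below maps a set of PR labels to the test suites that would run. It is not the repository's actual CI logic: the suite names (`unit`, `slow`, `notebooks`, `recipes`), the helper `suites_for_labels`, the assumption that labels add to the default unit tests, and the assumption that `ciflow:skip` overrides every other label are all hypothetical.

```python
# Illustrative sketch only. Suite names and label precedence are assumptions,
# not the repository's real CI implementation.
LABEL_SUITES = {
    "ciflow:notebooks": {"notebooks"},
    "ciflow:slow": {"slow"},
    "ciflow:all": {"unit", "slow", "notebooks"},
    "ciflow:all-recipes": {"recipes"},
}


def suites_for_labels(labels):
    """Return the set of test suites a PR with the given labels would run."""
    labels = set(labels)
    if "ciflow:skip" in labels:
        # Assumed: skip wins over every other label.
        return set()
    # Default behavior described above: only basic unit tests are run.
    suites = {"unit"}
    for label in labels:
        # Unknown labels contribute nothing.
        suites |= LABEL_SUITES.get(label, set())
    return suites


if __name__ == "__main__":
    print(sorted(suites_for_labels([])))
    print(sorted(suites_for_labels(["ciflow:slow"])))
    print(sorted(suites_for_labels(["ciflow:skip", "ciflow:all"])))
```

Note that no label in this sketch enables the `multi_gpu` or `distributed` tests, matching the note that they are not run in the PR pipeline.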
Pre-submit Checklist