@savitha-eng (Collaborator)

Description

Adds comprehensive training infrastructure and distributed testing for the llama3_native_te recipe, following the esm2_native_te recipe as an example. This PR adds training scripts, distributed checkpointing, and multi-GPU tests to the new llama3_native_te directory structure.

Key additions:

  • DDP and FSDP2 training scripts with TransformerEngine support
  • Distributed checkpointing with state save/load capabilities (see the sketch after this list)
  • Comprehensive test suite including multi-GPU distributed training tests
  • Hydra configuration files for sanity and convergence testing
  • Tiny Llama checkpoint (~9.6M params) for fast CI testing
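
For the checkpointing item above, here is a minimal sketch of the save/resume pattern, assuming the recipe builds on torch.distributed.checkpoint (dcp); the function names, state-dict layout, and `ckpt_dir` argument are illustrative, not the recipe's actual API:

```python
# Hedged sketch only; the recipe's real checkpoint helpers may differ.
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict


def save_checkpoint(model, optimizer, ckpt_dir: str) -> None:
    # Collect sharding-aware state dicts for the model and optimizer.
    model_state, optim_state = get_state_dict(model, optimizer)
    # Each rank writes its own shards; dcp coordinates the metadata.
    dcp.save({"model": model_state, "optim": optim_state}, checkpoint_id=ckpt_dir)


def load_checkpoint(model, optimizer, ckpt_dir: str) -> None:
    # Allocate state dicts matching the current sharding, load in place,
    # then push the loaded tensors back into the live objects.
    model_state, optim_state = get_state_dict(model, optimizer)
    state = {"model": model_state, "optim": optim_state}
    dcp.load(state, checkpoint_id=ckpt_dir)
    set_state_dict(
        model,
        optimizer,
        model_state_dict=state["model"],
        optim_state_dict=state["optim"],
    )
```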

Tests included:

  • Single-GPU training tests (DDP and FSDP2)
  • Multi-GPU distributed training tests (2-GPU DDP and FSDP2; sketched below)
  • Distributed checkpointing tests (save/load/resume)
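
The 2-GPU tests referenced above typically follow a launch-under-torchrun pattern; a hedged sketch (the @pytest.mark.multi_gpu marker comes from the CI notes below, but the exact test bodies are assumptions):

```python
# Illustrative only; the actual tests in tests/ may be structured differently.
import subprocess

import pytest
import torch


@pytest.mark.multi_gpu
@pytest.mark.parametrize("script", ["train_ddp.py", "train_fsdp2.py"])
def test_two_gpu_sanity_training(script):
    if torch.cuda.device_count() < 2:
        pytest.skip("requires at least 2 GPUs")
    # Launch the training entry point exactly as in the Usage section.
    result = subprocess.run(
        ["torchrun", "--nproc_per_node=2", script, "--config-name", "L0_sanity"],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0, result.stderr
```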

Usage

Run training with DDP:

cd bionemo-recipes/recipes/llama3_native_te
torchrun --nproc_per_node=2 train_ddp.py --config-name L0_sanity
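
The --config-name flag selects one of the Hydra YAML files (L0_sanity here). For context, a minimal sketch of what a Hydra-driven entry point like train_ddp.py looks like; the config_path and version_base values are assumptions, not the recipe's exact code:

```python
# Sketch of a standard Hydra entry point; field layout of cfg is assumed.
import hydra
from omegaconf import DictConfig


@hydra.main(config_path="hydra_config", config_name="L0_sanity", version_base="1.2")
def main(cfg: DictConfig) -> None:
    # cfg is built from the YAML picked via --config-name (plus CLI overrides).
    print(cfg)


if __name__ == "__main__":
    main()
```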

Run training with FSDP2:

cd bionemo-recipes/recipes/llama3_native_te
torchrun --nproc_per_node=2 train_fsdp2.py --config-name L0_sanity
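
The two entry points differ mainly in how the model is wrapped before the training loop. A rough sketch of that distinction, assuming the PyTorch 2.6+ fully_shard API and a Hugging Face Llama-style module tree (both assumptions about the recipe's internals):

```python
# Hedged sketch; the recipe's actual wrapping logic may differ.
import torch
from torch.distributed.fsdp import fully_shard  # FSDP2 API (PyTorch >= 2.6)
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_ddp(model: torch.nn.Module, local_rank: int) -> torch.nn.Module:
    # DDP replicates the full model on each rank and all-reduces gradients.
    return DDP(model.to(f"cuda:{local_rank}"), device_ids=[local_rank])


def wrap_fsdp2(model: torch.nn.Module) -> torch.nn.Module:
    # FSDP2 shards parameters per module: shard each transformer block
    # first, then the root, so unsharded memory peaks one block at a time.
    for block in model.model.layers:  # assumed Llama-style attribute path
        fully_shard(block)
    fully_shard(model)
    return model
```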

Run tests:

cd bionemo-recipes/recipes/llama3_native_te
pytest tests/ -v

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebook execution tests for bionemo2
  • ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to force running all bionemo2 tests.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to force running tests for all recipes.

Note: Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

copy-pr-bot bot commented Nov 17, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@savitha-eng marked this pull request as ready for review on November 17, 2025 at 21:00.
@savitha-eng changed the title from "Add \ddp/fsdp2 train scripts, tests, configs and utilities to llama3_native_te recipe" to "Add ddp/fsdp2 train scripts, tests, configs and utilities to llama3_native_te recipe" on Nov 17, 2025.
…add gradient checkpoint support

Signed-off-by: Savitha Srinivasan <[email protected]>
@pstjohn (Collaborator) left a comment:

lgtm as a start, i can put in a PR to fix a few nits as part of moving this to TE layers
