@savitha-eng (Collaborator)

Description

Adds comprehensive training infrastructure and distributed testing for the llama3_native_te recipe, following the esm2_native_te recipe as an example. This PR adds training scripts, distributed checkpointing, and multi-GPU tests to the new llama3_native_te directory structure.

Key additions:

  • DDP and FSDP2 training scripts with TransformerEngine support
  • Distributed checkpointing with state save/load capabilities (see the sketch after this list)
  • Comprehensive test suite including multi-GPU distributed training tests
  • Hydra configuration files for sanity and convergence testing
  • Tiny Llama checkpoint (~9.6M params) for fast CI testing
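
For the checkpointing item above, here is a minimal sketch of the save/resume pattern, assuming the recipe builds on torch.distributed.checkpoint (dcp); the function names, state-dict layout, and `ckpt_dir` argument are illustrative, not the recipe's actual API:

```python
# Hedged sketch only; the recipe's real checkpoint helpers may differ.
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict


def save_checkpoint(model, optimizer, ckpt_dir: str) -> None:
    # Collect sharding-aware state dicts for the model and optimizer.
    model_state, optim_state = get_state_dict(model, optimizer)
    # Each rank writes its own shards; dcp coordinates the metadata.
    dcp.save({"model": model_state, "optim": optim_state}, checkpoint_id=ckpt_dir)


def load_checkpoint(model, optimizer, ckpt_dir: str) -> None:
    # Allocate state dicts matching the current sharding, load in place,
    # then push the loaded tensors back into the live objects.
    model_state, optim_state = get_state_dict(model, optimizer)
    state = {"model": model_state, "optim": optim_state}
    dcp.load(state, checkpoint_id=ckpt_dir)
    set_state_dict(
        model,
        optimizer,
        model_state_dict=state["model"],
        optim_state_dict=state["optim"],
    )
```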

Tests included:

  • Single-GPU training tests (DDP and FSDP2)
  • Multi-GPU distributed training tests (2-GPU DDP and FSDP2; sketched below)
  • Distributed checkpointing tests (save/load/resume)
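
The 2-GPU tests referenced above typically follow a launch-under-torchrun pattern; a hedged sketch (the @pytest.mark.multi_gpu marker comes from the CI notes below, but the exact test bodies are assumptions):

```python
# Illustrative only; the actual tests in tests/ may be structured differently.
import subprocess

import pytest
import torch


@pytest.mark.multi_gpu
@pytest.mark.parametrize("script", ["train_ddp.py", "train_fsdp2.py"])
def test_two_gpu_sanity_training(script):
    if torch.cuda.device_count() < 2:
        pytest.skip("requires at least 2 GPUs")
    # Launch the training entry point exactly as in the Usage section.
    result = subprocess.run(
        ["torchrun", "--nproc_per_node=2", script, "--config-name", "L0_sanity"],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0, result.stderr
```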

Usage

Run training with DDP:

cd bionemo-recipes/recipes/llama3_native_te
torchrun --nproc_per_node=2 train_ddp.py --config-name L0_sanity
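
The --config-name flag selects one of the Hydra YAML files (L0_sanity here). For context, a minimal sketch of what a Hydra-driven entry point like train_ddp.py looks like; the config_path and version_base values are assumptions, not the recipe's exact code:

```python
# Sketch of a standard Hydra entry point; field layout of cfg is assumed.
import hydra
from omegaconf import DictConfig


@hydra.main(config_path="hydra_config", config_name="L0_sanity", version_base="1.2")
def main(cfg: DictConfig) -> None:
    # cfg is built from the YAML picked via --config-name (plus CLI overrides).
    print(cfg)


if __name__ == "__main__":
    main()
```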

Run training with FSDP2:

cd bionemo-recipes/recipes/llama3_native_te
torchrun --nproc_per_node=2 train_fsdp2.py --config-name L0_sanity
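
The two entry points differ mainly in how the model is wrapped before the training loop. A rough sketch of that distinction, assuming the PyTorch 2.6+ fully_shard API and a Hugging Face Llama-style module tree (both assumptions about the recipe's internals):

```python
# Hedged sketch; the recipe's actual wrapping logic may differ.
import torch
from torch.distributed.fsdp import fully_shard  # FSDP2 API (PyTorch >= 2.6)
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_ddp(model: torch.nn.Module, local_rank: int) -> torch.nn.Module:
    # DDP replicates the full model on each rank and all-reduces gradients.
    return DDP(model.to(f"cuda:{local_rank}"), device_ids=[local_rank])


def wrap_fsdp2(model: torch.nn.Module) -> torch.nn.Module:
    # FSDP2 shards parameters per module: shard each transformer block
    # first, then the root, so unsharded memory peaks one block at a time.
    for block in model.model.layers:  # assumed Llama-style attribute path
        fully_shard(block)
    fully_shard(model)
    return model
```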

Run tests:

cd bionemo-recipes/recipes/llama3_native_te
pytest tests/ -v

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebook execution tests for bionemo2
  • ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to force running all bionemo2 tests.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to force running tests for all recipes.

Note: Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

copy-pr-bot bot commented Nov 17, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@savitha-eng marked this pull request as ready for review on November 17, 2025 at 21:00.
@savitha-eng changed the title from "Add \ddp/fsdp2 train scripts, tests, configs and utilities to llama3_native_te recipe" to "Add ddp/fsdp2 train scripts, tests, configs and utilities to llama3_native_te recipe" on Nov 17, 2025.
…add gradient checkpoint support

Signed-off-by: Savitha Srinivasan <[email protected]>
@pstjohn (Collaborator) left a comment:

lgtm as a start, i can put in a PR to fix a few nits as part of moving this to TE layers
