Training scripts, tests, and config for llama3; very similar to ESM2 … #1319
Base branch: savitha/llama3-recipes-dataloader-add-tokenizer
Conversation
  - _self_

# Use tiny Llama config for fast convergence testing
model_tag: /workspaces/bionemo-framework/bionemo-recipes/recipes/llama3/tiny_llama_config
a relative path is probably going to be safer here
# Dataset configuration - use 2MB subset
dataset:
  tokenizer_path: /workspaces/bionemo-framework/bionemo-recipes/models/llama3/nucleotide_fast_tokenizer
same here, use a relative path. I think we want something similar to the example_8m_checkpoint directory we use in the esm2 examples
  use_lazy_tokenization: true
  load_dataset_kwargs:
    path: "parquet"
    data_files: "/workspaces/bionemo-framework/data/genomic_sequences_2mb.parquet"
i'd just put this in the llama3 recipe directory itself
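As a rough illustration of the relative-path suggestion (the `RECIPE_DIR` helper and the relative locations below are hypothetical, not the recipe's actual code), the paths could be resolved relative to the recipe directory instead of hard-coding `/workspaces/...`:

```python
# Hypothetical sketch only: resolve config paths relative to the recipe directory
# rather than using absolute /workspaces/... paths. Names are illustrative.
from pathlib import Path

RECIPE_DIR = Path(__file__).resolve().parent

model_tag = RECIPE_DIR / "tiny_llama_config"
tokenizer_path = (RECIPE_DIR / "../../models/llama3/nucleotide_fast_tokenizer").resolve()
data_files = RECIPE_DIR / "genomic_sequences_2mb.parquet"  # keep the 2MB parquet in the recipe dir
```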
Force-pushed from 490fc8f to 5c0316e
Force-pushed from e9b891f to cbb8328
Signed-off-by: savitha-eng <[email protected]>
…narios per the feedback Signed-off-by: savitha-eng <[email protected]>
Force-pushed from 5c0316e to cda848a
Force-pushed from 2e540ed to fc3d775
- Added use_stateful_dataloader parameter (defaults to False)
- Switch between StatefulDataLoader and regular DataLoader
- Set pin_memory=False when using StatefulDataLoader (BIONEMO-3246 workaround)
- Matches ESM2 implementation pattern
- All tests pass (8/8 dataset tests, 14/14 tokenizer tests)

Signed-off-by: savitha-eng <[email protected]>
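A minimal sketch of the switch this commit describes, assuming a simplified factory; the real create_bshd_dataloader takes more arguments, and the collator here is a stand-in:

```python
# Minimal sketch, not the recipe's actual code: toggle between a regular
# DataLoader and torchdata's StatefulDataLoader via a config flag.
from torch.utils.data import DataLoader
from torchdata.stateful_dataloader import StatefulDataLoader


def build_dataloader(dataset, collate_fn, batch_size, use_stateful_dataloader=False):
    if use_stateful_dataloader:
        # StatefulDataLoader exposes state_dict()/load_state_dict(), so the
        # dataloader position can be checkpointed and resumed.
        # pin_memory is disabled as a workaround (BIONEMO-3246 / pytorch/pytorch#163102).
        return StatefulDataLoader(
            dataset,
            batch_size=batch_size,
            collate_fn=collate_fn,
            pin_memory=False,
        )
    return DataLoader(
        dataset,
        batch_size=batch_size,
        collate_fn=collate_fn,
        pin_memory=True,
    )
```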
Force-pushed from b3e99eb to eae1e5c
- Added sequence_column parameter to create_tokenized_dataset and create_bshd_dataloader
- Defaults to 'sequence' for backwards compatibility
- Supports any column name (e.g., 'Text' for arcinstitute/opengenome2, 'nt_sequence' for SQLite data)
- Validates column exists with helpful error messages
- Removes hardcoded nt_sequence special case
- All existing tests pass (8/8) with default parameter

Signed-off-by: savitha-eng <[email protected]>
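An illustrative sketch of the column validation, assuming a Hugging Face datasets.Dataset; the actual create_tokenized_dataset signature and tokenization logic differ:

```python
# Illustrative only: validate the configurable sequence_column before tokenizing.
def create_tokenized_dataset(dataset, tokenizer, sequence_column="sequence"):
    if sequence_column not in dataset.column_names:
        raise ValueError(
            f"Column '{sequence_column}' not found in dataset. "
            f"Available columns: {dataset.column_names}. "
            "Pass sequence_column to select the text column to tokenize."
        )
    return dataset.map(lambda example: tokenizer(example[sequence_column]))
```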
…native te Signed-off-by: savitha-eng <[email protected]>
Signed-off-by: savitha-eng <[email protected]>
Signed-off-by: savitha-eng <[email protected]>
Signed-off-by: savitha-eng <[email protected]>
- Add comprehensive distributed checkpointing tests (8 tests total)
- Single and multi-GPU checkpoint save/resume for DDP and FSDP2
- Final model save tests for inference export
- Scheduler resume tests
- Disable pin_memory in dataloader due to PyTorch 2.9/torchdata 0.11 incompatibility
- Add checkpoint verification to multi-GPU tests
- Improve test documentation and docstrings
- Add wandb project config field to avoid hydra struct errors

Signed-off-by: Savitha Srinivasan <[email protected]>
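As a single-process illustration of the scheduler-resume behaviour these tests exercise (the model, step counts, and schedule below are placeholders, not the recipe's configuration):

```python
# Placeholder sketch: save scheduler/optimizer state, resume into fresh objects,
# and check the learning rate picks up where it left off.
import torch

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.1, total_iters=100)

# "Train" for a few steps, then capture state.
for _ in range(10):
    optimizer.step()
    scheduler.step()
ckpt = {"optimizer": optimizer.state_dict(), "scheduler": scheduler.state_dict()}

# Resume into fresh objects and verify the schedule continues from the saved step.
new_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
new_scheduler = torch.optim.lr_scheduler.LinearLR(new_optimizer, start_factor=1.0, end_factor=0.1, total_iters=100)
new_optimizer.load_state_dict(ckpt["optimizer"])
new_scheduler.load_state_dict(ckpt["scheduler"])
assert new_scheduler.get_last_lr() == scheduler.get_last_lr()
```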
- Added use_stateful_dataloader: false to all hydra configs (matches ESM2)
- Updated train_ddp.py and train_fsdp2.py to conditionally pass dataloader to checkpoint functions
- Updated test_distributed_checkpointing.py to enable stateful dataloader in all tests
- Works around pin_memory issue (pytorch/pytorch#163102) by defaulting to regular DataLoader
- Tests can still validate full checkpoint/resume with use_stateful_dataloader=true

Signed-off-by: savitha-eng <[email protected]>
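A hypothetical sketch of the conditional dataloader handoff; save_checkpoint here stands in for the recipe's actual checkpoint helper:

```python
# Hypothetical helper, not the recipe's actual function: the dataloader state is
# only saved when a stateful dataloader (i.e. one with state_dict()) is passed in.
import torch


def save_checkpoint(path, model, optimizer, scheduler, dataloader=None):
    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
    }
    # With use_stateful_dataloader=false the training scripts pass dataloader=None,
    # so this block is skipped and a regular DataLoader never needs a state_dict().
    if dataloader is not None and hasattr(dataloader, "state_dict"):
        state["dataloader"] = dataloader.state_dict()
    torch.save(state, path)
```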
- Disable resume_from_checkpoint in convergence tests (test_train.py)
  - These tests don't need checkpointing, just convergence validation
  - Prevents NoneType error when use_stateful_dataloader=false
- Enable use_stateful_dataloader in checkpointing tests (test_train_two_gpu.py)
  - Required for checkpoint save/resume functionality
  - Ensures dataloader state is preserved across checkpoints
- Add use_stateful_dataloader to scheduler resume test (test_distributed_checkpointing.py)
  - Needed for phase 2 resume to work correctly

All 26 tests now pass.

Signed-off-by: Savitha Srinivasan <[email protected]>
Force-pushed from 548bb02 to 10d94c9
…native te
Description
Usage
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline. For more details, see CONTRIBUTING.
Note
By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources. Pull requests from authorized contributors will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123). To trigger CI, a maintainer must leave an /ok to test comment on the pull request; this will need to be done for each new commit.

Pre-submit Checklist