
Conversation

@savitha-eng
Collaborator


Description

Usage

TODO: Add code snippet
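In the meantime, a rough Python sketch of how the new dataloader could be invoked. The function name create_bshd_dataloader and the argument names are taken from this PR's commit messages, but the import path and exact signature are assumptions, not the final API:

    # Rough sketch only: import path and signature are assumed from this PR's
    # commit messages (create_bshd_dataloader, sequence_column,
    # use_stateful_dataloader, use_lazy_tokenization, load_dataset_kwargs).
    from dataset import create_bshd_dataloader  # hypothetical module path

    dataloader = create_bshd_dataloader(
        tokenizer_path="nucleotide_fast_tokenizer",   # tokenizer shipped with the recipe
        load_dataset_kwargs={
            "path": "parquet",
            "data_files": "genomic_sequences_2mb.parquet",
        },
        sequence_column="sequence",      # default; e.g. "Text" for arcinstitute/opengenome2
        use_lazy_tokenization=True,
        use_stateful_dataloader=False,   # True enables checkpointable dataloader state
    )

    for batch in dataloader:
        ...  # training loop lives in train_ddp.py / train_fsdp2.py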

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebooks execution tests for bionemo2
  • ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to force the full bionemo2 test suite to run.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.
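For reference, these markers are ordinary pytest marks applied to test functions; a minimal sketch (the test names are made up, and custom marks are normally registered in the project's pytest configuration):

    import pytest

    @pytest.mark.multi_gpu   # skipped in the default PR pipeline
    def test_train_fsdp2_two_gpus():
        ...

    @pytest.mark.slow        # runs when the ciflow:slow label is applied
    def test_tiny_llama_convergence():
        ...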

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully


copy-pr-bot bot commented Nov 12, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

- _self_

# Use tiny Llama config for fast convergence testing
model_tag: /workspaces/bionemo-framework/bionemo-recipes/recipes/llama3/tiny_llama_config
Collaborator

a relative path is probably going to be safer here
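If this switches to a relative path, one wrinkle worth noting: depending on the Hydra version and the hydra.job.chdir setting, the run's working directory may differ from the launch directory, so a config-relative path is usually resolved with hydra.utils.to_absolute_path. A minimal Python sketch (model_tag is the key from the diff above; the cfg object and the value it holds are illustrative):

    from hydra.utils import to_absolute_path

    # Resolve "tiny_llama_config" (or any relative model_tag) against the
    # directory the job was launched from, not Hydra's per-run output dir.
    model_path = to_absolute_path(cfg.model_tag)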


# Dataset configuration - use 2MB subset
dataset:
  tokenizer_path: /workspaces/bionemo-framework/bionemo-recipes/models/llama3/nucleotide_fast_tokenizer
Collaborator

same here, use a relative path. I think we want something similar to the example_8m_checkpoint directory we use in the esm2 examples

  use_lazy_tokenization: true
  load_dataset_kwargs:
    path: "parquet"
    data_files: "/workspaces/bionemo-framework/data/genomic_sequences_2mb.parquet"
Collaborator

i'd just put this in the llama3 recipe directory itself

@savitha-eng force-pushed the savitha/llama3-recipes-dataloader-add-dataset branch from 490fc8f to 5c0316e on November 14, 2025 01:16
@savitha-eng force-pushed the savitha/llama3-recipes-dataloader-add-train-script branch from e9b891f to cbb8328 on November 14, 2025 01:21
@savitha-eng force-pushed the savitha/llama3-recipes-dataloader-add-dataset branch from 5c0316e to cda848a on November 14, 2025 19:44
@savitha-eng force-pushed the savitha/llama3-recipes-dataloader-add-train-script branch from 2e540ed to fc3d775 on November 14, 2025 19:46
- Added use_stateful_dataloader parameter (defaults to False)
- Switch between StatefulDataLoader and regular DataLoader
- Set pin_memory=False when using StatefulDataLoader (BIONEMO-3246 workaround)
- Matches ESM2 implementation pattern
- All tests pass (8/8 dataset tests, 14/14 tokenizer tests)

Signed-off-by: savitha-eng <[email protected]>
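As a sketch of the pattern this commit describes (the helper below is illustrative, not the PR's exact code; StatefulDataLoader is the torchdata class):

    from torch.utils.data import DataLoader
    from torchdata.stateful_dataloader import StatefulDataLoader

    def build_dataloader(dataset, batch_size, use_stateful_dataloader=False):
        if use_stateful_dataloader:
            # StatefulDataLoader exposes state_dict()/load_state_dict(), so the
            # iteration position can be saved and restored with checkpoints.
            # pin_memory stays off as the BIONEMO-3246 / pytorch/pytorch#163102
            # workaround mentioned in this PR.
            return StatefulDataLoader(dataset, batch_size=batch_size, pin_memory=False)
        return DataLoader(dataset, batch_size=batch_size)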
@savitha-eng force-pushed the savitha/llama3-recipes-dataloader-add-train-script branch from b3e99eb to eae1e5c on November 15, 2025 01:01
- Added sequence_column parameter to create_tokenized_dataset and create_bshd_dataloader
- Defaults to 'sequence' for backwards compatibility
- Supports any column name (e.g., 'Text' for arcinstitute/opengenome2, 'nt_sequence' for SQLite data)
- Validates column exists with helpful error messages
- Removes hardcoded nt_sequence special case
- All existing tests pass (8/8) with default parameter

Signed-off-by: savitha-eng <[email protected]>
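A sketch of the column validation described above; the function body is illustrative, but sequence_column and its "sequence" default come from the commit message:

    from datasets import load_dataset

    def create_tokenized_dataset(load_dataset_kwargs, tokenizer, sequence_column="sequence"):
        # Illustrative sketch, not the PR's exact implementation.
        dataset = load_dataset(split="train", **load_dataset_kwargs)
        if sequence_column not in dataset.column_names:
            raise ValueError(
                f"Column '{sequence_column}' not found in dataset; "
                f"available columns: {dataset.column_names}"
            )
        return dataset.map(lambda batch: tokenizer(batch[sequence_column]), batched=True)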
- Add comprehensive distributed checkpointing tests (8 tests total)
  - Single and multi-GPU checkpoint save/resume for DDP and FSDP2
  - Final model save tests for inference export
  - Scheduler resume tests
- Disable pin_memory in dataloader due to PyTorch 2.9/torchdata 0.11 incompatibility
- Add checkpoint verification to multi-GPU tests
- Improve test documentation and docstrings
- Add wandb project config field to avoid hydra struct errors

Signed-off-by: Savitha Srinivasan <[email protected]>
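For context, a minimal sketch of the torch.distributed.checkpoint save/resume cycle that tests like these typically exercise; the function names and structure here are illustrative, not this PR's test code:

    import torch.distributed.checkpoint as dcp
    from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

    def save_sharded_checkpoint(model, optimizer, path):
        # Works for DDP- and FSDP2-wrapped modules; each rank writes its shard.
        model_sd, optim_sd = get_state_dict(model, optimizer)
        dcp.save({"model": model_sd, "optim": optim_sd}, checkpoint_id=path)

    def resume_sharded_checkpoint(model, optimizer, path):
        # Load into a freshly constructed model/optimizer, then push state back in.
        model_sd, optim_sd = get_state_dict(model, optimizer)
        dcp.load({"model": model_sd, "optim": optim_sd}, checkpoint_id=path)
        set_state_dict(model, optimizer, model_state_dict=model_sd, optim_state_dict=optim_sd)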
- Added use_stateful_dataloader: false to all hydra configs (matches ESM2)
- Updated train_ddp.py and train_fsdp2.py to conditionally pass dataloader to checkpoint functions
- Updated test_distributed_checkpointing.py to enable stateful dataloader in all tests
- Works around pin_memory issue (pytorch/pytorch#163102) by defaulting to regular DataLoader
- Tests can still validate full checkpoint/resume with use_stateful_dataloader=true

Signed-off-by: savitha-eng <[email protected]>
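A sketch of that conditional wiring (the helper and config layout are stand-ins, not confirmed names from the PR):

    def dataloader_for_checkpoint(train_dataloader, use_stateful_dataloader: bool):
        # Only a StatefulDataLoader carries resumable iteration state; with the
        # regular DataLoader the training scripts pass None so the checkpoint
        # helpers skip dataloader state entirely.
        return train_dataloader if use_stateful_dataloader else None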
- Disable resume_from_checkpoint in convergence tests (test_train.py)
  - These tests don't need checkpointing, just convergence validation
  - Prevents NoneType error when use_stateful_dataloader=false
- Enable use_stateful_dataloader in checkpointing tests (test_train_two_gpu.py)
  - Required for checkpoint save/resume functionality
  - Ensures dataloader state is preserved across checkpoints
- Add use_stateful_dataloader to scheduler resume test (test_distributed_checkpointing.py)
  - Needed for phase 2 resume to work correctly

All 26 tests now pass.

Signed-off-by: Savitha Srinivasan <[email protected]>
@savitha-eng force-pushed the savitha/llama3-recipes-dataloader-add-train-script branch from 548bb02 to 10d94c9 on November 15, 2025 02:29
Base automatically changed from savitha/llama3-recipes-dataloader-add-dataset to savitha/llama3-recipes-dataloader-add-tokenizer on November 17, 2025 14:38
@savitha-eng marked this pull request as ready for review on November 17, 2025 18:10
@savitha-eng marked this pull request as draft on November 17, 2025 19:23
@savitha-eng reopened this on Nov 17, 2025