Training scripts, tests, and config for llama3; very similar to ESM2 … #1319
Base branch: savitha/llama3-recipes-dataloader-add-tokenizer
Conversation
  - _self_

# Use tiny Llama config for fast convergence testing
model_tag: /workspaces/bionemo-framework/bionemo-recipes/recipes/llama3/tiny_llama_config
a relative path is probably going to be safer here
# Dataset configuration - use 2MB subset
dataset:
  tokenizer_path: /workspaces/bionemo-framework/bionemo-recipes/models/llama3/nucleotide_fast_tokenizer
same here, use a relative path. I think we want something similar to the example_8m_checkpoint directory we use in the esm2 examples
  use_lazy_tokenization: true
  load_dataset_kwargs:
    path: "parquet"
    data_files: "/workspaces/bionemo-framework/data/genomic_sequences_2mb.parquet"
i'd just put this in the llama3 recipe directory itself
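As a rough illustration of the relative-path suggestion (the `RECIPE_DIR` helper and the relative locations below are hypothetical, not the recipe's actual code), the paths could be resolved relative to the recipe directory instead of hard-coding `/workspaces/...`:

```python
# Hypothetical sketch only: resolve config paths relative to the recipe directory
# rather than using absolute /workspaces/... paths. Names are illustrative.
from pathlib import Path

RECIPE_DIR = Path(__file__).resolve().parent

model_tag = RECIPE_DIR / "tiny_llama_config"
tokenizer_path = (RECIPE_DIR / "../../models/llama3/nucleotide_fast_tokenizer").resolve()
data_files = RECIPE_DIR / "genomic_sequences_2mb.parquet"  # keep the 2MB parquet in the recipe dir
```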
Force-pushed from 490fc8f to 5c0316e
Force-pushed from e9b891f to cbb8328
Signed-off-by: savitha-eng <[email protected]>
…narios per the feedback Signed-off-by: savitha-eng <[email protected]>
Force-pushed from 5c0316e to cda848a
Force-pushed from 2e540ed to fc3d775
- Added use_stateful_dataloader parameter (defaults to False)
- Switch between StatefulDataLoader and regular DataLoader
- Set pin_memory=False when using StatefulDataLoader (BIONEMO-3246 workaround)
- Matches ESM2 implementation pattern
- All tests pass (8/8 dataset tests, 14/14 tokenizer tests)

Signed-off-by: savitha-eng <[email protected]>
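A minimal sketch of the switch this commit describes, assuming a simplified factory; the real create_bshd_dataloader takes more arguments, and the collator here is a stand-in:

```python
# Minimal sketch, not the recipe's actual code: toggle between a regular
# DataLoader and torchdata's StatefulDataLoader via a config flag.
from torch.utils.data import DataLoader
from torchdata.stateful_dataloader import StatefulDataLoader


def build_dataloader(dataset, collate_fn, batch_size, use_stateful_dataloader=False):
    if use_stateful_dataloader:
        # StatefulDataLoader exposes state_dict()/load_state_dict(), so the
        # dataloader position can be checkpointed and resumed.
        # pin_memory is disabled as a workaround (BIONEMO-3246 / pytorch/pytorch#163102).
        return StatefulDataLoader(
            dataset,
            batch_size=batch_size,
            collate_fn=collate_fn,
            pin_memory=False,
        )
    return DataLoader(
        dataset,
        batch_size=batch_size,
        collate_fn=collate_fn,
        pin_memory=True,
    )
```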
Force-pushed from b3e99eb to eae1e5c
- Added sequence_column parameter to create_tokenized_dataset and create_bshd_dataloader
- Defaults to 'sequence' for backwards compatibility
- Supports any column name (e.g., 'Text' for arcinstitute/opengenome2, 'nt_sequence' for SQLite data)
- Validates column exists with helpful error messages
- Removes hardcoded nt_sequence special case
- All existing tests pass (8/8) with default parameter

Signed-off-by: savitha-eng <[email protected]>
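An illustrative sketch of the column validation, assuming a Hugging Face datasets.Dataset; the actual create_tokenized_dataset signature and tokenization logic differ:

```python
# Illustrative only: validate the configurable sequence_column before tokenizing.
def create_tokenized_dataset(dataset, tokenizer, sequence_column="sequence"):
    if sequence_column not in dataset.column_names:
        raise ValueError(
            f"Column '{sequence_column}' not found in dataset. "
            f"Available columns: {dataset.column_names}. "
            "Pass sequence_column to select the text column to tokenize."
        )
    return dataset.map(lambda example: tokenizer(example[sequence_column]))
```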
…native te Signed-off-by: savitha-eng <[email protected]>
Signed-off-by: savitha-eng <[email protected]>
Signed-off-by: savitha-eng <[email protected]>
Signed-off-by: savitha-eng <[email protected]>
- Add comprehensive distributed checkpointing tests (8 tests total)
- Single and multi-GPU checkpoint save/resume for DDP and FSDP2
- Final model save tests for inference export
- Scheduler resume tests
- Disable pin_memory in dataloader due to PyTorch 2.9/torchdata 0.11 incompatibility
- Add checkpoint verification to multi-GPU tests
- Improve test documentation and docstrings
- Add wandb project config field to avoid hydra struct errors

Signed-off-by: Savitha Srinivasan <[email protected]>
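As a single-process illustration of the scheduler-resume behaviour these tests exercise (the model, step counts, and schedule below are placeholders, not the recipe's configuration):

```python
# Placeholder sketch: save scheduler/optimizer state, resume into fresh objects,
# and check the learning rate picks up where it left off.
import torch

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.1, total_iters=100)

# "Train" for a few steps, then capture state.
for _ in range(10):
    optimizer.step()
    scheduler.step()
ckpt = {"optimizer": optimizer.state_dict(), "scheduler": scheduler.state_dict()}

# Resume into fresh objects and verify the schedule continues from the saved step.
new_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
new_scheduler = torch.optim.lr_scheduler.LinearLR(new_optimizer, start_factor=1.0, end_factor=0.1, total_iters=100)
new_optimizer.load_state_dict(ckpt["optimizer"])
new_scheduler.load_state_dict(ckpt["scheduler"])
assert new_scheduler.get_last_lr() == scheduler.get_last_lr()
```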
- Added use_stateful_dataloader: false to all hydra configs (matches ESM2)
- Updated train_ddp.py and train_fsdp2.py to conditionally pass dataloader to checkpoint functions
- Updated test_distributed_checkpointing.py to enable stateful dataloader in all tests
- Works around pin_memory issue (pytorch/pytorch#163102) by defaulting to regular DataLoader
- Tests can still validate full checkpoint/resume with use_stateful_dataloader=true

Signed-off-by: savitha-eng <[email protected]>
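A hypothetical sketch of the conditional dataloader handoff; save_checkpoint here stands in for the recipe's actual checkpoint helper:

```python
# Hypothetical helper, not the recipe's actual function: the dataloader state is
# only saved when a stateful dataloader (i.e. one with state_dict()) is passed in.
import torch


def save_checkpoint(path, model, optimizer, scheduler, dataloader=None):
    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
    }
    # With use_stateful_dataloader=false the training scripts pass dataloader=None,
    # so this block is skipped and a regular DataLoader never needs a state_dict().
    if dataloader is not None and hasattr(dataloader, "state_dict"):
        state["dataloader"] = dataloader.state_dict()
    torch.save(state, path)
```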
- Disable resume_from_checkpoint in convergence tests (test_train.py)
  - These tests don't need checkpointing, just convergence validation
  - Prevents NoneType error when use_stateful_dataloader=false
- Enable use_stateful_dataloader in checkpointing tests (test_train_two_gpu.py)
  - Required for checkpoint save/resume functionality
  - Ensures dataloader state is preserved across checkpoints
- Add use_stateful_dataloader to scheduler resume test (test_distributed_checkpointing.py)
  - Needed for phase 2 resume to work correctly

All 26 tests now pass.

Signed-off-by: Savitha Srinivasan <[email protected]>
Force-pushed from 548bb02 to 10d94c9
…native te
Description
Usage
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline. For more details, see CONTRIBUTING.
Note
By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources. Pull requests from authorized contributors will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123). To trigger CI, a maintainer must leave an /ok to test comment on the pull request; this will need to be done for each new commit.

Pre-submit Checklist