20 Dec 16:30

cspades

fd441fa

NVIDIA BioNeMo Framework v2.2 Latest

New Features

Small Molecule Featurization
- Implemented elementary and advanced atom, bond, and full molecule featurizers.
GH200 Support for BioNeMo
- Added a Dockerfile.arm that builds a BioNeMo container that runs on GH200 machines.
- Published a multi-architecture version of the BioNeMo container to NGC.

Updates & Improvements

Single-Cell Dataloader (SCDL)
- Changed metadata storage to Parquet files, yielding a 30x speedup when iterating over a large dataset.
- Added functionality to concatenate several anndata files without doubling disk usage.
ESM2
- Added support for SIGTERM preemption checkpoint saving.
- Moved ESM-2 and Geneformer training scripts to new executables, train_esm2 and train_geneformer, respectively.
- Moved the inference script to a new executable, infer_esm2, and deprecated the inference example in the fine-tuning tutorial.
- Added new Jupyter notebook tutorials for inference and zero-shot protein design. These notebooks can be deployed on cloud resources as a brev.dev launchable.
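The SIGTERM preemption checkpoint saving listed above can be pictured with a minimal, standard-library sketch. It is not the BioNeMo or NeMo callback: `PreemptionGuard`, `train`, `save_checkpoint`, and `on_step` are hypothetical names used only for illustration. The idea shown is that the signal handler only sets a flag, and the training loop saves a checkpoint at the next safe point.

```python
import signal


class PreemptionGuard:
    """Records that SIGTERM arrived so the loop can checkpoint and exit cleanly."""

    def __init__(self, sig=signal.SIGTERM):
        self.received = False
        self._sig = sig
        # Install our handler and remember the previous one so it can be restored.
        self._previous = signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; the expensive save happens in the training loop.
        self.received = True

    def restore(self):
        # Put back whatever handler was installed before this guard.
        signal.signal(self._sig, self._previous)


def train(num_steps, save_checkpoint, guard, on_step=None):
    """Hypothetical training loop: runs until finished or until preempted."""
    for step in range(num_steps):
        if on_step is not None:
            on_step(step)  # stand-in for the real forward/backward pass
        if guard.received:
            # Preemption requested: save state, then stop at this step.
            save_checkpoint(step)
            return step
    return num_steps
```

In a cluster setting the scheduler would send SIGTERM shortly before reclaiming the node; here `save_checkpoint` is any callable that takes the current step.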

Known Issues

Loading a checkpoint for Geneformer inference on H100 has a known regression in accuracy. Work is in progress to resolve this by the next release.

Changes

Move ESM2 scripts to sub-packages by @farhadrgh in #406
WAR: sets checkpoint filename to be more unique by @skothenhill-nv in #429
Update NeMo and Megatron to TOT by @pstjohn in #424
re-enable merge groups to trigger blossom-ci by @pstjohn in #431
Revert "re-enable merge groups to trigger blossom-ci" by @pstjohn in #434
Updated notebook, and nemo2 checkpoint with geneformer by @jstjohn in #430
add pre-emption callback to esm2 train by @pstjohn in #433
add rdkit dependency to bionemo-geometric by @sveccham in #432
eliminate the need for NGC login - bionemo2 by @dorotat-nv in #440
Add documentation and release info to README by @sirelkhatim in #447
Bump 3rdparty/Megatron-LM from aded519 to 5438d15 by @dependabot in #444
Launchable notebooks in docs! by @jstjohn in #451
Cache dev build from our nightly public container by @jstjohn in #462
set num_workers to 1 for esm2 tests by @pstjohn in #461
ESM2 Tutorial Updates by @farhadrgh in #426
BugFix: fix bugs on bionemo-size-aware-batching by @guoqing-zhou in #449
Fix typos in geneformer benchmark description by @jstjohn in #470
Pillow version bump into main by @polinabinder1 in #465
Refactor SCDL Row Feature Index for Performance Improvement (Rebased) by @savitha-eng in #466
pin correct tornado requirement by @polinabinder1 in #474
Updating Brev.Dev documentation by @polinabinder1 in #483
Add release notes for v2.1 by @tshimko-nv in #468
Update VERSION by @polinabinder1 in #488
Atom and bond features by @sveccham in #453
Molecule featurizer and molecule graph by @sveccham in #484
hillst/bionemo noodles by @skothenhill-nv in #458
update collate mask_value by @pstjohn in #485
override checkpoint precision by @farhadrgh in #475
JSON -> YAML for CLI by @skothenhill-nv in #436
[QA Bug] Remove NGC dependency by @farhadrgh in #494
Bump 3rdparty/NeMo from e2b0f0e to 06e6703 by @dependabot in #486
Bump 3rdparty/Megatron-LM from 5438d15 to 844119f by @dependabot in #496
change source for coverage report by @pstjohn in #495
Pstjohn/stop and go test non validation by @pstjohn in #476
Add support on num steps for learning rate scheduler by @sichu2023 in #489
Initial compatibility testing images by @malcolmgreaves in #438
Conda-Based Compatibility Test Images by @malcolmgreaves in #507
Instructions on compatibility image build by @malcolmgreaves in #512
Formatting by @malcolmgreaves in #513
Pstjohn/fix ci by @pstjohn in #515
[FEA][webdatamodule]: support webdataset invocable by @DejunL in #501
GH200 support by @gagank1 in #369
Remove quotes for Jupyter command on startup in init guide by @tshimko-nv in #523
Reduce esm2 and geneformer test burden by @sichu2023 in #499
[v2.2] Publish release notes for BioNeMo FW v2.2. by @cspades in #522
Disable validation/test stages in ESM-2 and Geneformer by @sichu2023 in #492
CI HOTFIX: ignore inrun_pytest.sh a notebook by @dorotat-nv in #526
added NeMoLogger unit tests by @dorotat-nv in #511
Bump 3rdparty/Megatron-LM from 844119f to 99f23d2 by @dependabot in #528
[cye/wandb-fix] Fix WandB issue. by @cspades in #530
xFail known bad tests on H100 and fix CVEs by @gagank1 in #547

New Contributors

@sveccham made their first contribution in #432
@sirelkhatim made their first contribution in #447

Full Changelog: v2.1...v2.2

Contributors

jstjohn, malcolmgreaves, and 15 other contributors

21 Nov 00:33

polinabinder1

v2.1

cd4f48a

NVIDIA BioNeMo Framework 2.1

New Features:

ESM2 Implementation
- Updated the ESM-2 Model Card with detailed performance benchmarks comparing BioNeMo2 training against vanilla PyTorch.
- Added an ESM-2 inference endpoint for evaluating pre-trained models.
Size-Aware Batching
- Added SizeAwareBatchSampler, a PyTorch data sampler that batches elements of varying sizes while ensuring that the total size of each batch does not exceed a specified maximum.
- Added BucketBatchSampler, another PyTorch data sampler that groups elements of varying sizes based on predefined bucket ranges and creates batches from the elements within each bucket, so that each batch contains elements of homogeneous size.
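The two batching strategies described above can be sketched in plain Python. This is not the SizeAwareBatchSampler or BucketBatchSampler API: the function names, the `sizeof` callable, and the `bucket_edges` parameter are assumptions made for illustration, and this sketch simply skips elements that fall outside every bucket.

```python
from collections import defaultdict


def size_aware_batches(indices, sizeof, max_total_size):
    """Yield lists of indices whose summed size never exceeds max_total_size."""
    batch, total = [], 0
    for idx in indices:
        size = sizeof(idx)
        if size > max_total_size:
            raise ValueError(f"element {idx} of size {size} exceeds the limit")
        if total + size > max_total_size:
            # Adding this element would overflow the budget: close the batch.
            yield batch
            batch, total = [], 0
        batch.append(idx)
        total += size
    if batch:
        yield batch


def bucket_batches(indices, sizeof, bucket_edges, batch_size):
    """Group indices into [lo, hi) size buckets, then batch within each bucket."""
    buckets = defaultdict(list)
    for idx in indices:
        size = sizeof(idx)
        for lo, hi in zip(bucket_edges[:-1], bucket_edges[1:]):
            if lo <= size < hi:
                buckets[(lo, hi)].append(idx)
                break
        # Elements outside every bucket range are skipped in this sketch.
    for members in buckets.values():
        for start in range(0, len(members), batch_size):
            yield members[start:start + batch_size]


sizes = [3, 5, 2, 7, 4, 1]
print(list(size_aware_batches(range(len(sizes)), sizes.__getitem__, max_total_size=8)))
# -> [[0, 1], [2], [3], [4, 5]]
print(list(bucket_batches(range(len(sizes)), sizes.__getitem__, [0, 4, 8], batch_size=2)))
# -> [[0, 2], [5], [1, 3], [4]]
```

The first strategy bounds the total size of each batch, which helps cap memory use; the second keeps element sizes within a batch similar, which limits padding.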
CLI Support
- Added a pydantic interface for pretraining jobs that parses JSON configuration files and allows passing customized Model and DataModule classes.
- Implemented pydantic configuration for Geneformer and ESM2 pretraining and finetuning.
- Added 'recipes' for generating validated JSON files to be used with the pydantic interface.
- Added installable scripts for the configuration and recipe items above: bionemo-esm2-recipe, bionemo-esm2-train, bionemo-geneformer-recipe, bionemo-geneformer-train.
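The recipe-then-train flow above (generate a validated JSON file, then parse and validate it before training) can be illustrated with a small sketch. To stay dependency-free it uses standard-library dataclasses as a stand-in for pydantic, and every name in it (`TrainConfig`, its fields, `make_recipe`, `load_config`) is hypothetical rather than taken from the BioNeMo schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TrainConfig:
    """Hypothetical schema; the field names are illustrative, not BioNeMo's."""

    model_name: str
    micro_batch_size: int
    num_steps: int
    lr: float = 1e-4

    def __post_init__(self):
        # Minimal checks of the kind a schema library such as pydantic automates.
        if not isinstance(self.model_name, str) or not self.model_name:
            raise ValueError("model_name must be a non-empty string")
        for name in ("micro_batch_size", "num_steps"):
            value = getattr(self, name)
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                raise ValueError(f"{name} must be a positive integer")
        if isinstance(self.lr, bool) or not isinstance(self.lr, (int, float)) or self.lr <= 0:
            raise ValueError("lr must be a positive number")


def make_recipe():
    """'Recipe' step: build a known-good config and serialize it to JSON."""
    return json.dumps(asdict(TrainConfig("example-model", 2, 100)), indent=2)


def load_config(text):
    """'Train' step: parse JSON and validate it before any training starts."""
    return TrainConfig(**json.loads(text))
```

Validating the file up front means a malformed configuration fails immediately with a clear error instead of partway through a job.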
Geneformer support in BioNeMo2:
- Tested pre-training scripts and fine-tuning example scripts that can be used as a starting point for users to create custom derivative models.
- Geneformer 10M and 106M checkpoints, ported from BioNeMo v1 into BioNeMo v2, are available and included in the documentation.
- Added inference scripts.
Documentation
- Added a cell type classification example notebook that covers converting anndata into our internal format, running inference on that data with a Geneformer checkpoint, and making use of the inference results.
- Updated the Getting Started guide and ESM-2 tutorials.
- Added a Frequently Asked Questions (FAQ) page.

Changes

Final October docs edits by @tshimko-nv in #331
Update container location and tag for 2.0 release by @tshimko-nv in #337
Remove broken Release Notes links from v2.0 docs build by @tshimko-nv in #343
Tell pytest to ignore 3rdparty/{NeMo,MegatronLM} by @malcolmgreaves in #61
Add back the removed bionemo-core sub-package by @malcolmgreaves in #25
Fix bionemo-size-aware-batching, standardize pyproject.toml's & dependencies by @malcolmgreaves in #284
Add check bug fix label workflow by @yzhang123 in #250
Adds geneformer overview by @skothenhill-nv in #279
Add ESM2 Dataset and Datamodule by @pstjohn in #78
Test checkpoint IO loss is close to expected. by @jstjohn in #37
fix post-create command by @pstjohn in #152
Drop dependency to internal docs by @farhadrgh in #303
Add initial configuration for mike (version management for docs) by @tshimko-nv in #330
Update ESM2 model card with benchmarks by @pstjohn in #341
Geneformer PEFT by @gwarmstrong in #155
Update initialization in response to VDR by @tshimko-nv in #334
Add GitHub workflow by @ohadmo in #9
Reorganize bionemo-contrib into namespace packages by @malcolmgreaves in #51
Improve ESM2 pretraining tutorial from VDR feedback by @tshimko-nv in #336
install geometric dependencies before invalidating caches with source copy by @pstjohn in #224
ESM2 LoRA by @gwarmstrong in #218
chown /usr/local's dist-packages to allow editing them in the devcontainer by @pstjohn in #111
add search highlight + code copy capabilities by @jwilber in #102
ESM2 implementation by @farhadrgh in #28
Fix broken docs links on mike build by @tshimko-nv in #344
Updates to Getting Started docs by @tshimko-nv in #179
fix post-create command by @pstjohn in #88
refactor doc structure and look by @jwilber in #143
Make ruff check pre-commit hook follow what CI does by @malcolmgreaves in #201
Add bionemo-gemoetric: A component library for PyTorch Geometric Models & Data by @malcolmgreaves in #110
[FEA] size-aware batching: a package for creating mini-batch in a memory consumption-aware manner by @DejunL in #168
ESM2 Finetune bug fix and update by @farhadrgh in #197
add dev tools to devcontainer build by @pstjohn in #210
places the ptl artifacts ignore lines to the root directory only. by @skothenhill-nv in #21
Jared/v2 main/nvidia styles by @jwilber in #101
rename bionemo-fw-ea to bionemo-framework by @yzhang123 in #292
Add BERT-style masking function by @pstjohn in #55
Add perplexity logging by @sichu2023 in #144
support nsys profiling on ESM2, add downstream improvements to hit P0 perf by @sichu2023 in #300
trivial commit to bionemo2 by @broland-hat in #19
Add geneformer bionemo1 disclaimer by @jstjohn in #278
Split out the lightning example tutorial by @jstjohn in #67
Move v2 commits over. by @jstjohn in #8
Add documentation covering megatron and code structure rationalle by @jstjohn in #153
try out gh page url to resolve 404 error by @jwilber in #233
lowercase file name so mkdocs picks up correctly by @jwilber in #173
use importlib resources for files by @pstjohn in #178
add nemo-run as a git submodule by @pstjohn in #186
Add module for loading test data. by @pstjohn in #120
LightningDataModule for webdataset by @DejunL in #100
Update dependency tags to match PR #36, and try to fix test failure by @jstjohn in #39
Change to gelu default from relu which is what we actually used before by @jstjohn in #20
Jwilber/load nb from subpackages by @jwilber in #128
Use github runners to run pre-commit hooks by @pstjohn in #42
Bump 3rdparty/NeMo from ff7c614 to 8f0d0c7 by @dependabot in #145
Add a tested function to see if model parallel is enabled by @jstjohn in #175
Handle special tokens in the bert masking function by @pstjohn in #99
Fix all license headers to Apache by @trvachov in #347
add dependabot file by @pstjohn in #161
Checkpointing example with Geneformer by @skothenhill-nv in #24
epoch-level shuffling in ESM2 dataset by @pstjohn in #150
Bump 3rdparty/Megatron-LM from 0bda578 to 08e80b0 by @dependabot in #183
move CI scripts to central location by @pstjohn in #131
setuptools sub-package local vs. publish by @malcolmgreaves in #133
Nested weight munging fine-tuning/continue training example and test for example model and geneformer. by @jstjohn in #97
ESM2 Golden Value Testing by @farhadrgh in #85
Add pretraining documentation by @sichu2023 in #283
Wandb integration by @olachinkei in #205
Fix address in docs by @farhadrgh in #297
update branch name bionemo2 by @dorotat-nv in #160
Updated README docum...

Contributors

jstjohn, malcolmgreaves, and 20 other contributors

23 Oct 21:54

tshimko-nv

v2.0

291d0ac

NVIDIA BioNeMo Framework 2.0

New Features:

ESM2 implementation
- State-of-the-art training performance and equivalent accuracy to the reference implementation.
- 650M and 3B scale checkpoints are available, mirroring the reference model.
- Flexible fine-tuning examples that can be copied and modified to accomplish a wide variety of downstream tasks.
First version of our NeMo v2-based reference implementation, which re-imagines BioNeMo as a repository of Megatron models, dataloaders, and training recipes that use NeMo v2 for training loops.
- Modular design and a permissive Apache 2 OSS license enable the import and use of our framework in proprietary applications.
- NeMo2 training abstractions allow the user to focus on the model implementation while the training strategy handles distribution and model parallelism.
Documentation and documentation build system for BioNeMo 2.

Known Issues:

PEFT support is not yet fully functional.
A partial implementation of Geneformer is present; use it at your own risk. It will be optimized and officially released in the future.
The command line interface is currently based on one-off training recipes and scripts. We are working on a configuration-based approach that will be released in the future.
The fine-tuning workflow is implemented for BERT-based architectures and could be adapted for others, but it requires you to inherit from the biobert base model config. You can follow similar patterns in the short term to load weights from an old checkpoint partially into a new model; however, in the future we will have a more direct API that is easier to follow.
A slow memory leak occurs during ESM-2 pretraining, which can cause out-of-memory (OOM) errors during long pretraining runs. Training with a microbatch size of 48 on 40 A100s raised an OOM error after 5,800 training steps.
- Possible workarounds include calling gc.collect() and torch.cuda.empty_cache() roughly every 1,000 steps, which appears to reclaim the consumed memory, or training with a lower microbatch size and periodically restarting training from a saved checkpoint.
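The first workaround above could be wired into a training loop roughly as follows. This is a sketch under stated assumptions, not the project's fix: `reclaim_memory` and `every_n_steps` are hypothetical names, and the PyTorch import is optional so that the snippet still runs on a machine without PyTorch or a GPU.

```python
import gc

try:
    import torch
except ImportError:  # keeps the sketch importable without PyTorch installed
    torch = None


def reclaim_memory(step, every_n_steps=1000):
    """Run garbage collection and release cached CUDA memory every N steps.

    Returns True when a cleanup ran, False otherwise.
    """
    if step == 0 or step % every_n_steps != 0:
        return False
    # Drop unreachable Python objects first so their tensors can be freed.
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        # Return cached, unused blocks from PyTorch's allocator to the driver.
        torch.cuda.empty_cache()
    return True
```

Calling `reclaim_memory(step)` once per training step keeps the cleanup cost to one pass every `every_n_steps` steps.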

External Partner Contributions

We would like to thank the following organizations for their insightful discussions guiding the development of the BioNeMo Framework and their valuable contributions to the codebase. We are grateful for your collaboration.

Changes

Add GitHub workflow by @ohadmo in #9
Move v2 commits over. by @jstjohn in #8
Jstjohn/fix geneformer multinode by @jstjohn in #17
places the ptl artifacts ignore lines to the root directory only. by @skothenhill-nv in #21
ESM2 implementation by @farhadrgh in #28
Update dependency tags to match PR #36, and try to fix test failure by @jstjohn in #39
Test checkpoint IO loss is close to expected. by @jstjohn in #37
Change to gelu default from relu which is what we actually used before by @jstjohn in #20
Make artifact downloads more robust by @pstjohn in #41
Add devcontainer config for bionemo2 by @pstjohn in #5
Add license check to pre-commit hook by @ohadmo in #22
Use github runners to run pre-commit hooks by @pstjohn in #42
Add back the removed bionemo-core sub-package by @malcolmgreaves in #25
trivial commit to bionemo2 by @broland-hat in #19
Add mamba as a dependency in the dockerfile by @pstjohn in #44
Add future TE support and mixed precision support to biobert test by @jstjohn in #43
Add trufflehog as a github action check by @pstjohn in #45
Adds CONTRIBUTING, CODE-REVIEW guides and pull request template by @malcolmgreaves in #10
Use precision lowest value instead of -torch.inf by @farhadrgh in #35
Add NeMo and Megatron-LM as git submodules by @pstjohn in #52
Add a CLI option to restore training from a nemo1 checkpoint by @jstjohn in #54
Add some additional ruff checks, ignoring existing violations by @pstjohn in #56
Reorganize bionemo-contrib into namespace packages by @malcolmgreaves in #51
Update devcontainer for new package structure by @pstjohn in #62
Tell pytest to ignore 3rdparty/{NeMo,MegatronLM} by @malcolmgreaves in #61
Clean up src vs test mirroring rule violations. by @jstjohn in #66
fixing devcontainer target by @pstjohn in #64
adding merge_group to existing actions by @pstjohn in #71
Split out the lightning example tutorial by @jstjohn in #67
Reconfigure the pre-commit workflow by @pstjohn in #63
convert root_directory to a field with default_factory by @pstjohn in #58
Checkpointing example with Geneformer by @skothenhill-nv in #24
Updates to devcontainer by @skothenhill-nv in #77
Adding license, and contributing guidelines from #72 and #65 by @jstjohn in #74
adding some additional docstrings by @pstjohn in #81
Pin ptl to <2.4.0 to fix nemo bug by @pstjohn in #86
Add documentation build system for BioNeMo v2 by @pstjohn in #40
Add BERT-style masking function by @pstjohn in #55
fix post-create command by @pstjohn in #88
Pbinder/move scdl by @polinabinder1 in #76
Add ESM2 Dataset and Datamodule by @pstjohn in #78
Upgrade nemo and megatron, and fix configs to reflect the change by @jstjohn in #92
Bump 3rdparty/Megatron-LM from 104d864 to cf0f9b2 by @dependabot in #96
ESM2 Golden Value Testing by @farhadrgh in #85
fixing version issue by @polinabinder1 in #90
adding github action for docs deployment by @pstjohn in #98
Jared/v2 main/nvidia styles by @jwilber in #101
Handle special tokens in the bert masking function by @pstjohn in #99
add search highlight + code copy capabilities by @jwilber in #102
add internal link for devcontainer cache by @pstjohn in #105
Fix Geneformer huggingface links by @ohadmo in #106
Fixing secuirty scan vulnerabilities by @ohadmo in #104
add jupyter notebook support in documentation by @pstjohn in #109
Adding Dataloading Test cases and documentation by @polinabinder1 in #107
Bump 3rdparty/NeMo from e6c0e72 to ff7c614 by @dependabot in #103
Pbinder/readme modify by @polinabinder1 in #115
Promote nltk version to address GHSA-cgvx-9447 by @ohadmo in #114
moving test data around by @polinabinder1 in #118
Bump 3rdparty/Megatron-LM from cf0f9b2 to ef85bc9 by @dependabot in #124
Establish CODEOWNERS for bionemo2 by @malcolmgreaves in #121
chown /usr/local's dist-packages to allow editing them in the devcontainer by @pstjohn in #111
Stop and Go harness and tests for geneformer and GPT. by @skothenhill-nv in #116
Bump NeMo/Mcore by @skothenhill-nv in #127
Complete ESM2 pretraining by @sichu2023 in #112
LightningDataModule for webdataset by @DejunL in https://github.com/NVIDIA/bionemo-framework/pull...

Contributors

kkersten, jstjohn, and 22 other contributors

23 Oct 21:54

tshimko-nv

v1.10

9ba9b2c

NVIDIA BioNeMo Framework 1.10

Changes

Migrated development from NVIDIA internal to GitHub
License changed from NVIDIA proprietary to Apache 2.0
The 1.10 release is functionally equivalent to the 1.9 release; previous release notes can be found in the documentation directory of the GitHub repository.

Releases: NVIDIA/bionemo-framework
