
Releases: NVIDIA/bionemo-framework

NVIDIA BioNeMo Framework v2.2

20 Dec 16:30
fd441fa

New Features

  • Small Molecule Featurization
    • Implemented elementary and advanced atom, bond, and full molecule featurizers.
  • GH200 Support for BioNeMo
    • Added a Dockerfile.arm that builds a BioNeMo container that runs on GH200 machines.
    • Published a multi-architecture version of the BioNeMo container to NGC.

Updates & Improvements

  • Single-Cell Dataloader (SCDL)
    • Changed metadata storage to parquet files, yielding a 30x speed-up when iterating over large datasets.
    • Added functionality to concatenate several AnnData files without doubling disk usage.
  • ESM2
    • Added support for SIGTERM preemption checkpoint saving.
    • Moved ESM-2 and Geneformer training scripts to new executables, train_esm2 and train_geneformer, respectively.
    • Moved inference script to a new executable infer_esm2, and deprecated the inference example in the fine-tuning tutorial.
    • Added new Jupyter notebook tutorials for inference and zero-shot protein design. These notebooks can be deployed on cloud resources as a brev.dev launchable.
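The SIGTERM preemption pattern mentioned above can be sketched roughly as follows. This is an illustrative sketch only, not the actual BioNeMo/ESM-2 implementation; `PreemptionHandler`, `step_end`, and the callback are hypothetical names:

```python
import signal


class PreemptionHandler:
    """Minimal sketch: latch SIGTERM, then save a checkpoint at the
    next safe step boundary instead of inside the signal handler."""

    def __init__(self):
        self.preempted = False
        # Register for SIGTERM, which cluster schedulers typically send
        # before forcibly killing a preempted job.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Signal handlers must stay tiny; just set a flag and defer the work.
        self.preempted = True

    def step_end(self, save_checkpoint):
        """Call at the end of each training step; saves and reports True
        if a preemption signal arrived since the last check."""
        if self.preempted:
            save_checkpoint()
            return True
        return False
```

In a training loop, `step_end` would be called once per step with the trainer's checkpoint-saving callback, so the job exits cleanly with a resumable checkpoint when the scheduler preempts it.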

Known Issues

  • Loading a checkpoint for Geneformer inference on H100 has a known accuracy regression. A fix is in progress for the next release.

Full Changelog: v2.1...v2.2

NVIDIA BioNeMo Framework 2.1

21 Nov 00:33
cd4f48a

New Features

  • ESM2 Implementation
    • Updated the ESM-2 Model Card with detailed performance benchmarks comparing BioNeMo2 training against vanilla PyTorch.
    • Added an ESM-2 inference endpoint for evaluating pre-trained models.
  • Size-Aware Batching
    • Added SizeAwareBatchSampler, a PyTorch data sampler that batches elements of varying sizes while ensuring that the total size of each batch does not exceed a specified maximum.
    • Added BucketBatchSampler, another PyTorch data sampler that groups elements of varying sizes into predefined bucket ranges and creates batches from one bucket at a time, so that each batch contains elements of homogeneous size.
  • CLI Support
    • Added a pydantic interface for pretraining jobs that parses JSON configuration files and supports passing customized Model and DataModule classes.
    • Implemented pydantic configuration for Geneformer and ESM2 pretraining and finetuning.
    • Added 'recipes' for generating validated JSON files to be used with pydantic interface.
    • Added installable scripts for the recipe and training interfaces: bionemo-esm2-recipe, bionemo-esm2-train, bionemo-geneformer-recipe, and bionemo-geneformer-train.
  • Geneformer support in BioNeMo2:
    • Tested pre-training scripts and fine-tuning example scripts that can be used as a starting point for users to create custom derivative models.
    • Geneformer 10M and 106M checkpoints, ported from BioNeMo v1 to BioNeMo v2, are available and included in the documentation.
    • Added inference scripts.
  • Documentation
    • Added a cell type classification example notebook covering how to convert AnnData into our internal format, run inference on that data with a Geneformer checkpoint, and make use of the inference results.
    • Updated the Getting Started guide and ESM-2 tutorials.
    • Added Frequently Asked Questions (FAQ) page
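As an illustration of the size-aware batching idea described above, the greedy grouping can be sketched as follows. This is a minimal sketch of the concept, not the actual SizeAwareBatchSampler implementation; `size_aware_batches` is a hypothetical name:

```python
from typing import Iterator, List


def size_aware_batches(sizes: List[int], max_total: int) -> Iterator[List[int]]:
    """Greedily group element indices so that the total size of each
    batch stays <= max_total. Elements individually larger than
    max_total are skipped here (a real sampler might raise instead)."""
    batch: List[int] = []
    total = 0
    for idx, size in enumerate(sizes):
        if size > max_total:
            continue  # cannot fit in any batch
        if total + size > max_total:
            yield batch  # flush the current batch before overflowing
            batch, total = [], 0
        batch.append(idx)
        total += size
    if batch:
        yield batch  # emit the final partial batch
```

For example, `list(size_aware_batches([3, 4, 2, 5, 1], max_total=6))` yields `[[0], [1, 2], [3, 4]]`; every batch's total size stays within the cap, which is the invariant a size-aware sampler enforces to avoid OOM on variable-length inputs.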


NVIDIA BioNeMo Framework 2.0

23 Oct 21:54
291d0ac

New Features

  • ESM2 implementation
    • State-of-the-art training performance and accuracy equivalent to the reference implementation.
    • 650M and 3B scale checkpoints available that mirror the reference model.
    • Flexible fine-tuning examples that can be copied and modified to accomplish a wide variety of downstream tasks
  • First version of our NeMo v2-based reference implementation, which reimagines BioNeMo as a repository of Megatron models, dataloaders, and training recipes that use NeMo v2 for training loops.
    • Modular design and a permissive Apache 2.0 OSS license enable the import and use of our framework in proprietary applications.
    • NeMo2 training abstractions allow the user to focus on the model implementation while the training strategy handles distribution and model parallelism.
  • Documentation and documentation build system for BioNeMo 2.

Known Issues:

  • PEFT support is not yet fully functional.
  • A partial implementation of Geneformer is present; use it at your own risk. It will be optimized and officially released in a future version.
  • The command line interface is currently based on one-off training recipes and scripts. A configuration-based approach is in development and will be released in the future.
  • The fine-tuning workflow is implemented for BERT-based architectures and could be adapted for others, but it requires inheriting from the BioBERT base model config. In the short term, similar patterns can be followed to partially load weights from an old checkpoint into a new model; a more direct, easier-to-follow API is planned.
  • A slow memory leak occurs during ESM-2 pretraining, which can cause OOM errors during long pretraining runs. Training with a microbatch size of 48 on 40 A100s raised an out-of-memory error after 5,800 training steps.
    • Possible workarounds include calling gc.collect(); torch.cuda.empty_cache() every ~1,000 steps, which appears to reclaim the consumed memory, or training with a lower microbatch size and periodically restarting training from a saved checkpoint.
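The periodic-cleanup workaround above can be wrapped in a small helper called once per training step. This is a hypothetical sketch, not part of the framework; `maybe_reclaim_memory` and the interval default are illustrative:

```python
import gc

CLEANUP_INTERVAL = 1000  # steps between manual cleanups (tunable)


def maybe_reclaim_memory(step: int, interval: int = CLEANUP_INTERVAL) -> bool:
    """Every `interval` steps, run Python garbage collection and, if a GPU
    is present, release cached CUDA allocations back to the driver.
    Returns True when a cleanup was performed."""
    if step == 0 or step % interval != 0:
        return False
    gc.collect()
    try:
        import torch  # optional: only needed for the CUDA cache release
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # torch not installed; gc alone still reclaims Python objects
    return True
```

Calling this at step boundaries keeps the cleanup cost amortized; running it every step would add noticeable synchronization overhead for little benefit.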

External Partner Contributions

We would like to thank the following organizations for their insightful discussions guiding the development of the BioNeMo Framework and their valuable contributions to the codebase. We are grateful for your collaboration.


NVIDIA BioNeMo Framework 1.10

23 Oct 21:54
9ba9b2c

Changes

  • Migrated development from NVIDIA internal to GitHub
  • License changed from NVIDIA proprietary to Apache 2.0
  • The 1.10 release is functionally equivalent to the 1.9 release; previous release notes can be found in the documentation directory of the GitHub repository