
Megatron Bridge


Recipes | Examples | Contributing

Overview

Megatron Bridge is a PyTorch-native library in the NeMo Framework that builds on megatron-core to deliver state-of-the-art training throughput for popular models. It enables researchers and community developers to run both pre-training and post-training with a performant, scalable training loop that supports model parallelism and mixed precision (FP8, BF16, FP4, etc.). Users can either start from existing πŸ€—Hugging Face models or supply their own PyTorch model definitions for flexible end-to-end workflows.

πŸ”§ Installation

🐳 NeMo-FW container

For the best experience, highest performance, and full feature support, use the NeMo Framework container. Fetch the most recent $TAG and run the following command to start a container:

docker run --rm -it -w /workdir -v $(pwd):/workdir \
  --entrypoint bash \
  --gpus all \
  nvcr.io/nvidia/nemo:${TAG}

πŸ“¦ Bare metal install with TransformerEngine

TransformerEngine is a required dependency of Megatron Bridge. To install on bare metal (without a container), the following system requirements must be met:

  • PyTorch >= 2.7
  • CUDA >= 12.8
  • cuDNN >= 9.3

We recommend installing the same versions that are present in the latest NGC PyTorch containers. The versions of these components for each container release can be found in the PyTorch and CUDA container release notes.

Please see these instructions for installing cuDNN for your target platform. You can check whether the CUDA toolkit and cuDNN are installed with:

dpkg -l | grep 'cuda-toolkit'
dpkg -l | grep 'cudnn.*cuda'
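
You can also confirm from Python which versions PyTorch itself was built against. This is a minimal sanity check using standard torch APIs, with the minimum versions from the list above noted in comments:

# Sanity-check the versions PyTorch was built against
import torch

print("PyTorch:", torch.__version__)                 # needs >= 2.7
print("CUDA:   ", torch.version.cuda)                # needs >= 12.8
print("cuDNN:  ", torch.backends.cudnn.version())    # integer, e.g. 90300 corresponds to 9.3.0
print("GPU available:", torch.cuda.is_available())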

You can then run the following to install Megatron Bridge:

pip install torch setuptools pybind11 wheel_stub  # Required for TE
pip install --no-build-isolation megatron-bridge

uv

To install Megatron Bridge with uv, please refer to our Contributing guide.

⚑ Quickstart

To get started, first install Megatron Bridge or pull the NeMo Framework container as described above.

Log in to HuggingFace Hub:

huggingface-cli login --token <your token>

You can then run the following to import a model from HuggingFace and start training with mock data:

from megatron.bridge import AutoBridge

import megatron.bridge.recipes.llama.llama32_1b as llama32_1b
from megatron.bridge.training.gpt_step import forward_step
from megatron.bridge.training.pretrain import pretrain

if __name__ == "__main__":
    # Load Llama from HuggingFace Hub and convert to Megatron
    bridge = AutoBridge.from_hf_pretrained("meta-llama/Llama-3.2-1B")
    model_provider = bridge.to_megatron_provider()

    # Get defaults for other configuration from an existing Llama 3.2 recipe
    cfg = llama32_1b.pretrain_config()
    cfg.model = model_provider
    cfg.train.train_iters = 10

    cfg.dataset.sequence_length = cfg.model.seq_length
    cfg.tokenizer.vocab_size = cfg.model.vocab_size

    pretrain(cfg, forward_step)

You can launch the above script with:

torchrun --nproc-per-node=<num devices> /path/to/script.py

πŸš€ Key Features

  • Bridge with πŸ€—Hugging Face: Seamless bidirectional conversion between πŸ€—Hugging Face and Megatron formats for interoperability; see the sketch after this list (model bridges, auto bridge, conversion examples)
  • Flexible to Customize: Lightweight custom training loop making it easy to configure custom logic in data loading, distributed training, checkpointing, evaluation and logging (training framework, training utilities)
  • Supervised & Parameter-Efficient Finetuning: SFT & PEFT implementation tailored for Megatron-based models that supports LoRA, DoRA, and user-defined PEFT methods (PEFT implementations, finetune module, SFT dataset)
  • SoTA Training Recipes: Pre-configured production-ready training recipes for popular models like Llama 3, with optimized hyperparameters and distributed training configuration (Llama recipes, recipe examples)
  • Performance Optimization: Built-in support for FP8 training, model parallelism, and memory-efficient techniques that deliver high utilization and near-linear scalability to thousands of nodes (mixed precision, communication overlap, optimizer utilities)
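
To make the first bullet concrete, the round trip between πŸ€—Hugging Face and Megatron formats looks roughly as follows. AutoBridge.from_hf_pretrained and to_megatron_provider appear in the Quickstart above; the export call in the other direction is shown only as an assumption and may differ in name, so check the model bridge documentation for the exact API:

from megatron.bridge import AutoBridge

# Hugging Face -> Megatron: load a HF checkpoint and obtain a Megatron model provider
bridge = AutoBridge.from_hf_pretrained("meta-llama/Llama-3.2-1B")
model_provider = bridge.to_megatron_provider()

# Megatron -> Hugging Face: export trained weights back to HF format.
# The call below is an assumption for illustration only; consult the model
# bridge documentation for the actual export API and signature.
# bridge.save_hf_pretrained(megatron_model, "./llama32_1b_hf_export")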

Supported Models

Megatron Bridge provides out-of-the-box recipes for a wide range of models, built on top of base model architectures from megatron-core:

Large Language Models

Model     | Style | Sizes         | Pretrain | SFT & LoRA
Llama 3   | GPT   | 8b, 70b       | βœ…       | APIs available, recipes upcoming
Llama 3.1 | GPT   | 8b, 70b, 405b | βœ…       | APIs available, recipes upcoming
Llama 3.2 | GPT   | 1b, 3b        | βœ…       | APIs available, recipes upcoming

Launching Recipes

All recipes are ready to train out of the box, using mock data by default. For an example of how to override the default configuration through YAML or Hydra-style CLI overrides, see this script, which can then be launched with torchrun. For example:

torchrun --nproc-per-node=2 pretrain_llama3_8b.py model.tensor_model_parallel_size=1 <additional overrides ...>
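
Conceptually, such a script builds a recipe config and applies the dotted overrides before calling pretrain. The sketch below illustrates the idea with plain setattr on the Llama 3.2 1B recipe from the Quickstart; the repository's example script relies on the library's own YAML/CLI override utilities, so treat the argument parsing here as a simplified assumption:

# Simplified illustration only; the real example script uses the library's override utilities
import sys

import megatron.bridge.recipes.llama.llama32_1b as llama32_1b
from megatron.bridge.training.gpt_step import forward_step
from megatron.bridge.training.pretrain import pretrain


def apply_override(cfg, dotted_key, raw_value):
    # Walk keys like "model.tensor_model_parallel_size" down the config object
    *parents, leaf = dotted_key.split(".")
    target = cfg
    for name in parents:
        target = getattr(target, name)
    # Naive cast to the existing attribute's type (real override parsing is more robust)
    setattr(target, leaf, type(getattr(target, leaf))(raw_value))


if __name__ == "__main__":
    cfg = llama32_1b.pretrain_config()
    for arg in sys.argv[1:]:  # e.g. train.train_iters=10
        key, _, value = arg.partition("=")
        apply_override(cfg, key, value)
    pretrain(cfg, forward_step)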

Optionally, Megatron Bridge also supports launching with NeMo-Run. See the following examples for reference:

These examples can also be run as is with the Llama 3 8b recipe (with NeMo-Run installed).

Launch Llama 3 8b Pretraining with NeMo-Run's run.Script:

uv run python pretrain_llama3_8b_nemo_run_script.py \
    --nproc-per-node=2 \
    model.pipeline_model_parallel_size=1 \
    train.train_iters=10 # this script passes Hydra-style overrides to the target script

Launch Llama 3 8b Pretraining with NeMo-Run's run.Partial

uv run python pretrain_llama3_8b_nemo_run_partial.py \
    --nproc-per-node=2
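
For reference, a launcher script of this kind might look roughly like the sketch below. It assumes NeMo-Run's run.Script, run.LocalExecutor, and run.run entry points and a hypothetical script path, so consult the linked examples for the exact usage:

# Hedged sketch, not the repository's example script
import nemo_run as run

if __name__ == "__main__":
    # Wrap the target training script; extra args are forwarded as Hydra-style overrides
    script = run.Script(
        path="/path/to/pretrain_llama3_8b.py",
        args=["model.pipeline_model_parallel_size=1", "train.train_iters=10"],
    )
    # Launch locally with torchrun across 2 processes (one per GPU)
    executor = run.LocalExecutor(ntasks_per_node=2, launcher="torchrun")
    run.run(script, executor=executor, name="llama3_8b_pretrain")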

Performance Benchmarks

Coming soon ...

Project Structure

Megatron-Bridge/
β”œβ”€β”€ examples/
β”‚   β”œβ”€β”€ models/                  # Bridge usage examples
β”‚   └── recipes/                 # Training examples
β”œβ”€β”€ src/megatron/bridge/
β”‚   β”œβ”€β”€ data/                    # Dataloaders and iterators
β”‚   β”œβ”€β”€ models/                  # HuggingFace bridge infrastructure and model-specific implementations
β”‚   β”‚   β”œβ”€β”€ llama/               # Llama model providers
β”‚   β”‚   └── .../                 # Other models (gpt, t5, etc.)
β”‚   β”œβ”€β”€ peft/                    # PEFT transformations and wrappers
β”‚   β”œβ”€β”€ recipes/                 # Complete training recipes
β”‚   β”œβ”€β”€ training/                # Training loop components
β”‚   β”‚   β”œβ”€β”€ tokenizers/          # Tokenizer library
β”‚   β”‚   └── utils/               # Training-specific utilities
β”‚   └── utils/                   # Generic utilities for repo-wide usage
└── tests/                       # Comprehensive test suite

Contributing

We welcome community contributions! Please see our Contributor Guidelines for more information on how to get involved.
