Updated Megatron version #85 (Draft)

Wants to merge 427 commits into base: nvidia_main

Commits (427)
44a8f18
Merge branch 'main' into mblaz/dist-ckpt-layernorms
mikolajblaz Jan 15, 2024
c7d0fb1
Include module parameters in default sharded_state_dict
mikolajblaz Jan 12, 2024
7bcb2e1
Integrate one-logger api for E2E app metrics tracking
PytLab Jan 15, 2024
97d9a50
Set --enable-onelogger action to 'store_true'
PytLab Jan 15, 2024
6e7ded3
Merge branch 'sliding_window_attention/akoumparouli' into 'main'
jaredcasper Jan 16, 2024
46ca3db
Refactor DistributedOptimizer for MoE model support
shjwudp Jan 17, 2024
d657a3e
Merge branch 'distopt_with_moe' into 'main'
deepakn94 Jan 17, 2024
6083743
Run black on megatron/optimizer
deepakn94 Jan 17, 2024
17545b3
Remove hardcoded data cache path
PytLab Jan 18, 2024
6c0e7a9
Change --enable-onelogger to --enable-one-logger for consistent naming
PytLab Jan 18, 2024
bf9c0a1
Add ImportError catch for one_logger
PytLab Jan 18, 2024
85c4034
Add message on how to install one_logger
PytLab Jan 18, 2024
54de98d
Better code formatting
PytLab Jan 18, 2024
909bda3
Fixed merge conflicts
Jan 18, 2024
3c44fb9
add is_first_microbatch for TE
jiemingz Jan 10, 2024
27879a7
add arg name
jiemingz Jan 10, 2024
7dc2ee8
add docstring and move set_is_first_microbatch
jiemingz Jan 12, 2024
3e19c76
Fixed formatting
Jan 18, 2024
bed60a8
Merge branch 'jiemingz/is_first_microbatch' into 'main'
jaredcasper Jan 19, 2024
cf1a1c6
fix a bug in branch and format
Jan 19, 2024
036605d
Merge branch 'main' into fuse_rope_swiglu_main
Jan 19, 2024
568da5a
fix tests
Jan 19, 2024
140642c
Merge branch megatron-lm:main into atomic_gemm_switch
sanandaraj5597 Jan 19, 2024
de9428a
enable swiglu and rope fusion by default and disable them in tests
Jan 19, 2024
599f558
Merge branch 'atomic_gemm_switch' into 'main'
jaredcasper Jan 19, 2024
ca8a00a
Merge branch 'mblaz/dist-ckpt-layernorms' into 'main'
jaredcasper Jan 19, 2024
79269fa
Docstring removed for context config
Jan 19, 2024
4b05862
Decoupled cpu offloading and SplitAlongDim imports
Jan 19, 2024
a5165ac
Merge branch 'cpu_offload' into 'main'
jaredcasper Jan 19, 2024
640af6b
Merge branch 'fuse_rope_swiglu_main' into 'main'
jaredcasper Jan 19, 2024
473225f
Add jit_fuser to switch between torch.jit.script and torch.compile
Jan 19, 2024
de4028a
Merge branch 'jaeminc/mcore-jit' into 'main'
jaredcasper Jan 19, 2024
716204e
misc
jlamypoirier Jan 19, 2024
8c2cd99
Merge branch 'black_on_optimizer' into 'main'
jaredcasper Jan 20, 2024
c795038
Router and communication refactoring.
yanring Dec 14, 2023
2016969
Add Z-loss and aux loss. Code cleanup.
yanring Dec 15, 2023
9b5cd88
Code clean.
yanring Dec 18, 2023
dc436f2
Add top-k router and documentation.
yanring Dec 18, 2023
a98c5ba
Add UT. Fix top-k >1 when EP is off.
yanring Dec 26, 2023
0f80408
Normalize the token scores.
yanring Dec 26, 2023
de37485
Code clean.
yanring Dec 26, 2023
8efc8de
Fix moe aux loss.
yanring Dec 26, 2023
15e75b0
Fix UTs; Fix MoE Loss.
yanring Dec 28, 2023
dd0411b
Add Z loss UT.
yanring Dec 28, 2023
bfb7bbd
Add documentation.
yanring Jan 2, 2024
b506152
Add typing check.
yanring Jan 2, 2024
411bc27
Update CI.
yanring Jan 3, 2024
1ab146c
Fix grouped gemm UT.
yanring Jan 4, 2024
6d702cb
Compatible with previous MoE checkpoints.
yanring Jan 5, 2024
c656553
Fix Z Loss.
yanring Jan 7, 2024
8b41c9f
Merge the Sinkhorn and top-k routing.
yanring Jan 7, 2024
196b911
Update CI golden values.
yanring Jan 7, 2024
3ff8c7f
Swap topk and softmax.
yanring Jan 10, 2024
1ce5712
Update CI after rebasing.
yanring Jan 11, 2024
09accc8
Fix loss scale documentation and remove unused code
yanring Jan 15, 2024
5d0dbd3
Rename base_moe_layer.py to router.py
yanring Jan 15, 2024
a003610
Fix review comments.
yanring Jan 17, 2024
e2d3e4f
Renaming.
yanring Jan 19, 2024
b616497
Renaming.
yanring Jan 19, 2024
2038324
Move dispatcher and experts.
yanring Jan 20, 2024
eb47d69
Update CI golden value.
yanring Jan 20, 2024
3da7d1d
Rename to token_permutation and SequentialMLP.
yanring Jan 20, 2024
2afee76
Code clean.
yanring Jan 21, 2024
aed469f
Fix CI, Code clean and add readme.
yanring Jan 22, 2024
f1b6c96
Add input jitter.
yanring Jan 22, 2024
f24abd1
Moved offloading configs to Model parallel config from TF config
Jan 22, 2024
288134e
Fixed formatting and imports
Jan 22, 2024
1872385
Update retro doc
boxin-wbx Jan 22, 2024
8fb44df
Log progress (iterations, floating-point operations, tokens) to progr…
deepakn94 Dec 1, 2023
781d86a
Hide progress logging behind a command-line argument
deepakn94 Jan 22, 2024
be8011a
Merge branch 'progress' into 'main'
deepakn94 Jan 22, 2024
b03eae3
Updated CI value after removing kaiming_init.
yanring Jan 23, 2024
d2e5f78
Add one_logger commandline arguments
PytLab Jan 23, 2024
62a5a3e
Remove one_logger config file
PytLab Jan 23, 2024
49727de
Hardcode train_iterations_warmup to 5
PytLab Jan 23, 2024
0cb693a
Add clarification for internal one_logger
PytLab Jan 23, 2024
ae1cd89
Fix SwiGLU for input dimension 2 after rebased main.
yanring Jan 23, 2024
ebb1484
Update retro doc following the suggestion of Wei and Lawrence
boxin-wbx Jan 23, 2024
7298d15
Add distributed optimizer tests with --overlap-param-gather (and corr…
deepakn94 Jan 20, 2024
33111c9
Fix bug causing issues with fp16 and --overlap-param-gather by disabl…
deepakn94 Jan 20, 2024
f634cca
Add softmax for sinkhorn when k > 1.
yanring Jan 24, 2024
75120db
Merge branch 'fp16_overlap_param_gather' into 'main'
deepakn94 Jan 24, 2024
9e773fa
Change default value of --one-logger-run-name to None
PytLab Jan 24, 2024
95b2146
Packed Sequence
cuichenx Jan 24, 2024
773ad0f
Merge branch 'chcui/packed_seq_from_fuse_rope_swiglu_main' into 'main'
jaredcasper Jan 24, 2024
2c3468a
Merge branch 'documentation' into 'main'
deepakn94 Jan 24, 2024
51e936c
Merge branch megatron-lm:main into offload_patch
sanandaraj5597 Jan 24, 2024
83c0423
Add replica_id field to factories
mikolajblaz Jan 5, 2024
00358e5
Implement sharded_state_dict for SwitchMLP
mikolajblaz Jan 4, 2024
431ce99
Handle MoE with GeLU
mikolajblaz Jan 5, 2024
e2fd6ca
Add __init__ to resolve test name clash
mikolajblaz Jan 18, 2024
bd6f4ea
Merge branch 'boxin/retro-doc-fix' into 'main'
jaredcasper Jan 24, 2024
1e0e58e
Merge branch 'main' into compare_tensors_updated
jlamypoirier Jan 24, 2024
472d54e
Only print warning about fused rotary position embedding once.
jaredcasper Jan 24, 2024
98fbb42
Fix
jlamypoirier Jan 24, 2024
37e7dac
Merge branch 'fused-warning-fix' into 'main'
jaredcasper Jan 24, 2024
c4678ff
Update s_app_tag with {job_name}_{batch_size}_{gpu_req}
PytLab Jan 25, 2024
817b431
Merge branch 'offload_patch' into 'main'
jaredcasper Jan 25, 2024
de859b3
Log metrics in consistent order
PytLab Jan 25, 2024
7027a1d
Add app_tag_count tracking
PytLab Jan 25, 2024
a72388d
Merge branch 'feature/add-e2e-metrics-logging' of ssh://gitlab-master…
Jan 25, 2024
8344203
Resolve merging conflict
Jan 25, 2024
7af41ab
Use app tag logging wrapper api
PytLab Jan 25, 2024
e713cd7
Remove app_tag global var
PytLab Jan 25, 2024
9603e1f
Merge branch 'main' into feature/add-e2e-metrics-logging
PytLab Jan 25, 2024
fdafcc5
Add doc
mikolajblaz Jan 25, 2024
c40c047
Add no support info
mikolajblaz Jan 25, 2024
e25970f
Adding bert local spec test
Jan 25, 2024
2b0decc
Adding bert local spec test
Jan 25, 2024
559e82c
Merge branch 'zijiey/moe_api_clean' into 'main'
jaredcasper Jan 25, 2024
e6ef9ea
Adding bert local spec test
Jan 25, 2024
c2d44ff
Adding bert local spec test
Jan 26, 2024
fc316ff
Adding bert local spec test
Jan 26, 2024
8578800
update `apply_rope_fusion` in config after checking availability
cuichenx Jan 26, 2024
6e599dc
Adding bert local spec test
Jan 26, 2024
1e95136
add unit tests
cuichenx Jan 26, 2024
5c10cb4
Use new memory_efficient argument to fused layernorm functions when a…
jaredcasper Jan 24, 2024
4a08560
Add `num_floating_point_operations_so_far` arg to save_checkpoint cal…
mathemakitten Jan 26, 2024
3709708
Merge branch 'hn-save-checkpoint' into 'main'
jaredcasper Jan 26, 2024
88ddc36
Fixing the nightly ci for #1018.
yanring Jan 26, 2024
f5c5388
Merge branch 'zijie/fix_1018_nightly_tests' into 'main'
jaredcasper Jan 26, 2024
5cce2b5
Move e2e metrics tracking before training_log call
PytLab Jan 26, 2024
04d7b19
Merge branch 'main' into mblaz/moe-0.5-dist-ckpt
mikolajblaz Jan 26, 2024
1fc103f
formatting
cuichenx Jan 26, 2024
16e6e9b
typo
cuichenx Jan 26, 2024
3df96f1
Add _CPU_EXPERT_MODEL_PARALLEL_WORLD_SIZE flag in parallel-state to a…
akoumpa Jan 26, 2024
5cfe7b8
Merge branch 'akoumparouli/expert_model_parallel_world_size_setter' i…
ericharper Jan 26, 2024
567fab7
Fix formatting
Jan 26, 2024
f2a49ba
Merge branch 'layernorm-apex-update' into 'main'
jaredcasper Jan 26, 2024
195171f
Merge branch 'chcui/fix_rope_fusion_config' into 'main'
jaredcasper Jan 26, 2024
8d8241a
Support for raw and mock datasets
Jan 26, 2024
803a018
Merge branch 'raw-dataset' into 'main'
jaredcasper Jan 26, 2024
4223649
Merge branch 'main' into mblaz/moe-0.5-dist-ckpt
mikolajblaz Jan 29, 2024
eaaf92f
Adding bert local spec test
Jan 29, 2024
a4b5a9e
Fix `qkv_format` in TEDotProductAttention
cuichenx Jan 30, 2024
83bb191
Merge branch 'chcui/fix_rope_fusion_config' into 'main'
ericharper Jan 30, 2024
25a9946
Add support for masked WordPiece datasets BERT and T5
Jan 30, 2024
8312a3e
Merge branch 'masked-datasets' into 'main'
ericharper Jan 30, 2024
e2ff3e6
Remove config file and hardcoded cache path
PytLab Jan 30, 2024
05342e7
Merge branch 'mblaz/moe-0.5-dist-ckpt' into 'main'
jaredcasper Jan 30, 2024
329baac
Merge branch 'main' into 'local_spec_bert'
Jan 30, 2024
eef48ef
Fix the case when no tokens are allocated for local expert(s) with EP>1.
fanshiqing Jan 30, 2024
9f92da0
Merge branch 'moe_gmm_corner_case_fixw' into 'main'
ericharper Jan 30, 2024
0bfeeae
rename output layer
maxmatical Jan 30, 2024
a45805a
Generate causal mask for local layer spec
janekl Jan 30, 2024
d972605
Merge branch 'jlasek/generate_causal_mask_in_mcore' into 'main'
ericharper Jan 30, 2024
918d415
Update minor version
ericharper Jan 30, 2024
34c874e
Merge branch 'update_minor_version' into 'main'
jaredcasper Jan 30, 2024
bb53cf9
Merge pull request #3 from ServiceNow/max/rename-output-layer
maxmatical Jan 30, 2024
eeb1b21
use TE checkpointing when FP8
jiemingz Jan 30, 2024
530239b
Merge branch megatron-lm:main into fp8_recompute
jiemingz Jan 31, 2024
4bd4e74
Merge branch 'local_spec_bert' into 'main'
jaredcasper Jan 31, 2024
f8b277a
Remove unused hashlib
PytLab Jan 31, 2024
0fcbff0
Move grad-scale to loss.device
akoumpa Jan 30, 2024
ea52266
Merge branch 'feature/add-e2e-metrics-logging' into 'main'
jaredcasper Jan 31, 2024
c3d057f
code clean for moe.
fanshiqing Feb 1, 2024
a1ba50f
update readme.
fanshiqing Feb 1, 2024
2ee86c5
divide the selection_mean by top_k for normalization.
fanshiqing Feb 1, 2024
2e1f869
add license.
fanshiqing Feb 1, 2024
e5102e7
update readme.
fanshiqing Feb 1, 2024
6aad211
JET Migration Updates
maanug-nv Feb 1, 2024
3d201d7
Merge branch 'maanug/jet-recipes' into 'main'
jaredcasper Feb 1, 2024
50f8384
Fixing bugs in inference and adding mcore support
Feb 1, 2024
7329f73
Fixing bugs in inference and adding mcore support
Feb 1, 2024
376337d
Fixing bugs in inference and adding mcore support
Feb 1, 2024
cb995d5
Merge branch 'fp8_recompute' into 'main'
jaredcasper Feb 1, 2024
d91c5a6
Fixing bugs in inference and adding mcore support
Feb 1, 2024
7628c3a
Merge branch 'akoumparouli/loss_scale_fix' into 'main'
jaredcasper Feb 1, 2024
075d5b0
rename test_switch_mlp to test_sequential_mlp
fanshiqing Feb 2, 2024
680b67c
Move Megatron timer to core
Feb 2, 2024
8b691b9
Merge branch 'abhandare_timer' into 'main'
ericharper Feb 2, 2024
b87f069
Merge branch 'inference_fix' into 'main'
jaredcasper Feb 2, 2024
259f06e
Merge branch 'code_clean' into 'main'
jaredcasper Feb 2, 2024
aa96ab7
JET fix: Migrate tests and run functional results always not on success
maanug-nv Feb 3, 2024
3e1a635
Merge branch 'maanug/jet-hotfix' into 'main'
maanug-nv Feb 3, 2024
f89f388
MoE argument sanity checks
akoumpa Feb 6, 2024
487ba73
Merge branch 'akoumparouli/arg_sanity_check' into 'main'
ericharper Feb 6, 2024
f6995e5
add add_qkv_bias config
Feb 6, 2024
02d284d
Merge branch 'xueh/add_qkv_bias' into 'main'
jaredcasper Feb 6, 2024
c8f50b4
Minor fixes for JET CI
maanug-nv Feb 6, 2024
7c1dd65
Merge branch 'maanug/jet-minor-fixes' into 'main'
jaredcasper Feb 6, 2024
9760e11
Tokenizer fix
jlamypoirier Feb 6, 2024
94ce57b
Merge remote-tracking branch 'nvidia/main' into compare_tensors_updated
jlamypoirier Feb 6, 2024
bb235cc
Check if config has num_moe_experts
akoumpa Feb 6, 2024
b02e62e
Merge branch 'akoumparouli/moe_config_check' into 'main'
jaredcasper Feb 6, 2024
548e57a
Add dist ckpt package docs for Sphinx documentation
mikolajblaz Feb 6, 2024
240a8ef
Merge branch 'mblaz/dist-ckpt-docs' into 'main'
jaredcasper Feb 6, 2024
960c06b
Fix oob perf
wdykas Feb 6, 2024
1390944
Merge branch 'fix-oob-perf' into 'main'
jaredcasper Feb 6, 2024
260c4f2
Add interleaved rotary embedding in MCore
Feb 6, 2024
6d6f9af
Merge branch 'xueh/rotary_interleaved' into 'main'
jaredcasper Feb 6, 2024
6fdbfa7
fix activation checkpointing mutation
gshennvm Feb 6, 2024
169bfa4
Merge branch 'geshen/fix_activation_mutation' into 'main'
ericharper Feb 6, 2024
b22634d
fix
jlamypoirier Feb 6, 2024
2165919
Better wandb
jlamypoirier Feb 7, 2024
c478f48
misc
jlamypoirier Feb 7, 2024
b6ce193
[MoE] fix the convergence issue when EP>1 and K>1
yanring Feb 7, 2024
98da379
Merge branch 'zijiey/fix_top2_dispatcher' into 'main'
ericharper Feb 7, 2024
84c7af2
Use view() to set param_buffer from grad_buffer
wangxicoding Dec 26, 2023
2fb398c
Add missing num_floating_point_operations_so_far argument to save_che…
deepakn94 Feb 7, 2024
0f0279a
Merge branch 'save_checkpoint_fix' into 'main'
jaredcasper Feb 7, 2024
0052bf0
Merge branch 'fix_param_buffer_peak_memory' into 'main'
jaredcasper Feb 7, 2024
6e25554
Adding back the changes needed in timers.py for E2E work
Feb 9, 2024
a8182ee
Fixed atomic gemm defaults/fixed the offloading check
Feb 10, 2024
daf0006
Put embedding layers in separate buckets to make sure embedding tying…
deepakn94 Jan 28, 2024
a73b113
Ran black(19.10b0) on megatron/core
Feb 12, 2024
2482a4a
Use MCore for distributed optimizer tests
deepakn94 Feb 9, 2024
5566742
Merge branch 'main' into compare_tensors_updated
jlamypoirier Feb 13, 2024
9e17a15
Condition TE init_method on config.perform_initialization.
lmcafee-nvidia Feb 13, 2024
55f3502
Merge branch 'lmcafee/te-noinit-fix' into 'main'
jaredcasper Feb 13, 2024
32f9155
Move optimizers to MCore
deepakn94 Feb 13, 2024
eedfe53
Merge branch 'dist_optimizer_to_mcore' into 'main'
ericharper Feb 13, 2024
db2040f
Merge branch 'tied_embeddings' into 'main'
deepakn94 Feb 13, 2024
6f3d5a4
Merge branch 'Add_back_Timer_Code_changes_for_E2E' into 'main'
jaredcasper Feb 14, 2024
5b4bbd5
add support wrapper for TE TransformerLayer in mcore
sudhakarsingh27 Feb 14, 2024
5f9c870
Merge branch 'te_transformer_layer_wrapper_in_mcore' into 'main'
jaredcasper Feb 14, 2024
1b6ae27
Fixing examples
Feb 15, 2024
4ec7835
Merge branch 'bugfixexample' into 'main'
jaredcasper Feb 15, 2024
72a255a
[MoE] Expert data parallel w/ ZeRO-1 support
shjwudp Feb 21, 2024
90568ae
Merge branch 'edp_with_zero1' into 'main'
ericharper Feb 21, 2024
528d7cf
Merge branch 'config_default' into 'main'
jaredcasper Feb 22, 2024
a67ffda
Make sure data_end_index is padded when creating new buckets
deepakn94 Feb 16, 2024
5afa5da
Mcore CLIP ViT model
trintamaki Feb 24, 2024
6d14c7e
Merge branch 'trintamaki/clip-vit-model' into 'main'
ericharper Feb 24, 2024
ad53b1e
Merge branch 'dist_optimizer_bugfix' into 'main'
deepakn94 Feb 24, 2024
9530e19
Print number of transformer and embedding parameters separately
deepakn94 Feb 26, 2024
5f1f813
Unify resume and correctness functional tests
mikolajblaz Feb 27, 2024
70e469d
Merge branch 'mblaz/unify-resume-and-correctness-func-tests' into 'main'
maanug-nv Feb 27, 2024
1fcdc95
Mcore mock multimodal dataset
trintamaki Feb 27, 2024
1dada7e
Merge branch 'trintamaki/dummy-multimodal-dataset' into 'main'
jaredcasper Feb 27, 2024
d668077
Fix NaN checking in grads: should be performed before data-parallel c…
deepakn94 Dec 5, 2023
53a350e
Merge branch 'check_nan_in_grad' into 'main'
deepakn94 Feb 28, 2024
9677b3b
Make throughput and memory footprint formulae compatible with arbitra…
deepakn94 Feb 29, 2024
3dafc0e
Move to Draco OCI
maanug-nv Feb 29, 2024
17c487a
Merge branch 'maanug/jet-oci' into 'main'
maanug-nv Feb 29, 2024
3b0fcd1
Merge branch 'theoretical_memory_fix' into 'main'
jaredcasper Mar 1, 2024
7bc3c74
Mcore LLaVA model
trintamaki Mar 1, 2024
d1acce3
Merge branch 'trintamaki/llava-model-mr' into 'main'
jaredcasper Mar 1, 2024
80e180d
[OMNIML-614] AMMO ptq + TensorRT-LLM export examples for megatron-lm
ChenhanYu Mar 1, 2024
36e9b6b
Merge branch 'chenhany/ammo_ptq_example' into 'main'
jaredcasper Mar 1, 2024
0c1e53d
Merge branch 'variable_ffn_size' into 'main'
deepakn94 Mar 3, 2024
47cb630
Experimental Yaml configs
wdykas Mar 5, 2024
8957468
Merge branch 'yaml' into 'main'
jaredcasper Mar 5, 2024
63d9d3e
MOE support
jlamypoirier Mar 8, 2024
40a134a
stuff
jlamypoirier Mar 8, 2024
1a96a99
Merge branch 'main' into compare_tensors_updated
jlamypoirier Mar 8, 2024
fdd668c
Support megatron core models
jlamypoirier Mar 11, 2024
4238a80
Fix arg
jlamypoirier Mar 11, 2024
fe38434
fixes
jlamypoirier Mar 12, 2024
3c6652e
fix
jlamypoirier May 29, 2024
777 changes: 12 additions & 765 deletions .gitlab-ci.yml

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions MANIFEST.in
@@ -0,0 +1 @@
include megatron/core/requirements.txt
8 changes: 7 additions & 1 deletion README.md
@@ -241,7 +241,7 @@ With full global batch size of 1536 on 1024 A100 GPUs, each iteration takes arou


Retro [(Borgeaud et al., 2022)](https://arxiv.org/abs/2112.04426) is an autoregressive decoder-only language model (LM) pretrained with retrieval-augmentation.
Retro features practical scalibility to support large-scale pretraining from scratch by retrieving from trillions of token.
Retro features practical scalability to support large-scale pretraining from scratch by retrieving from trillions of tokens.
Pretraining with retrieval provides a more efficient storage mechanism of factual knowledge, when compared to storing factual knowledge implicitly within the network's parameters, thus largely reducing model parameters while achieving lower perplexity than standard GPT.
Retro also provides the flexibility to update the
knowledge stored in LMs [(Wang et al., 2023a)](https://arxiv.org/abs/2304.06762)
@@ -519,6 +519,12 @@ The Llama-2 [family of models](https://ai.meta.com/llama/) are an open-source se

The Llama-2 checkpoints can be loaded into Megatron for inference and finetuning. See documentation [here](docs/llama2.md).

# Model Optimization and Deployment
Megatron-Core (MCore) `GPTModel` family supports advanced quantization algorithms and high-performance deployment through TensorRT-LLM.

## Quantization and TensorRT-LLM Deployment
See [Megatron Model Optimization and Deployment](examples/modelopt/README.md) for `llama2` and `nemotron3` examples.

# Datasets
We do not host any datasets for GPT or BERT training, however, we detail their collection so that our results may be reproduced.

38 changes: 24 additions & 14 deletions docs/source/api-guide/dist_checkpointing.rst
@@ -1,6 +1,15 @@
dist\_checkpointing package
===========================

A library for saving and loading distributed checkpoints.
A "distributed checkpoint" can have various underlying formats (the current default format is based on Zarr),
but has a distinctive property: a checkpoint saved in one parallel configuration (tensor/pipeline/data parallelism)
can be loaded in a different parallel configuration.

Using the library requires defining sharded state_dict dictionaries with functions from the *mapping* and *optimizer* modules.
Those state dicts can then be saved or loaded with the *serialization* module, using strategies from the *strategies* module.
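
A minimal sketch of this flow, assuming the `megatron.core.dist_checkpointing` entry points described here (`ShardedTensor.from_rank_offsets`, `save`, `load`); exact signatures may differ between versions:

.. code-block:: python

    # Hedged sketch: save/load a parallel-agnostic distributed checkpoint.
    # Assumes ShardedTensor.from_rank_offsets, save() and load() as exposed by
    # this package; run under an initialized torch.distributed process group.
    import torch
    from megatron.core import dist_checkpointing
    from megatron.core.dist_checkpointing import ShardedTensor

    tp_rank, tp_size = 0, 1  # placeholders; normally read from parallel_state

    # Wrap each local shard with its global shape/offset metadata.
    local_weight = torch.zeros(1024, 4096)
    sharded_state_dict = {
        'decoder.linear.weight': ShardedTensor.from_rank_offsets(
            'decoder.linear.weight',  # global key inside the checkpoint
            local_weight,
            (0, tp_rank, tp_size),    # dim 0 is sharded across TP ranks
        ),
    }

    dist_checkpointing.save(sharded_state_dict, '/tmp/dist_ckpt')
    # The checkpoint can now be loaded under a different
    # tensor/pipeline/data-parallel configuration.
    loaded = dist_checkpointing.load(sharded_state_dict, '/tmp/dist_ckpt')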


Subpackages
-----------

@@ -12,18 +21,10 @@ Subpackages
Submodules
----------

dist\_checkpointing.core module
-------------------------------

.. automodule:: core.dist_checkpointing.core
:members:
:undoc-members:
:show-inheritance:

dist\_checkpointing.dict\_utils module
--------------------------------------
dist\_checkpointing.serialization module
----------------------------------------

.. automodule:: core.dist_checkpointing.dict_utils
.. automodule:: core.dist_checkpointing.serialization
:members:
:undoc-members:
:show-inheritance:
@@ -44,14 +45,23 @@ dist\_checkpointing.optimizer module
:undoc-members:
:show-inheritance:

dist\_checkpointing.serialization module
----------------------------------------
dist\_checkpointing.core module
-------------------------------

.. automodule:: core.dist_checkpointing.serialization
.. automodule:: core.dist_checkpointing.core
:members:
:undoc-members:
:show-inheritance:

dist\_checkpointing.dict\_utils module
--------------------------------------

.. automodule:: core.dist_checkpointing.dict_utils
:members:
:undoc-members:
:show-inheritance:


dist\_checkpointing.utils module
--------------------------------

5 changes: 5 additions & 0 deletions docs/source/api-guide/dist_checkpointing.strategies.rst
@@ -1,6 +1,11 @@
dist\_checkpointing.strategies package
======================================

Package defining different checkpoint formats (backends) and saving/loading algorithms (strategies).

Strategies can be used to implement new checkpoint formats, or new (more optimal for a given use case) ways of saving/loading existing formats.
Strategies are passed to the `dist_checkpointing.load` and `dist_checkpointing.save` functions and control the actual saving/loading procedure.
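
As a hedged illustration of how a custom strategy plugs in (the `SaveShardedStrategy` base class and the `sharded_strategy` keyword are assumptions about this version of the API; check `strategies/base.py` for the actual interface):

.. code-block:: python

    # Hedged sketch: controlling the on-disk format with an explicit strategy.
    # SaveShardedStrategy and the sharded_strategy keyword are assumptions
    # about this version of the API, not a confirmed interface.
    from megatron.core import dist_checkpointing
    from megatron.core.dist_checkpointing.strategies.base import SaveShardedStrategy

    class LoggingSaveStrategy(SaveShardedStrategy):
        """Toy strategy: report what would be persisted instead of writing."""

        def save(self, sharded_state_dict, checkpoint_dir):
            for key in sharded_state_dict:
                print(f'would save shard {key} into {checkpoint_dir}')

    dist_checkpointing.save(
        sharded_state_dict={},  # built as in the dist_checkpointing example
        checkpoint_dir='/tmp/dist_ckpt',
        sharded_strategy=LoggingSaveStrategy(backend='zarr', version=1),
    )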

Submodules
----------

16 changes: 12 additions & 4 deletions docs/source/api-guide/distributed.rst
@@ -1,6 +1,14 @@
distributed package
===================

This package contains various utilities to finalize model weight gradients
on each rank before the optimizer step. This includes a distributed data
parallelism wrapper to all-reduce or reduce-scatter the gradients across
data-parallel replicas, and a `finalize\_model\_grads` method to
synchronize gradients across different parallelism modes (e.g., 'tied'
layers on different pipeline stages, or gradients for experts in a MoE on
different ranks due to expert parallelism).
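
A conceptual sketch of the per-bucket synchronization the data-parallel wrapper performs, in plain `torch.distributed` rather than this package's actual classes:

.. code-block:: python

    # Conceptual sketch only: what the DP wrapper does for one gradient bucket.
    # Plain torch.distributed; not the megatron.core.distributed classes.
    import torch
    import torch.distributed as dist

    def sync_grad_bucket(bucket: torch.Tensor, use_distributed_optimizer: bool,
                         group=None) -> torch.Tensor:
        """Average one flattened gradient bucket across data-parallel replicas."""
        world_size = dist.get_world_size(group)
        bucket.div_(world_size)  # pre-divide so the reduction averages
        if use_distributed_optimizer:
            # Reduce-scatter: each rank keeps only its own gradient shard.
            assert bucket.numel() % world_size == 0
            shard = torch.empty(bucket.numel() // world_size,
                                dtype=bucket.dtype, device=bucket.device)
            dist.reduce_scatter_tensor(shard, bucket, group=group)
            return shard
        dist.all_reduce(bucket, group=group)  # every rank gets the full average
        return bucket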

Submodules
----------

@@ -21,10 +29,10 @@ reduce-scatter on each bucket asynchronously.
distributed.finalize\_model\_grads
----------------------------------

Finalize model grads for optimizer step across all used parallelism modes.
Synchronizes the all-reduce / reduce-scatter of model grads across DP replicas,
and all-reduces the layernorm grads for sequence parallelism, embedding grads
across first and last pipeline stages (if not tied), and expert grads for expert
Finalize model gradients for optimizer step across all used parallelism modes.
Synchronizes the all-reduce / reduce-scatter of model gradients across DP replicas,
all-reduces the layernorm gradients for sequence parallelism, embedding gradients
across first and last pipeline stages (if not tied), and expert gradients for expert
parallelism.

.. automodule:: core.distributed.finalize_model_grads
18 changes: 18 additions & 0 deletions docs/source/api-guide/pipeline_parallel.rst
Original file line number Diff line number Diff line change
@@ -1,12 +1,22 @@
pipeline\_parallel package
==========================

This package contains implementations for two different pipeline parallelism
schedules (one without interleaving and one with interleaving, see `Efficient
Large-Scale Language Model Training on GPU Clusters Using Megatron-LM <https://arxiv.org/abs/2104.04473>`_
for details), and a default no-pipelining schedule. It also contains methods
for the point-to-point communication that is needed between pipeline stages.

Submodules
----------

pipeline\_parallel.p2p\_communication module
--------------------------------------------

Contains implementations of the point-to-point communication operations
(e.g., `recv_forward` and `recv_backward`) needed by the different pipeline
parallelism schedules.

.. automodule:: core.pipeline_parallel.p2p_communication
:members:
:undoc-members:
@@ -15,6 +25,14 @@ pipeline\_parallel.p2p\_communication module
pipeline\_parallel.schedules module
-----------------------------------

Contains implementations for two pipeline parallelism schedules
(`forward_backward_pipelining_with_interleaving` for pipeline parallelism with
interleaving, `forward_backward_pipelining_without_interleaving` for pipeline
parallelism without interleaving) and a default no-pipelining schedule
(`forward_backward_no_pipelining`). `get_forward_backward_func` returns the right
scheduling function to use based on the configuration being trained
(e.g., if pipeline-parallel size is 1, use `forward_backward_no_pipelining`).
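
A hedged usage sketch of this dispatch (`my_iterator` and `model` are assumed to exist; the keyword names follow this era of the schedules module and may differ in other versions):

.. code-block:: python

    # Hedged sketch: select and run the appropriate schedule.
    # get_forward_backward_func() returns forward_backward_no_pipelining when
    # the pipeline-parallel size is 1, otherwise a pipelining schedule.
    from megatron.core.pipeline_parallel import get_forward_backward_func

    def forward_step(data_iterator, model):
        batch = next(data_iterator)
        output = model(batch['tokens'])
        # Second return value turns the raw output into (loss, logging dict).
        return output, lambda out: (out.mean(), {'lm loss': out.mean()})

    forward_backward_func = get_forward_backward_func()
    losses = forward_backward_func(
        forward_step_func=forward_step,
        data_iterator=my_iterator,  # assumed: a microbatch iterator
        model=model,                # assumed: a configured MCore model
        num_microbatches=8,
        seq_length=2048,
        micro_batch_size=1,
        forward_only=False,
    )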

.. automodule:: core.pipeline_parallel.schedules
:members:
:undoc-members:
6 changes: 6 additions & 0 deletions docs/source/api-guide/tensor_parallel.rst
@@ -1,6 +1,12 @@
tensor\_parallel package
========================

This package contains an implementation for tensor parallelism in transformer
models (see `Megatron-LM: Training Multi-Billion Parameter Language Models
Using Model Parallelism <https://arxiv.org/abs/1909.08053>`_ and `Reducing
Activation Recomputation in Large Transformer Models <https://arxiv.org/abs/2205.05198>`_
for details).
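
A conceptual sketch of the basic building block (a column-split linear layer), in plain PyTorch rather than this package's `ColumnParallelLinear`:

.. code-block:: python

    # Conceptual sketch only: a column-parallel linear layer, the basic
    # building block of tensor parallelism. Each rank holds a column slice of
    # the weight; an all-gather (omitted here) would reassemble full outputs.
    import torch
    import torch.nn as nn

    class ToyColumnParallelLinear(nn.Module):
        def __init__(self, in_features: int, out_features: int,
                     tp_size: int, tp_rank: int):
            super().__init__()
            assert out_features % tp_size == 0
            # This rank's slice of the full [in_features, out_features] weight.
            self.local = nn.Linear(in_features, out_features // tp_size)
            self.tp_rank = tp_rank

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Produces this rank's slice of the output features; in Megatron-LM
            # the slices are gathered across the tensor-parallel group.
            return self.local(x)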

Submodules
----------

6 changes: 3 additions & 3 deletions examples/bert/train_bert_340m_distributed.sh
@@ -12,9 +12,9 @@ NUM_NODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))

CHECKPOINT_PATH=$0 #<Specify path>
TENSORBOARD_LOGS_PATH=$1 #<Specify path>
VOCAB_FILE=$2 #<Specify path to file>/bert-vocab.json
CHECKPOINT_PATH=$1 #<Specify path>
TENSORBOARD_LOGS_PATH=$2 #<Specify path>
VOCAB_FILE=$3 #<Specify path to file>/bert-vocab.json
DATA_PATH=$4 #<Specify path and file prefix>_text_document

DISTRIBUTED_ARGS=(
132 changes: 132 additions & 0 deletions examples/deploy/README.md
@@ -0,0 +1,132 @@
# Megatron Model Optimization and Deployment

## Installation
We recommend that users follow TensorRT-LLM's official installation guide to build it from source
and proceed with a containerized environment (`docker.io/tensorrt_llm/release:latest`):

```
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout v0.7.1
make -C docker release_build
```

> **TROUBLESHOOTING:** rather than copying each folder separately in `docker/Dockerfile.multi`,
> you may need to copy the entire directory with `COPY ./ /src/tensorrt_llm`, since `git submodule` is
> invoked later and requires the `.git` directory to be present.

Once the container is built, install `nvidia-ammo` and additional dependencies for sharded checkpoint support:
```
pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
pip install zarr tensorstore==0.1.45
```
TensorRT-LLM quantization functionalities are currently packaged in `nvidia-ammo`.
You can find more documentation about `nvidia-ammo` in [TensorRT-LLM's quantization
examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization).

## Support Matrix

The following matrix shows the current support for the PTQ + TensorRT-LLM export flow.

| model | fp16 | int8_sq | fp8 | int4_awq |
|-----------------------------|------|---------|-----|----------|
| nextllm-2b | x | x | x | |
| nemotron3-8b | x | | x | |
| nemotron3-15b | x | | x | |
| llama2-text-7b | x | x | x | TP2 |
| llama2-chat-70b | x | x | x | TP4 |

Our PTQ + TensorRT-LLM flow has native support on MCore `GPTModel` with a mixed layer spec (native `ParallelLinear`
and Transformer-Engine `TENorm`). Note that this is not the default MCore GPT spec. You can still load the
following checkpoint formats with some remedies:

| GPTModel | sharded | remedy arguments |
|-----------------------------------|---------|-----------------------------------------|
| megatron.model | | `--ammo-load-classic-megatron-to-mcore` |
| TE-Fused (default mcore gpt spec) | | `--ammo-convert-te-to-local-spec` |
| TE-Fused (default mcore gpt spec) | x | |

> **TROUBLESHOOTING:** If you are trying to load an unpacked `.nemo` sharded checkpoint, then you will typically
> need to add `additional_sharded_prefix="model."` to the `ammo_load_checkpoint()` call, since NeMo adds a
> `model.` wrapper on top of the `GPTModel`.

> **NOTE:** The `--ammo-load-classic-megatron-to-mcore` flag may not work on all legacy checkpoint versions.

## Examples

> **NOTE:** we only provide a simple text generation script to test the generated TensorRT-LLM engines. For
> a production-level API server or enterprise support, see [NeMo](https://github.com/NVIDIA/NeMo) and TensorRT-LLM's
> backend for [NVIDIA Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server).

### nemotron3-8B FP8 Quantization and TensorRT-LLM Deployment
First download the Nemotron checkpoint from https://huggingface.co/nvidia/nemotron-3-8b-base-4k, extract the
sharded checkpoint from the `.nemo` tarball, and fix the tokenizer file name.

> **NOTE:** The following cloning method uses `ssh` and assumes you have registered your SSH key with Hugging Face.
> If you want to clone with `https` instead, run `git clone https://huggingface.co/nvidia/nemotron-3-8b-base-4k` with an access token.

```sh
git lfs install
git clone [email protected]:nvidia/nemotron-3-8b-base-4k
cd nemotron-3-8b-base-4k
tar -xvf Nemotron-3-8B-Base-4k.nemo
mv 586f3f51a9cf43bc9369bd53fa08868c_a934dc7c3e1e46a6838bb63379916563_3feba89c944047c19d5a1d0c07a85c32_mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model
cd ..
```

Now launch the PTQ + TensorRT-LLM export script:
```
bash examples/deploy/ptq_trtllm_nemotron3_8b ./nemotron-3-8b-base-4k None
```
By default, `cnn_dailymail` is used for calibration. The `GPTModel` will have quantizers for simulating the
quantization effect. The checkpoint can optionally be saved (with the quantizers as additional states) and
restored for further evaluation. The TensorRT-LLM engine is exported to `/tmp/ammo` by default.

The script expects `${CHECKPOINT_DIR}` (`./nemotron-3-8b-base-4k`) to have the following structure:
```
├── model_weights
│ ├── common.pt
│ ...
├── model_config.yaml
├── mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model
```

> **NOTE:** The script uses `TP=8`. Change `$TP` in the script if your checkpoint uses a different tensor
> model parallelism.

> **KNOWN ISSUES:** The `mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model` in the checkpoint is for
> Megatron-LM's `GPTSentencePiece` tokenizer.
> For TensorRT-LLM, we attempt to load this tokenizer as a Hugging Face `T5Tokenizer` by changing some special
> tokens and overriding `encode` and `batch_decode`. As a result, the tokenizer behavior in the TensorRT-LLM
> engine may not match exactly.

> **TROUBLESHOOTING:** If you are loading a `.nemo` sharded checkpoint here, call
> `ammo_load_checkpoint(..., additional_sharded_prefix="model.")` in
> `text_generation_ptq.py` to align the sharded keys.

### llama2-text-7b INT8 SmoothQuant and TensorRT-LLM Deployment
> **NOTE:** Due to licensing restrictions, we do not provide an MCore checkpoint to download. Users can follow
> the instructions in `docs/llama2.md` to convert the checkpoint to the classic Megatron `GPTModel` format and
> use the `--ammo-load-classic-megatron-to-mcore` flag, which remaps the checkpoint to the MCore `GPTModel` spec
> that we support.

```sh
bash examples/deploy/ptq_trtllm_llama_7b.sh ${CHECKPOINT_DIR}
```

The script expects `${CHECKPOINT_DIR}` to have the following structure:
```
├── hf
│ ├── tokenizer.config
│ ├── tokenizer.model
│ ...
├── iter_0000001
│ ├── mp_rank_00
│ ...
├── latest_checkpointed_iteration.txt
```
In short, in addition to the converted Llama Megatron checkpoint, place the Hugging Face checkpoint inside the
directory as the source of the tokenizer.