Commit 843a22e

yueshen2016 authored and ericharper committed
ADLR/megatron-lm!2180 - rotary_scaling fix for llama3.1 and 3.2
1 parent 065260b commit 843a22e

File tree

9 files changed, +50 -32 lines changed
examples/export/ptq_and_trtllm_export/README.md

Lines changed: 35 additions & 13 deletions
@@ -74,7 +74,7 @@ cd ../..
 
 Now launch the PTQ + TensorRT-LLM export script,
 ```sh
-bash examples/inference/quantization/ptq_trtllm_minitron_8b ./Minitron-8B-Base None
+bash examples/export/ptq_and_trtllm_export/ptq_trtllm_minitron_8b ./Minitron-8B-Base None
 ```
 By default, `cnn_dailymail` is used for calibration. The `GPTModel` will have quantizers for simulating the
 quantization effect. The checkpoint will be saved optionally (with quantizers as additional states) and can

@@ -104,12 +104,12 @@ export trtllm_options=" \
     --checkpoint_dir /tmp/trtllm_ckpt \
     --output_dir /tmp/trtllm_engine \
     --max_input_len 2048 \
-    --max_output_len 512 \
+    --max_seq_len 512 \
     --max_batch_size 8 "
 
 trtllm-build ${trtllm_options}
 
-python examples/inference/quantization/trtllm_text_generation.py --tokenizer nvidia/Minitron-8B-Base
+python examples/export/ptq_and_trtllm_export/trtllm_text_generation.py --tokenizer nvidia/Minitron-8B-Base
 ```
 
 ### mistral-12B FP8 Quantization and TensorRT-LLM Deployment

@@ -139,7 +139,7 @@ huggingface-cli login
 Now launch the PTQ + TensorRT-LLM checkpoint export script,
 
 ```sh
-bash examples/inference/quantization/ptq_trtllm_mistral_12b.sh ./Mistral-NeMo-12B-Base None
+bash examples/export/ptq_and_trtllm_export/ptq_trtllm_mistral_12b.sh ./Mistral-NeMo-12B-Base None
 ```
 
 Then build TensorRT engine and run text generation example using the newly built TensorRT engine

@@ -149,12 +149,12 @@ export trtllm_options=" \
     --checkpoint_dir /tmp/trtllm_ckpt \
     --output_dir /tmp/trtllm_engine \
     --max_input_len 2048 \
-    --max_output_len 512 \
+    --max_seq_len 512 \
     --max_batch_size 8 "
 
 trtllm-build ${trtllm_options}
 
-python examples/inference/quantization/trtllm_text_generation.py --tokenizer mistralai/Mistral-Nemo-Base-2407
+python examples/export/ptq_and_trtllm_export/trtllm_text_generation.py --tokenizer mistralai/Mistral-Nemo-Base-2407
 ```
 
 

@@ -165,7 +165,7 @@ python examples/inference/quantization/trtllm_text_generation.py --tokenizer mis
 > that we support.
 
 ```sh
-bash examples/inference/quantization/ptq_trtllm_llama_7b.sh ${CHECKPOINT_DIR}
+bash examples/export/ptq_and_trtllm_export/ptq_trtllm_llama_7b.sh ${CHECKPOINT_DIR}
 ```
 
 The script expect `${CHECKPOINT_DIR}` to have the following structure:

@@ -184,8 +184,23 @@ The script expect `${CHECKPOINT_DIR}` to have the following structure:
 In short, other than the converted llama megatron checkpoint, also put the Hugging Face checkpoint inside as
 the source of the tokenizer.
 
+Then build TensorRT engine and run text generation example using the newly built TensorRT engine
+
+```sh
+export trtllm_options=" \
+    --checkpoint_dir /tmp/trtllm_ckpt \
+    --output_dir /tmp/trtllm_engine \
+    --max_input_len 2048 \
+    --max_seq_len 512 \
+    --max_batch_size 8 "
+
+trtllm-build ${trtllm_options}
+
+python examples/export/ptq_and_trtllm_export/trtllm_text_generation.py --tokenizer meta-llama/Llama-2-7b
+```
+
 ### llama3-8b / llama3.1-8b INT8 SmoothQuant and TensorRT-LLM Deployment
-> **NOTE:** For llama3.1, the missing rope_scaling parameter will be fixed in modelopt-0.17 and trtllm-0.12.
+> **NOTE:** For llama3.1, the missing rope_scaling parameter will be fixed in modelopt-0.19 and trtllm-0.13.
 
 > **NOTE:** There are two ways to acquire the checkpoint. Users can follow
 > the instruction in `docs/llama2.md` to convert the checkpoint to megatron legacy `GPTModel` format and

@@ -199,16 +214,23 @@ If users choose to download the model from NGC, first extract the sharded checkp
 tar -xvf 8b_pre_trained_bf16.nemo
 ```
 
+> **NOTE:** You need a token generated from huggingface.co/settings/tokens and access to meta-llama/Llama-3.1-8B or meta-llama/Llama-3-8B on huggingface
+
+```sh
+pip install -U "huggingface_hub[cli]"
+huggingface-cli login
+```
+
 Now launch the PTQ + TensorRT-LLM checkpoint export script for llama-3,
 
 ```sh
-bash examples/inference/quantization/ptq_trtllm_llama3_8b.sh ./llama-3-8b-nemo_v1.0 None
+bash examples/export/ptq_and_trtllm_export/ptq_trtllm_llama3_8b.sh ./llama-3-8b-nemo_v1.0 None
 ```
 
 or llama-3.1
 
 ```sh
-bash examples/inference/quantization/ptq_trtllm_llama3_1_8b.sh ./llama-3_1-8b-nemo_v1.0 None
+bash examples/export/ptq_and_trtllm_export/ptq_trtllm_llama3_1_8b.sh ./llama-3_1-8b-nemo_v1.0 None
 ```
 
 Then build TensorRT engine and run text generation example using the newly built TensorRT engine

@@ -218,14 +240,14 @@ export trtllm_options=" \
     --checkpoint_dir /tmp/trtllm_ckpt \
     --output_dir /tmp/trtllm_engine \
     --max_input_len 2048 \
-    --max_output_len 512 \
+    --max_seq_len 512 \
     --max_batch_size 8 "
 
 trtllm-build ${trtllm_options}
 
-python examples/inference/quantization/trtllm_text_generation.py --tokenizer meta-llama/Meta-Llama-3-8B
+python examples/export/ptq_and_trtllm_export/trtllm_text_generation.py --tokenizer meta-llama/Meta-Llama-3-8B
 # For llama-3
 
-python examples/inference/quantization/trtllm_text_generation.py --tokenizer meta-llama/Meta-Llama-3.1-8B
+python examples/export/ptq_and_trtllm_export/trtllm_text_generation.py --tokenizer meta-llama/Meta-Llama-3.1-8B
 #For llama-3.1
 ```
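Besides the path updates, the README hunks above swap `--max_output_len` for `--max_seq_len` in the `trtllm-build` options and bump the note about when the missing `rope_scaling` parameter lands (modelopt-0.19 / trtllm-0.13). A quick, illustrative way to confirm an environment meets those versions is sketched below; the PyPI package names `nvidia-modelopt` and `tensorrt-llm` are the usual ones but should be treated as assumptions for your setup.

```python
# Illustrative version check for the NOTE above (package names assumed).
from importlib.metadata import PackageNotFoundError, version

for pkg, minimum in [("nvidia-modelopt", "0.19"), ("tensorrt-llm", "0.13")]:
    try:
        print(f"{pkg}: {version(pkg)} installed (need >= {minimum})")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```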

examples/export/ptq_and_trtllm_export/ptq_trtllm_llama2_7b.sh

Lines changed: 2 additions & 4 deletions
@@ -66,7 +66,7 @@ options=" \
     --tokenizer-model ${TOKENIZER_MODEL} \
     --save-interval 1000000 \
     --use-dist-ckpt \
-    --load ${CHECKPOINT_LOAD_DIR}
+    --load ${CHECKPOINT_LOAD_DIR} \
     --fp16"
 
 # Precompile CUDA extentions

@@ -76,7 +76,5 @@ python -c "import modelopt.torch.quantization.extensions as ext; print(ext.cuda_
 launch_config="--nproc_per_node=${TP}"
 
 # Launch multi-process with torchrun
-torchrun ${launch_config} examples/inference/quantization/text_generation_ptq.py ${options} ${additional_options}
+torchrun ${launch_config} examples/export/ptq_and_trtllm_export/text_generation_ptq.py ${options} ${additional_options}
 
-# This script is using mpi4py which will fork multiple processes.
-python examples/inference/quantization/trtllm_text_generation.py ${trtllm_options}

examples/export/ptq_and_trtllm_export/ptq_trtllm_llama3_1_8b.sh

Lines changed: 4 additions & 3 deletions
@@ -63,9 +63,10 @@ options=" \
     --tokenizer-type HuggingFaceTokenizer \
     --tokenizer-model meta-llama/Meta-Llama-3.1-8B \
     --save-interval 1000000 \
+    --use-rope-scaling \
     --use-dist-ckpt \
-    --load ${CHECKPOINT_LOAD_DIR}
-    --rotary-base 500000
+    --load ${CHECKPOINT_LOAD_DIR} \
+    --rotary-base 500000 \
     --fp16"
 
 # Precompile CUDA extentions

@@ -75,4 +76,4 @@ python -c "import modelopt.torch.quantization.extensions as ext; print(ext.cuda_
 launch_config="--nproc_per_node=${TP}"
 
 # Launch multi-process with torchrun
-torchrun ${launch_config} examples/inference/quantization/text_generation_ptq.py ${options} ${additional_options}
+torchrun ${launch_config} examples/export/ptq_and_trtllm_export/text_generation_ptq.py ${options} ${additional_options}
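Besides repointing the script at the new example path, this hunk restores the trailing backslashes in the multi-line `options` string and adds `--use-rope-scaling` for the llama3.1 checkpoint. For context, a minimal sketch of the Llama 3.1/3.2-style RoPE frequency scaling that flag is meant to enable is shown below; the constants follow Meta's published Llama 3.1 `rope_scaling` configuration, and this is an illustration rather than Megatron's exact implementation.

```python
import math

def apply_llama31_rope_scaling(inv_freq, factor=8.0, low_freq_factor=1.0,
                               high_freq_factor=4.0, original_max_pos=8192):
    """Scale RoPE inverse frequencies Llama 3.1-style (illustrative sketch)."""
    low_freq_wavelen = original_max_pos / low_freq_factor
    high_freq_wavelen = original_max_pos / high_freq_factor
    scaled = []
    for freq in inv_freq:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:        # high-frequency band: unchanged
            scaled.append(freq)
        elif wavelen > low_freq_wavelen:       # low-frequency band: divide by factor
            scaled.append(freq / factor)
        else:                                  # smooth interpolation in between
            smooth = (original_max_pos / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            scaled.append((1 - smooth) * freq / factor + smooth * freq)
    return scaled
```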

examples/export/ptq_and_trtllm_export/ptq_trtllm_llama3_8b.sh

Lines changed: 3 additions & 3 deletions
@@ -64,8 +64,8 @@ options=" \
     --tokenizer-model meta-llama/Meta-Llama-3-8B \
     --save-interval 1000000 \
     --use-dist-ckpt \
-    --load ${CHECKPOINT_LOAD_DIR}
-    --rotary-base 500000
+    --load ${CHECKPOINT_LOAD_DIR} \
+    --rotary-base 500000 \
     --fp16"
 
 # Precompile CUDA extentions

@@ -75,4 +75,4 @@ python -c "import modelopt.torch.quantization.extensions as ext; print(ext.cuda_
 launch_config="--nproc_per_node=${TP}"
 
 # Launch multi-process with torchrun
-torchrun ${launch_config} examples/inference/quantization/text_generation_ptq.py ${options} ${additional_options}
+torchrun ${launch_config} examples/export/ptq_and_trtllm_export/text_generation_ptq.py ${options} ${additional_options}

examples/export/ptq_and_trtllm_export/ptq_trtllm_minitron_8b.sh

Lines changed: 1 addition & 1 deletion
@@ -71,4 +71,4 @@ python -c "import modelopt.torch.quantization.extensions as ext; print(ext.cuda_
 launch_config="--nproc_per_node=${TP}"
 
 # Launch multi-process with torchrun
-torchrun ${launch_config} examples/inference/quantization/text_generation_ptq.py ${options} ${additional_options}
+torchrun ${launch_config} examples/export/ptq_and_trtllm_export/text_generation_ptq.py ${options} ${additional_options}

examples/export/ptq_and_trtllm_export/ptq_trtllm_mistral_12b.sh

Lines changed: 1 addition & 1 deletion
@@ -72,4 +72,4 @@ python -c "import modelopt.torch.quantization.extensions as ext; print(ext.cuda_
 launch_config="--nproc_per_node=${TP}"
 
 # Launch multi-process with torchrun
-torchrun ${launch_config} examples/inference/quantization/text_generation_ptq.py ${options} ${additional_options}
+torchrun ${launch_config} examples/export/ptq_and_trtllm_export/text_generation_ptq.py ${options} ${additional_options}

examples/export/ptq_and_trtllm_export/text_generation_ptq.py

Lines changed: 1 addition & 6 deletions
@@ -6,12 +6,11 @@
 import sys
 from pathlib import Path
 
-sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../../")))
+sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../../../../")))
 
 import modelopt.torch.quantization as mtq
 import torch
 from datasets import load_dataset
-from modelopt.torch.utils.distributed import set_data_parallel_group, set_tensor_parallel_group
 from tqdm import tqdm
 
 # [ModelOpt]: changing the default model provider to the ModelOpt version

@@ -179,10 +178,6 @@ def hf_dataset_forword_loop_func(model):
     if args.calib_dataset is not None:
         ptq_forward_loop_func = hf_dataset_forword_loop_func
 
-    # Setting data parallel and tensor parallel group
-    set_data_parallel_group(mpu.get_data_parallel_group())
-    set_tensor_parallel_group(mpu.get_tensor_model_parallel_group())
-
     if args.export_quant_cfg in QUANT_CFG_CHOICES:
         mtq_config = QUANT_CFG_CHOICES[args.export_quant_cfg]
         if "*output_layer*" not in mtq_config["quant_cfg"]:

megatron/core/models/gpt/gpt_model.py

Lines changed: 2 additions & 1 deletion
@@ -91,10 +91,11 @@ def __init__(
         # TODO: remove this dependency ?
         self.model_type = ModelType.encoder_or_decoder
 
-        # These 2 attributes are needed for TensorRT-LLM export.
+        # These 4 attributes are needed for TensorRT-LLM export.
         self.max_position_embeddings = max_sequence_length
         self.rotary_percent = rotary_percent
         self.rotary_base = rotary_base
+        self.rotary_scaling = rope_scaling
 
         if self.pre_process:
             self.embedding = LanguageModelEmbedding(
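With `rotary_scaling` stored on the model, an export path can read the full set of RoPE-related attributes straight off a `GPTModel` instance. A minimal sketch of such a consumer is below; the helper name and returned dict are illustrative, not the actual ModelOpt/TensorRT-LLM export code.

```python
# Illustrative helper only: gathers the attributes the comment above says are
# needed for TensorRT-LLM export. Field names mirror the hunk; the function
# itself is hypothetical.
def collect_rope_export_attrs(model):
    return {
        "max_position_embeddings": model.max_position_embeddings,
        "rotary_percent": model.rotary_percent,
        "rotary_base": model.rotary_base,
        # Newly exposed by this commit: falsy for plain RoPE, truthy when
        # Llama 3.1/3.2-style rope scaling is requested via --use-rope-scaling.
        "rotary_scaling": model.rotary_scaling,
    }
```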

megatron/inference/gpt/model_provider.py

Lines changed: 1 addition & 0 deletions
@@ -64,6 +64,7 @@ def model_provider(pre_process=True, post_process=True, parallel_output=True) ->
         "position_embedding_type": args.position_embedding_type,
         "rotary_percent": args.rotary_percent,
         "rotary_base": args.rotary_base,
+        "rope_scaling": args.use_rope_scaling,
     }
 
     model = model_type(**model_kwargs)
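This hunk completes the plumbing: `--use-rope-scaling` now reaches `GPTModel` as the `rope_scaling` kwarg, which the gpt_model.py hunk above stores as `self.rotary_scaling`. A self-contained sketch of that flow follows, using a stand-in parser; Megatron's real `--use-rope-scaling` flag is defined in its own argument parser and is assumed here to be a simple boolean.

```python
# Stand-in illustration of the flag-to-kwarg flow; the argparse setup is not
# Megatron's actual parser.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--use-rope-scaling", action="store_true",
                    help="Enable Llama 3.1/3.2-style RoPE scaling.")
parser.add_argument("--rotary-base", type=float, default=10000.0)
args = parser.parse_args(["--use-rope-scaling", "--rotary-base", "500000"])

model_kwargs = {
    "rotary_base": args.rotary_base,
    "rope_scaling": args.use_rope_scaling,  # the new entry from this hunk
}
print(model_kwargs)  # {'rotary_base': 500000.0, 'rope_scaling': True}
```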
