@@ -74,7 +74,7 @@ cd ../..
Now launch the PTQ + TensorRT-LLM export script,

```sh
- bash examples/inference/quantization/ptq_trtllm_minitron_8b.sh ./Minitron-8B-Base None
+ bash examples/export/ptq_and_trtllm_export/ptq_trtllm_minitron_8b.sh ./Minitron-8B-Base None
```

By default, `cnn_dailymail` is used for calibration. The `GPTModel` will have quantizers for simulating the
quantization effect. The checkpoint will be saved optionally (with quantizers as additional states) and can
@@ -104,12 +104,12 @@ export trtllm_options=" \
--checkpoint_dir /tmp/trtllm_ckpt \
--output_dir /tmp/trtllm_engine \
--max_input_len 2048 \
- --max_output_len 512 \
+ --max_seq_len 512 \
--max_batch_size 8 "

trtllm-build ${trtllm_options}

- python examples/inference/quantization/trtllm_text_generation.py --tokenizer nvidia/Minitron-8B-Base
+ python examples/export/ptq_and_trtllm_export/trtllm_text_generation.py --tokenizer nvidia/Minitron-8B-Base
```
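If the build above succeeds, a quick way to confirm the engine landed where `--output_dir` points is to list that directory before running generation (a minimal sketch, assuming a default single-GPU `trtllm-build` run; exact file names vary by TensorRT-LLM version):

```sh
# Sanity-check sketch (assumption: single-rank build); the directory should contain
# the serialized engine plus its build config before trtllm_text_generation.py is run.
ls /tmp/trtllm_engine
```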

### mistral-12B FP8 Quantization and TensorRT-LLM Deployment
@@ -139,7 +139,7 @@ huggingface-cli login
Now launch the PTQ + TensorRT-LLM checkpoint export script,

```sh
- bash examples/inference/quantization/ptq_trtllm_mistral_12b.sh ./Mistral-NeMo-12B-Base None
+ bash examples/export/ptq_and_trtllm_export/ptq_trtllm_mistral_12b.sh ./Mistral-NeMo-12B-Base None
```

Then build the TensorRT engine and run the text generation example using the newly built TensorRT engine
@@ -149,12 +149,12 @@ export trtllm_options=" \
--checkpoint_dir /tmp/trtllm_ckpt \
--output_dir /tmp/trtllm_engine \
--max_input_len 2048 \
- --max_output_len 512 \
+ --max_seq_len 512 \
--max_batch_size 8 "

trtllm-build ${trtllm_options}

- python examples/inference/quantization/trtllm_text_generation.py --tokenizer mistralai/Mistral-Nemo-Base-2407
+ python examples/export/ptq_and_trtllm_export/trtllm_text_generation.py --tokenizer mistralai/Mistral-Nemo-Base-2407
```

@@ -165,7 +165,7 @@ python examples/inference/quantization/trtllm_text_generation.py --tokenizer mis
> that we support.

```sh
- bash examples/inference/quantization/ptq_trtllm_llama_7b.sh ${CHECKPOINT_DIR}
+ bash examples/export/ptq_and_trtllm_export/ptq_trtllm_llama_7b.sh ${CHECKPOINT_DIR}
```

The script expects `${CHECKPOINT_DIR}` to have the following structure:
@@ -184,8 +184,23 @@ The script expect `${CHECKPOINT_DIR}` to have the following structure:
In short, other than the converted llama megatron checkpoint, also put the Hugging Face checkpoint inside as
the source of the tokenizer.
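One possible way to stage those Hugging Face files is sketched below; the `hf` destination folder is a hypothetical name used only for illustration, and the authoritative layout is the `${CHECKPOINT_DIR}` structure listed above:

```sh
# Illustrative only: download the Hugging Face checkpoint (tokenizer source) next to
# the converted Megatron checkpoint. The 'hf' subfolder name is an assumption; follow
# the documented ${CHECKPOINT_DIR} structure above.
huggingface-cli download meta-llama/Llama-2-7b --local-dir ${CHECKPOINT_DIR}/hf
```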

+ Then build the TensorRT engine and run the text generation example using the newly built TensorRT engine
+
+ ```sh
+ export trtllm_options=" \
+ --checkpoint_dir /tmp/trtllm_ckpt \
+ --output_dir /tmp/trtllm_engine \
+ --max_input_len 2048 \
+ --max_seq_len 512 \
+ --max_batch_size 8 "
+
+ trtllm-build ${trtllm_options}
+
+ python examples/export/ptq_and_trtllm_export/trtllm_text_generation.py --tokenizer meta-llama/Llama-2-7b
+ ```
+
### llama3-8b / llama3.1-8b INT8 SmoothQuant and TensorRT-LLM Deployment
- > **NOTE:** For llama3.1, the missing rope_scaling parameter will be fixed in modelopt-0.17 and trtllm-0.12.
+ > **NOTE:** For llama3.1, the missing rope_scaling parameter will be fixed in modelopt-0.19 and trtllm-0.13.

> **NOTE:** There are two ways to acquire the checkpoint. Users can follow
> the instruction in `docs/llama2.md` to convert the checkpoint to megatron legacy `GPTModel` format and
@@ -199,16 +214,23 @@ If users choose to download the model from NGC, first extract the sharded checkp
tar -xvf 8b_pre_trained_bf16.nemo
```

+ > **NOTE:** You need a token generated from huggingface.co/settings/tokens and access to meta-llama/Llama-3.1-8B or meta-llama/Llama-3-8B on huggingface
+
+ ```sh
+ pip install -U "huggingface_hub[cli]"
+ huggingface-cli login
+ ```
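Before moving on, it can help to confirm the login actually registered the token (a small optional check, not part of the original steps):

```sh
# Optional: prints the Hugging Face account associated with the stored token,
# which also verifies the CLI can reach the hub before the gated download.
huggingface-cli whoami
```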
+

Now launch the PTQ + TensorRT-LLM checkpoint export script for llama-3,

```sh
- bash examples/inference/quantization/ptq_trtllm_llama3_8b.sh ./llama-3-8b-nemo_v1.0 None
+ bash examples/export/ptq_and_trtllm_export/ptq_trtllm_llama3_8b.sh ./llama-3-8b-nemo_v1.0 None
```

or llama-3.1

```sh
- bash examples/inference/quantization/ptq_trtllm_llama3_1_8b.sh ./llama-3_1-8b-nemo_v1.0 None
+ bash examples/export/ptq_and_trtllm_export/ptq_trtllm_llama3_1_8b.sh ./llama-3_1-8b-nemo_v1.0 None
```

Then build the TensorRT engine and run the text generation example using the newly built TensorRT engine
@@ -218,14 +240,14 @@ export trtllm_options=" \
--checkpoint_dir /tmp/trtllm_ckpt \
--output_dir /tmp/trtllm_engine \
--max_input_len 2048 \
- --max_output_len 512 \
+ --max_seq_len 512 \
--max_batch_size 8 "

trtllm-build ${trtllm_options}

- python examples/inference/quantization/trtllm_text_generation.py --tokenizer meta-llama/Meta-Llama-3-8B
+ python examples/export/ptq_and_trtllm_export/trtllm_text_generation.py --tokenizer meta-llama/Meta-Llama-3-8B
# For llama-3

- python examples/inference/quantization/trtllm_text_generation.py --tokenizer meta-llama/Meta-Llama-3.1-8B
+ python examples/export/ptq_and_trtllm_export/trtllm_text_generation.py --tokenizer meta-llama/Meta-Llama-3.1-8B
# For llama-3.1
```