# What does this PR do?
<!--
Congratulations! You've made it this far! You're not quite done yet
though.
Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.
Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.
Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->
<!-- Remove if not applicable -->
Fixes # (issue)
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?
## Who can review?
Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.
<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @
@OlivierDehaene OR @Narsil
-->
docs/source/basic_tutorials/launcher.md (+121 -1: 121 additions, 1 deletion)
@@ -8,34 +8,52 @@ Text Generation Launcher

Usage: text-generation-launcher [OPTIONS]

Options:
```

## MODEL_ID

```shell
  --model-id <MODEL_ID>
      The name of the model to load. Can be a MODEL_ID as listed on <https://hf.co/models> like `gpt2` or `OpenAssistant/oasst-sft-1-pythia-12b`. Or it can be a local directory containing the necessary files as saved by `save_pretrained(...)` methods of transformers

      [env: MODEL_ID=]
      [default: bigscience/bloom-560m]

```
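For example, a minimal launch might look like the sketch below. The model names are the ones quoted in the description above; the local path is a placeholder for your own directory.

```shell
# Load a model from the Hugging Face Hub by its MODEL_ID
text-generation-launcher --model-id OpenAssistant/oasst-sft-1-pythia-12b

# Equivalent form using the environment variable
MODEL_ID=gpt2 text-generation-launcher

# Or point to a local directory produced by `save_pretrained(...)` (placeholder path)
text-generation-launcher --model-id /path/to/my-local-model
```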
## REVISION

```shell
  --revision <REVISION>
      The actual revision of the model if you're referring to a model on the hub. You can use a specific commit id or a branch like `refs/pr/2`

      [env: REVISION=]

```
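As an illustration, pinning a deployment to a branch or to an exact commit could look like this; the commit id shown is a placeholder.

```shell
# Pin to a pull-request branch
text-generation-launcher --model-id gpt2 --revision refs/pr/2

# Or pin to an exact commit id (placeholder value)
text-generation-launcher --model-id gpt2 --revision 0123456789abcdef0123456789abcdef01234567
```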
## VALIDATION_WORKERS

```shell
  --validation-workers <VALIDATION_WORKERS>
      The number of tokenizer workers used for payload validation and truncation inside the router

      [env: VALIDATION_WORKERS=]
      [default: 2]

```
## SHARDED

```shell
  --sharded <SHARDED>
      Whether to shard the model across multiple GPUs. By default text-generation-inference will use all available GPUs to run the model. Setting it to `false` deactivates `num_shard`

      [env: SHARDED=]
      [possible values: true, false]

```

## NUM_SHARD

```shell
  --num-shard <NUM_SHARD>
      The number of shards to use if you don't want to use all GPUs on a given machine. You can use `CUDA_VISIBLE_DEVICES=0,1 text-generation-launcher... --num_shard 2` and `CUDA_VISIBLE_DEVICES=2,3 text-generation-launcher... --num_shard 2` to launch 2 copies with 2 shards each on a given machine with 4 GPUs for instance

      [env: NUM_SHARD=]

```
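For instance, a sketch of the two modes described above: disabling sharding entirely, or splitting a single copy of the model across two GPUs.

```shell
# Run unsharded on a single GPU
text-generation-launcher --model-id bigscience/bloom-560m --sharded false

# Shard one copy of the model across two GPUs
text-generation-launcher --model-id bigscience/bloom-560m --num-shard 2
```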
## QUANTIZE

```shell
  --quantize <QUANTIZE>
      Whether you want the model to be quantized

@@ -49,53 +67,80 @@ Options:
      - bitsandbytes-nf4: Bitsandbytes 4bit. Can be applied on any model, will cut the memory requirement by 4x, but it is known that the model will be much slower to run than the native f16
      - bitsandbytes-fp4: Bitsandbytes 4bit. nf4 should be preferred in most cases but maybe this one has better perplexity performance for your model

```
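For example, loading a model with one of the 4-bit bitsandbytes modes listed above to cut memory usage:

```shell
# Quantize on the fly with bitsandbytes NF4 (roughly 4x lower memory, slower than native f16)
text-generation-launcher --model-id bigscience/bloom-560m --quantize bitsandbytes-nf4
```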
## DTYPE

```shell
  --dtype <DTYPE>
      The dtype to be forced upon the model. This option cannot be used with `--quantize`

      [env: DTYPE=]
      [possible values: float16, bfloat16]

```
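For instance, forcing bfloat16 weights; remember that, as noted above, `--dtype` cannot be combined with `--quantize`.

```shell
# Force the model to run in bfloat16
text-generation-launcher --model-id bigscience/bloom-560m --dtype bfloat16
```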
## TRUST_REMOTE_CODE

```shell
  --trust-remote-code
      Whether you want to execute hub modelling code. Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision

[...]

      The maximum amount of concurrent requests for this particular deployment. Having a low limit will refuse clients' requests instead of having them wait for too long and is usually good to handle backpressure correctly

      [env: MAX_CONCURRENT_REQUESTS=]
      [default: 128]

```
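For example, when a model ships custom modelling code, the description above encourages pinning a revision alongside `--trust-remote-code`. The model id and commit id below are placeholders.

```shell
# Execute custom code from the Hub, pinned to an audited revision (placeholder values)
text-generation-launcher \
    --model-id some-org/some-custom-model \
    --revision 0123456789abcdef0123456789abcdef01234567 \
    --trust-remote-code
```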
## MAX_BEST_OF

```shell
  --max-best-of <MAX_BEST_OF>
      This is the maximum allowed value for clients to set `best_of`. Best of makes `n` generations at the same time, and returns the best in terms of overall log probability over the entire generated sequence

      [env: MAX_BEST_OF=]
      [default: 2]

```

## MAX_STOP_SEQUENCES

```shell
  --max-stop-sequences <MAX_STOP_SEQUENCES>
      This is the maximum allowed value for clients to set `stop_sequences`. Stop sequences are used to allow the model to stop on more than just the EOS token, and enable more complex "prompting" where users can preprompt the model in a specific way and define their "own" stop token aligned with their prompt

      [env: MAX_STOP_SEQUENCES=]
      [default: 4]

```

## MAX_TOP_N_TOKENS

```shell
  --max-top-n-tokens <MAX_TOP_N_TOKENS>
      This is the maximum allowed value for clients to set `top_n_tokens`. `top_n_tokens` is used to return information about the `n` most likely tokens at each generation step, instead of just the sampled token. This information can be used for downstream tasks like classification or ranking

      [env: MAX_TOP_N_TOKENS=]
      [default: 5]

```
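As an illustration, the three request-validation limits above can be tuned together at launch time; the values below are arbitrary examples, not recommendations.

```shell
# Loosen the per-request limits enforced by the router
text-generation-launcher \
    --model-id bigscience/bloom-560m \
    --max-best-of 4 \
    --max-stop-sequences 8 \
    --max-top-n-tokens 10
```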
## MAX_INPUT_LENGTH

```shell
  --max-input-length <MAX_INPUT_LENGTH>
      This is the maximum allowed input length (expressed in number of tokens) for users. The larger this value, the longer prompts users can send, which can impact the overall memory required to handle the load. Please note that some models have a finite range of sequence lengths they can handle

      [env: MAX_INPUT_LENGTH=]
      [default: 1024]

```

## MAX_TOTAL_TOKENS

```shell
  --max-total-tokens <MAX_TOTAL_TOKENS>
      This is the most important value to set as it defines the "memory budget" of running client requests. Clients will send input sequences and ask to generate `max_new_tokens` on top. With a value of `1512`, users can send either a prompt of `1000` and ask for `512` new tokens, or send a prompt of `1` and ask for `1511` max_new_tokens. The larger this value, the larger each request will be in your RAM and the less effective batching can be

      [env: MAX_TOTAL_TOKENS=]
      [default: 2048]

```
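For example, to reproduce the budget described above (prompts of up to 1000 tokens, and at most 1512 tokens per request in total), the two limits could be set together; the numbers are simply the ones from the description.

```shell
# A request may use at most 1512 tokens in total, of which at most 1000 may be prompt tokens
text-generation-launcher \
    --model-id bigscience/bloom-560m \
    --max-input-length 1000 \
    --max-total-tokens 1512
```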
## WAITING_SERVED_RATIO

```shell
  --waiting-served-ratio <WAITING_SERVED_RATIO>
      This represents the ratio of waiting queries vs running queries where you want to start considering pausing the running queries to include the waiting ones into the same batch. `waiting_served_ratio=1.2` means that when 12 queries are waiting and there are only 10 queries left in the current batch, we check if we can fit those 12 waiting queries into the batching strategy, and if so, batching happens, delaying the 10 running queries by a `prefill` run.

[...]

      Limits the number of tokens for the prefill operation. Since this operation takes the most memory and is compute bound, it is interesting to limit the number of requests that can be sent

      [env: MAX_BATCH_PREFILL_TOKENS=]
      [default: 4096]

```

## MAX_BATCH_TOTAL_TOKENS

```shell
  --max-batch-total-tokens <MAX_BATCH_TOTAL_TOKENS>
      **IMPORTANT** This is one critical control to allow maximum usage of the available hardware.

@@ -123,6 +174,9 @@

      [env: MAX_BATCH_TOTAL_TOKENS=]

```
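As a sketch of how these batching budgets compose, the launcher could be given an explicit prefill budget and an overall batch budget; the numbers below are illustrative and depend entirely on your hardware.

```shell
# Cap each prefill pass at 4096 tokens and the whole running batch at 32000 tokens
text-generation-launcher \
    --model-id bigscience/bloom-560m \
    --max-batch-prefill-tokens 4096 \
    --max-batch-total-tokens 32000
```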
## MAX_WAITING_TOKENS

```shell
  --max-waiting-tokens <MAX_WAITING_TOKENS>
      This setting defines how many tokens can be passed before forcing the waiting queries to be put on the batch (if the size of the batch allows for it). New queries require 1 `prefill` forward, which is different from `decode` and therefore you need to pause the running batch in order to run `prefill` to create the correct values for the waiting queries to be able to join the batch.

@@ -135,57 +189,87 @@ Options:

      [env: MAX_WAITING_TOKENS=]
      [default: 20]

```
## HOSTNAME

```shell
  --hostname <HOSTNAME>
      The IP address to listen on

      [env: HOSTNAME=]
      [default: 0.0.0.0]

```

## PORT

```shell
  -p, --port <PORT>
      The port to listen on

      [env: PORT=]
      [default: 3000]

```
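For example, binding the webserver to a non-default port, either via the flag or via the environment variable:

```shell
# Listen on port 8080 instead of the default 3000
text-generation-launcher --model-id bigscience/bloom-560m --port 8080

# Same thing using the environment variable
PORT=8080 text-generation-launcher --model-id bigscience/bloom-560m
```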
## SHARD_UDS_PATH

```shell
  --shard-uds-path <SHARD_UDS_PATH>
      The name of the socket for gRPC communication between the webserver and the shards

      [env: SHARD_UDS_PATH=]
      [default: /tmp/text-generation-server]

```

## MASTER_ADDR

```shell
  --master-addr <MASTER_ADDR>
      The address the master shard will listen on. (setting used by torch distributed)

      [env: MASTER_ADDR=]
      [default: localhost]

```

## MASTER_PORT

```shell
  --master-port <MASTER_PORT>
      The port the master shard will listen on. (setting used by torch distributed)

      [env: MASTER_PORT=]
      [default: 29500]

```
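As a sketch building on the `CUDA_VISIBLE_DEVICES` example from the NUM_SHARD description, two independent launchers on the same machine would presumably need distinct HTTP ports, shard socket paths, and torch-distributed master ports so they do not collide; all values below are arbitrary.

```shell
# First instance on GPUs 0-1
CUDA_VISIBLE_DEVICES=0,1 text-generation-launcher \
    --model-id bigscience/bloom-560m --num-shard 2 \
    --port 3000 --shard-uds-path /tmp/tgi-instance-0 --master-port 29500

# Second instance on GPUs 2-3, with its own port, socket path and master port
CUDA_VISIBLE_DEVICES=2,3 text-generation-launcher \
    --model-id bigscience/bloom-560m --num-shard 2 \
    --port 3001 --shard-uds-path /tmp/tgi-instance-1 --master-port 29501
```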
## HUGGINGFACE_HUB_CACHE

```shell
  --huggingface-hub-cache <HUGGINGFACE_HUB_CACHE>
      The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk for instance

      [env: HUGGINGFACE_HUB_CACHE=]

```

## WEIGHTS_CACHE_OVERRIDE

```shell
  --weights-cache-override <WEIGHTS_CACHE_OVERRIDE>
      The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk for instance

      [env: WEIGHTS_CACHE_OVERRIDE=]

```
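For instance, pointing the hub cache at a mounted data disk; the `/data` path is a placeholder for whatever volume you mount.

```shell
# Download and cache model weights on a dedicated disk mounted at /data
text-generation-launcher --model-id bigscience/bloom-560m --huggingface-hub-cache /data

# Equivalent form using the environment variable
HUGGINGFACE_HUB_CACHE=/data text-generation-launcher --model-id bigscience/bloom-560m
```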
## DISABLE_CUSTOM_KERNELS

```shell
  --disable-custom-kernels
      For some models (like bloom), text-generation-inference implemented custom cuda kernels to speed up inference. Those kernels were only tested on A100. Use this flag to disable them if you're running on different hardware and encounter issues

      [env: DISABLE_CUSTOM_KERNELS=]

```

## CUDA_MEMORY_FRACTION

```shell
  --cuda-memory-fraction <CUDA_MEMORY_FRACTION>
      Limit the CUDA available memory. The allowed value equals the total visible memory multiplied by cuda-memory-fraction

      [env: CUDA_MEMORY_FRACTION=]
      [default: 1.0]

```
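For example, to leave roughly half of each GPU's memory free for another process; 0.5 is an arbitrary illustrative value.

```shell
# Allow text-generation-inference to use at most 50% of the visible CUDA memory
text-generation-launcher --model-id bigscience/bloom-560m --cuda-memory-fraction 0.5
```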
## ROPE_SCALING

```shell
  --rope-scaling <ROPE_SCALING>
      Rope scaling will only be used for RoPE models and allows rescaling the position rotary to accommodate larger prompts.

@@ -198,50 +282,86 @@ Options:

      [env: ROPE_SCALING=]
      [possible values: linear, dynamic]

```

## ROPE_FACTOR

```shell
  --rope-factor <ROPE_FACTOR>
      Rope scaling will only be used for RoPE models. See `rope_scaling`

      [env: ROPE_FACTOR=]

```
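As an illustration, enabling dynamic RoPE scaling with a factor of 2; the model id is a placeholder for a RoPE-based model and the factor is an example value, not a recommendation.

```shell
# Rescale rotary position embeddings to accommodate longer prompts (RoPE models only)
text-generation-launcher --model-id some-org/some-rope-model --rope-scaling dynamic --rope-factor 2.0
```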
## JSON_OUTPUT

```shell
  --json-output
      Outputs the logs in JSON format (useful for telemetry)

      [env: JSON_OUTPUT=]

```

## OTLP_ENDPOINT

```shell
  --otlp-endpoint <OTLP_ENDPOINT>
      [env: OTLP_ENDPOINT=]

```

## CORS_ALLOW_ORIGIN

```shell
  --cors-allow-origin <CORS_ALLOW_ORIGIN>
      [env: CORS_ALLOW_ORIGIN=]

```
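For example, a deployment that emits JSON logs, exports traces, and lets a browser front-end call it directly might combine the three options above; the collector URL and origin are placeholders, not values prescribed by the launcher.

```shell
# JSON logs, OpenTelemetry export, and CORS for a specific web origin (placeholder values)
text-generation-launcher \
    --model-id bigscience/bloom-560m \
    --json-output \
    --otlp-endpoint http://otel-collector:4317 \
    --cors-allow-origin https://my-frontend.example.com
```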
## WATERMARK_GAMMA

```shell
  --watermark-gamma <WATERMARK_GAMMA>
      [env: WATERMARK_GAMMA=]

```

## WATERMARK_DELTA

```shell
  --watermark-delta <WATERMARK_DELTA>
      [env: WATERMARK_DELTA=]

```
## NGROK

```shell
  --ngrok
      Enable ngrok tunneling

      [env: NGROK=]

```

## NGROK_AUTHTOKEN

```shell
  --ngrok-authtoken <NGROK_AUTHTOKEN>
      ngrok authentication token

      [env: NGROK_AUTHTOKEN=]

```

## NGROK_EDGE

```shell
  --ngrok-edge <NGROK_EDGE>
      ngrok edge

      [env: NGROK_EDGE=]

```
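As a sketch, exposing the server through an ngrok tunnel combines the three flags above; the auth token and edge identifier are placeholders you would obtain from your own ngrok account.

```shell
# Tunnel the server through ngrok (placeholder credentials)
text-generation-launcher \
    --model-id bigscience/bloom-560m \
    --ngrok \
    --ngrok-authtoken "$NGROK_AUTHTOKEN" \
    --ngrok-edge my-ngrok-edge
```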
## ENV

```shell
  -e, --env
      Display a lot of information about your runtime environment
```
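For instance, printing the runtime environment report, which can be handy to attach to bug reports:

```shell
# Print detailed information about the runtime environment
text-generation-launcher --env
```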