
Commit 8ec1b87

Adding titles to CLI doc. (huggingface#1094)
# What does this PR do?

<!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. -->

<!-- Remove if not applicable -->

Fixes # (issue)

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?

## Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @
@OlivierDehaene OR @Narsil -->
1 parent b4f68c3 commit 8ec1b87

File tree

2 files changed: 148 additions & 2 deletions


docs/source/basic_tutorials/launcher.md

Lines changed: 121 additions & 1 deletion
@@ -8,34 +8,52 @@ Text Generation Launcher
Usage: text-generation-launcher [OPTIONS]

Options:
+```
+## MODEL_ID
+```shell
  --model-id <MODEL_ID>
      The name of the model to load. Can be a MODEL_ID as listed on <https://hf.co/models> like `gpt2` or `OpenAssistant/oasst-sft-1-pythia-12b`. Or it can be a local directory containing the necessary files as saved by `save_pretrained(...)` methods of transformers

      [env: MODEL_ID=]
      [default: bigscience/bloom-560m]

+```
+## REVISION
+```shell
  --revision <REVISION>
      The actual revision of the model if you're referring to a model on the hub. You can use a specific commit id or a branch like `refs/pr/2`

      [env: REVISION=]

+```
+## VALIDATION_WORKERS
+```shell
  --validation-workers <VALIDATION_WORKERS>
      The number of tokenizer workers used for payload validation and truncation inside the router

      [env: VALIDATION_WORKERS=]
      [default: 2]

+```
+## SHARDED
+```shell
  --sharded <SHARDED>
      Whether to shard the model across multiple GPUs By default text-generation-inference will use all available GPUs to run the model. Setting it to `false` deactivates `num_shard`

      [env: SHARDED=]
      [possible values: true, false]

+```
+## NUM_SHARD
+```shell
  --num-shard <NUM_SHARD>
      The number of shards to use if you don't want to use all GPUs on a given machine. You can use `CUDA_VISIBLE_DEVICES=0,1 text-generation-launcher... --num_shard 2` and `CUDA_VISIBLE_DEVICES=2,3 text-generation-launcher... --num_shard 2` to launch 2 copies with 2 shard each on a given machine with 4 GPUs for instance

      [env: NUM_SHARD=]

+```
+## QUANTIZE
+```shell
  --quantize <QUANTIZE>
      Whether you want the model to be quantized

@@ -49,53 +67,80 @@ Options:
      - bitsandbytes-nf4: Bitsandbytes 4bit. Can be applied on any model, will cut the memory requirement by 4x, but it is known that the model will be much slower to run than the native f16
      - bitsandbytes-fp4: Bitsandbytes 4bit. nf4 should be preferred in most cases but maybe this one has better perplexity performance for you model

+```
+## DTYPE
+```shell
  --dtype <DTYPE>
      The dtype to be forced upon the model. This option cannot be used with `--quantize`

      [env: DTYPE=]
      [possible values: float16, bfloat16]

+```
+## TRUST_REMOTE_CODE
+```shell
  --trust-remote-code
      Whether you want to execute hub modelling code. Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision

      [env: TRUST_REMOTE_CODE=]

+```
+## MAX_CONCURRENT_REQUESTS
+```shell
  --max-concurrent-requests <MAX_CONCURRENT_REQUESTS>
      The maximum amount of concurrent requests for this particular deployment. Having a low limit will refuse clients requests instead of having them wait for too long and is usually good to handle backpressure correctly

      [env: MAX_CONCURRENT_REQUESTS=]
      [default: 128]

+```
+## MAX_BEST_OF
+```shell
  --max-best-of <MAX_BEST_OF>
      This is the maximum allowed value for clients to set `best_of`. Best of makes `n` generations at the same time, and return the best in terms of overall log probability over the entire generated sequence

      [env: MAX_BEST_OF=]
      [default: 2]

+```
+## MAX_STOP_SEQUENCES
+```shell
  --max-stop-sequences <MAX_STOP_SEQUENCES>
      This is the maximum allowed value for clients to set `stop_sequences`. Stop sequences are used to allow the model to stop on more than just the EOS token, and enable more complex "prompting" where users can preprompt the model in a specific way and define their "own" stop token aligned with their prompt

      [env: MAX_STOP_SEQUENCES=]
      [default: 4]

+```
+## MAX_TOP_N_TOKENS
+```shell
  --max-top-n-tokens <MAX_TOP_N_TOKENS>
      This is the maximum allowed value for clients to set `top_n_tokens`. `top_n_tokens is used to return information about the the `n` most likely tokens at each generation step, instead of just the sampled token. This information can be used for downstream tasks like for classification or ranking

      [env: MAX_TOP_N_TOKENS=]
      [default: 5]

+```
+## MAX_INPUT_LENGTH
+```shell
  --max-input-length <MAX_INPUT_LENGTH>
      This is the maximum allowed input length (expressed in number of tokens) for users. The larger this value, the longer prompt users can send which can impact the overall memory required to handle the load. Please note that some models have a finite range of sequence they can handle

      [env: MAX_INPUT_LENGTH=]
      [default: 1024]

+```
+## MAX_TOTAL_TOKENS
+```shell
  --max-total-tokens <MAX_TOTAL_TOKENS>
      This is the most important value to set as it defines the "memory budget" of running clients requests. Clients will send input sequences and ask to generate `max_new_tokens` on top. with a value of `1512` users can send either a prompt of `1000` and ask for `512` new tokens, or send a prompt of `1` and ask for `1511` max_new_tokens. The larger this value, the larger amount each request will be in your RAM and the less effective batching can be

      [env: MAX_TOTAL_TOKENS=]
      [default: 2048]

+```
+## WAITING_SERVED_RATIO
+```shell
  --waiting-served-ratio <WAITING_SERVED_RATIO>
      This represents the ratio of waiting queries vs running queries where you want to start considering pausing the running queries to include the waiting ones into the same batch. `waiting_served_ratio=1.2` Means when 12 queries are waiting and there's only 10 queries left in the current batch we check if we can fit those 12 waiting queries into the batching strategy, and if yes, then batching happens delaying the 10 running queries by a `prefill` run.

@@ -104,12 +149,18 @@ Options:
      [env: WAITING_SERVED_RATIO=]
      [default: 1.2]

+```
+## MAX_BATCH_PREFILL_TOKENS
+```shell
  --max-batch-prefill-tokens <MAX_BATCH_PREFILL_TOKENS>
      Limits the number of tokens for the prefill operation. Since this operation take the most memory and is compute bound, it is interesting to limit the number of requests that can be sent

      [env: MAX_BATCH_PREFILL_TOKENS=]
      [default: 4096]

+```
+## MAX_BATCH_TOTAL_TOKENS
+```shell
  --max-batch-total-tokens <MAX_BATCH_TOTAL_TOKENS>
      **IMPORTANT** This is one critical control to allow maximum usage of the available hardware.

@@ -123,6 +174,9 @@ Options:

      [env: MAX_BATCH_TOTAL_TOKENS=]

+```
+## MAX_WAITING_TOKENS
+```shell
  --max-waiting-tokens <MAX_WAITING_TOKENS>
      This setting defines how many tokens can be passed before forcing the waiting queries to be put on the batch (if the size of the batch allows for it). New queries require 1 `prefill` forward, which is different from `decode` and therefore you need to pause the running batch in order to run `prefill` to create the correct values for the waiting queries to be able to join the batch.

@@ -135,57 +189,87 @@ Options:
      [env: MAX_WAITING_TOKENS=]
      [default: 20]

+```
+## HOSTNAME
+```shell
  --hostname <HOSTNAME>
      The IP address to listen on

      [env: HOSTNAME=]
      [default: 0.0.0.0]

+```
+## PORT
+```shell
  -p, --port <PORT>
      The port to listen on

      [env: PORT=]
      [default: 3000]

+```
+## SHARD_UDS_PATH
+```shell
  --shard-uds-path <SHARD_UDS_PATH>
      The name of the socket for gRPC communication between the webserver and the shards

      [env: SHARD_UDS_PATH=]
      [default: /tmp/text-generation-server]

+```
+## MASTER_ADDR
+```shell
  --master-addr <MASTER_ADDR>
      The address the master shard will listen on. (setting used by torch distributed)

      [env: MASTER_ADDR=]
      [default: localhost]

+```
+## MASTER_PORT
+```shell
  --master-port <MASTER_PORT>
      The address the master port will listen on. (setting used by torch distributed)

      [env: MASTER_PORT=]
      [default: 29500]

+```
+## HUGGINGFACE_HUB_CACHE
+```shell
  --huggingface-hub-cache <HUGGINGFACE_HUB_CACHE>
      The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk for instance

      [env: HUGGINGFACE_HUB_CACHE=]

+```
+## WEIGHTS_CACHE_OVERRIDE
+```shell
  --weights-cache-override <WEIGHTS_CACHE_OVERRIDE>
      The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk for instance

      [env: WEIGHTS_CACHE_OVERRIDE=]

+```
+## DISABLE_CUSTOM_KERNELS
+```shell
  --disable-custom-kernels
      For some models (like bloom), text-generation-inference implemented custom cuda kernels to speed up inference. Those kernels were only tested on A100. Use this flag to disable them if you're running on different hardware and encounter issues

      [env: DISABLE_CUSTOM_KERNELS=]

+```
+## CUDA_MEMORY_FRACTION
+```shell
  --cuda-memory-fraction <CUDA_MEMORY_FRACTION>
      Limit the CUDA available memory. The allowed value equals the total visible memory multiplied by cuda-memory-fraction

      [env: CUDA_MEMORY_FRACTION=]
      [default: 1.0]

+```
+## ROPE_SCALING
+```shell
  --rope-scaling <ROPE_SCALING>
      Rope scaling will only be used for RoPE models and allow rescaling the position rotary to accomodate for larger prompts.

@@ -198,50 +282,86 @@ Options:
      [env: ROPE_SCALING=]
      [possible values: linear, dynamic]

+```
+## ROPE_FACTOR
+```shell
  --rope-factor <ROPE_FACTOR>
      Rope scaling will only be used for RoPE models See `rope_scaling`

      [env: ROPE_FACTOR=]

+```
+## JSON_OUTPUT
+```shell
  --json-output
      Outputs the logs in JSON format (useful for telemetry)

      [env: JSON_OUTPUT=]

+```
+## OTLP_ENDPOINT
+```shell
  --otlp-endpoint <OTLP_ENDPOINT>
      [env: OTLP_ENDPOINT=]

+```
+## CORS_ALLOW_ORIGIN
+```shell
  --cors-allow-origin <CORS_ALLOW_ORIGIN>
      [env: CORS_ALLOW_ORIGIN=]

+```
+## WATERMARK_GAMMA
+```shell
  --watermark-gamma <WATERMARK_GAMMA>
      [env: WATERMARK_GAMMA=]

+```
+## WATERMARK_DELTA
+```shell
  --watermark-delta <WATERMARK_DELTA>
      [env: WATERMARK_DELTA=]

+```
+## NGROK
+```shell
  --ngrok
      Enable ngrok tunneling

      [env: NGROK=]

+```
+## NGROK_AUTHTOKEN
+```shell
  --ngrok-authtoken <NGROK_AUTHTOKEN>
      ngrok authentication token

      [env: NGROK_AUTHTOKEN=]

+```
+## NGROK_EDGE
+```shell
  --ngrok-edge <NGROK_EDGE>
      ngrok edge

      [env: NGROK_EDGE=]

+```
+## ENV
+```shell
  -e, --env
      Display a lot of information about your runtime environment

+```
+## HELP
+```shell
  -h, --help
      Print help (see a summary with '-h')

+```
+## VERSION
+```shell
  -V, --version
      Print version

-```
+```
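
Each `##` title added above is derived mechanically from the option line that follows it: when the flag line carries a `<PLACEHOLDER>`, the placeholder text becomes the title; otherwise the long flag name is upper-cased and its dashes become underscores, so `-h, --help` yields `HELP` and `--json-output` yields `JSON_OUTPUT`. The following is a minimal standalone sketch of that naming rule; the `title_for` helper is illustrative and not part of this commit, though it restates the rule the new loop in `update_doc.py` below applies.

```python
def title_for(option_line: str) -> str:
    """Derive a section title from a clap-style option line.

    Mirrors the naming seen in the regenerated launcher.md above:
    `--model-id <MODEL_ID>` -> "MODEL_ID", `-h, --help` -> "HELP".
    """
    tokens = option_line.split("<")
    if len(tokens) > 1:
        # Take the text inside the trailing <...> placeholder.
        header = tokens[-1][:-1]
    else:
        # No placeholder: fall back to the long flag name.
        header = option_line.split("--")[-1]
    return header.upper().replace("-", "_")


assert title_for("      --model-id <MODEL_ID>") == "MODEL_ID"
assert title_for("  -h, --help") == "HELP"
assert title_for("  --json-output") == "JSON_OUTPUT"
```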

update_doc.py

Lines changed: 27 additions & 1 deletion
@@ -11,8 +11,34 @@ def main():
    output = subprocess.check_output(["text-generation-launcher", "--help"]).decode(
        "utf-8"
    )
+
    wrap_code_blocks_flag = "<!-- WRAP CODE BLOCKS -->"
-    final_doc = f"# Text-generation-launcher arguments\n\n{wrap_code_blocks_flag}\n\n```shell\n{output}\n```"
+    final_doc = f"# Text-generation-launcher arguments\n\n{wrap_code_blocks_flag}\n\n"
+
+    lines = output.split("\n")
+
+    header = ""
+    block = []
+    for line in lines:
+        if line.startswith("  -") or line.startswith("      -"):
+            rendered_block = '\n'.join(block)
+            if header:
+                final_doc += f"## {header}\n```shell\n{rendered_block}\n```\n"
+            else:
+                final_doc += f"```shell\n{rendered_block}\n```\n"
+            block = []
+            tokens = line.split("<")
+            if len(tokens) > 1:
+                header = tokens[-1][:-1]
+            else:
+                header = line.split("--")[-1]
+            header = header.upper().replace("-", "_")
+
+        block.append(line)
+
+    rendered_block = '\n'.join(block)
+    final_doc += f"## {header}\n```shell\n{rendered_block}\n```\n"
+    block = []

    filename = "docs/source/basic_tutorials/launcher.md"
    if args.check:
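
To see how the new loop slices the `--help` output into the titled sections shown in `launcher.md` above, here is a self-contained rerun of the same splitting logic on a tiny made-up help snippet. `SAMPLE_HELP` and its contents are illustrative stand-ins for the real `text-generation-launcher --help` output, and the two- and six-space option indents are an assumption about clap's help layout rather than something stated in this diff.

```python
# Illustrative rerun of the section-splitting logic added in update_doc.py.
# SAMPLE_HELP is a made-up stand-in for `text-generation-launcher --help`.
SAMPLE_HELP = """Text Generation Launcher

Usage: text-generation-launcher [OPTIONS]

Options:
      --model-id <MODEL_ID>
          The name of the model to load

          [env: MODEL_ID=]
          [default: bigscience/bloom-560m]

  -h, --help
          Print help (see a summary with '-h')
"""

final_doc = "# Text-generation-launcher arguments\n\n<!-- WRAP CODE BLOCKS -->\n\n"

header = ""
block = []
for line in SAMPLE_HELP.split("\n"):
    # An option line (2-space short-flag indent or 6-space long-flag indent)
    # closes the block collected so far and opens a new titled section.
    if line.startswith("  -") or line.startswith("      -"):
        rendered_block = "\n".join(block)
        if header:
            final_doc += f"## {header}\n```shell\n{rendered_block}\n```\n"
        else:
            # The preamble before the first option gets no title.
            final_doc += f"```shell\n{rendered_block}\n```\n"
        block = []
        # Title rule: text inside <...> if present, else the long flag name.
        tokens = line.split("<")
        if len(tokens) > 1:
            header = tokens[-1][:-1]
        else:
            header = line.split("--")[-1]
        header = header.upper().replace("-", "_")
    block.append(line)

# Flush the final section.
final_doc += f"## {header}\n```shell\n" + "\n".join(block) + "\n```\n"
print(final_doc)
```

Printing `final_doc` yields an untitled preamble block followed by `## MODEL_ID` and `## HELP` sections, mirroring the structure of the regenerated `launcher.md` shown in the first file of this commit.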
