### Describe the issue as clearly as possible:

I have some issues with vllm + outlines: performance seems way worse with the new outlines version. Could you help me? I am running on an H100 and observe that GPU compute utilization is much lower with the new version.

Note that I use llama 3.2 1B here, but the difference increases with larger models: in a private use case with llama 3.3 70B, the old outlines reaches ~800 output tokens per second while the new outlines reaches ~70.

New outlines: vllm v0.6.5 + outlines 0.1.8 (see the processed-prompts speed in the logs below)
WARNING 12-30 13:46:30 cuda.py:32] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
INFO 12-30 13:46:43 config.py:478] This model supports multiple tasks: {'score', 'reward', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
WARNING 12-30 13:46:43 arg_utils.py:1086] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 12-30 13:46:43 config.py:1364] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 12-30 13:46:43 llm_engine.py:249] Initializing an LLM engine (v0.6.5) with config: model='/data/models/llama-v3.2-1b-it/', speculative_config=None, tokenizer='/data/models/llama-v3.2-1b-it/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/models/llama-v3.2-1b-it/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 12-30 13:46:45 selector.py:120] Using Flash Attention backend.
INFO 12-30 13:46:48 model_runner.py:1092] Starting to load model /data/models/llama-v3.2-1b-it/...
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.80it/s]
INFO 12-30 13:46:49 model_runner.py:1097] Loading model weights took 2.3185 GB
INFO 12-30 13:46:49 worker.py:241] Memory profiling takes 0.46 seconds
INFO 12-30 13:46:49 worker.py:241] the current vLLM instance can use total_gpu_memory (79.10GiB) x gpu_memory_utilization (0.90) = 71.19GiB
INFO 12-30 13:46:49 worker.py:241] model weights take 2.32GiB; non_torch_memory takes 0.17GiB; PyTorch activation peak memory takes 1.20GiB; the rest of the memory reserved for KV Cache is 67.50GiB.
INFO 12-30 13:46:50 gpu_executor.py:76] # GPU blocks: 138231, # CPU blocks: 8192
INFO 12-30 13:46:50 gpu_executor.py:80] Maximum concurrency for 131072 tokens per request: 16.87x
INFO 12-30 13:46:51 model_runner.py:1413] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-30 13:46:51 model_runner.py:1417] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 12-30 13:47:01 model_runner.py:1527] Graph capturing finished in 10 secs, took 0.21 GiB
INFO 12-30 13:47:01 llm_engine.py:446] init engine (profile, create kv cache, warmup model) took 12.39 seconds
Processed prompts: 100%|██████████| 100/100 [01:34<00:00, 1.06it/s, est. speed input: 81.68 toks/s, output: 104.59 toks/s]
Old outlines: vllm v0.6.1 + outlines 0.0.46
WARNING 12-30 13:48:29 arg_utils.py:930] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:29 config.py:1010] Chunked prefill is enabled with max_num_batched_tokens=512.
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:29 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/data/models/llama-v3.2-1b-it/', speculative_config=None, tokenizer='/data/models/llama-v3.2-1b-it/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/models/llama-v3.2-1b-it/, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:33 model_runner.py:1014] Starting to load model /data/models/llama-v3.2-1b-it/...
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.98it/s]
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:34 model_runner.py:1025] Loading model weights took 2.3185 GB
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:34 gpu_executor.py:122] # GPU blocks: 138166, # CPU blocks: 8192
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:36 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:36 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:46 model_runner.py:1456] Graph capturing finished in 10 secs.
Compiling FSM index for all state transitions: 100%|██████████| 47/47 [00:02<00:00, 21.38it/s]
Processed prompts: 100%|██████████| 100/100 [00:01<00:00, 94.83it/s, est. speed input: 6828.35 toks/s, output: 3900.61 toks/s]
### Steps/code to reproduce the bug:

New outlines

```python
"""Example of integrating `outlines` with `vllm`."""
import vllm
from pydantic import BaseModel
from transformers import AutoTokenizer
from outlines.models.vllm import adapt_tokenizer
from outlines.processors import JSONLogitsProcessor

class Person(BaseModel):
    name: str
    description: str

MODEL_ID = "/data/models/llama-v3.2-1b-it/"
llm = vllm.LLM(model=MODEL_ID)
tokenizer = adapt_tokenizer(AutoTokenizer.from_pretrained(MODEL_ID))
logits_processor = JSONLogitsProcessor(schema=Person, tokenizer=tokenizer)

result = llm.generate(
    ["""<s>[INST] <<SYS>> You are a json text extractor. return the following json {"name": "the game name", "description": "description of the game in around 400 words"} <</SYS>> { CD Projekt Red is ramping up production on The Witcher 4, and of course it's looking into using AI } [/INST]"""] * 100,
    sampling_params=vllm.SamplingParams(
        temperature=0.6,
        max_tokens=1024,
        logits_processors=[logits_processor],
    ),
)
print(result)
```
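The speeds quoted above are vLLM's own tqdm estimates ("est. speed ... output toks/s"). If useful, here is a minimal sketch for cross-checking output-token throughput directly from the returned `RequestOutput` objects; `prompts` is just a stand-in name for the same list of 100 prompts used above:

```python
# Sketch only: time the generate call and count generated tokens directly,
# instead of reading tqdm's "est. speed" line. `prompts` is assumed to be the
# same ["<s>[INST] ..."] * 100 list from the snippet above.
import time

start = time.perf_counter()
result = llm.generate(
    prompts,
    sampling_params=vllm.SamplingParams(
        temperature=0.6,
        max_tokens=1024,
        logits_processors=[logits_processor],
    ),
)
elapsed = time.perf_counter() - start

# Each RequestOutput carries its generated sequences in `.outputs`;
# `token_ids` is the list of generated token ids for that sequence.
total_output_tokens = sum(len(r.outputs[0].token_ids) for r in result)
print(f"output speed: {total_output_tokens / elapsed:.1f} toks/s")
```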
Old outlines

```python
import ray
import vllm
from pydantic import BaseModel
from outlines.integrations.vllm import JSONLogitsProcessor

class Person(BaseModel):
    name: str
    description: str

MODEL_ID = "/data/models/llama-v3.2-1b-it/"
llm = vllm.LLM(model=MODEL_ID)
logits_processor = JSONLogitsProcessor(schema=Person, llm=llm)

result = llm.generate(
    ["""<s>[INST] <<SYS>> You are a json text extractor. return the following json {"name": "the game name", "description": "description of the game"} <</SYS>> { CD Projekt Red is ramping up production on The Witcher 4, and of course it's looking into using AI } [/INST]"""] * 100,
    sampling_params=vllm.SamplingParams(
        temperature=0.6,
        max_tokens=1024,
        logits_processors=[logits_processor],
    ),
)
print(result)
```
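The only intended difference between the two snippets is how the logits processor is built: the old `outlines.integrations.vllm.JSONLogitsProcessor` is constructed directly from the `llm` object, while the new `outlines.processors.JSONLogitsProcessor` takes a tokenizer wrapped with `adapt_tokenizer`. The model, sampling parameters, and prompts are otherwise the same (apart from the "in around 400 words" wording in the new-outlines prompt).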
### Expected result:

A similar speed for both versions, which is what I get if I remove the outlines call (`logits_processors=[logits_processor]`).
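For reference, that baseline run is just the same `generate` call without the outlines processor; a rough sketch (again with `prompts` standing in for the list of 100 prompts):

```python
# Baseline (no constrained decoding): same prompts and sampling settings,
# just without the outlines logits processor, so vLLM decodes unconstrained.
result = llm.generate(
    prompts,
    sampling_params=vllm.SamplingParams(
        temperature=0.6,
        max_tokens=1024,
    ),
)
```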
### Error message:

No response

### Outlines/Python version information:

vllm v0.6.5 with outlines 0.1.8, and vllm v0.6.1 with outlines 0.0.46; see the logs above.

### Context for the issue:

No response