### Describe the issue as clearly as possible:

I have some issues with vllm + outlines: performance seems way worse with the new outlines version. Could you help me? I am running on an H100 and observe that GPU compute utilization is much lower with the new version.

Note that I use llama 3.2 1B here, but the difference increases with larger models: in a private use case with llama 3.3 70B, the old outlines reaches ~800 output tokens per second while the new outlines reaches ~70.

New outlines: vllm v0.6.5 + outlines 0.1.8 (see the processed-prompts speed in the logs below)
WARNING 12-30 13:46:30 cuda.py:32] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
INFO 12-30 13:46:43 config.py:478] This model supports multiple tasks: {'score', 'reward', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
WARNING 12-30 13:46:43 arg_utils.py:1086] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 12-30 13:46:43 config.py:1364] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 12-30 13:46:43 llm_engine.py:249] Initializing an LLM engine (v0.6.5) with config: model='/data/models/llama-v3.2-1b-it/', speculative_config=None, tokenizer='/data/models/llama-v3.2-1b-it/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/models/llama-v3.2-1b-it/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 12-30 13:46:45 selector.py:120] Using Flash Attention backend.
INFO 12-30 13:46:48 model_runner.py:1092] Starting to load model /data/models/llama-v3.2-1b-it/...
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.80it/s]
INFO 12-30 13:46:49 model_runner.py:1097] Loading model weights took 2.3185 GB
INFO 12-30 13:46:49 worker.py:241] Memory profiling takes 0.46 seconds
INFO 12-30 13:46:49 worker.py:241] the current vLLM instance can use total_gpu_memory (79.10GiB) x gpu_memory_utilization (0.90) = 71.19GiB
INFO 12-30 13:46:49 worker.py:241] model weights take 2.32GiB; non_torch_memory takes 0.17GiB; PyTorch activation peak memory takes 1.20GiB; the rest of the memory reserved for KV Cache is 67.50GiB.
INFO 12-30 13:46:50 gpu_executor.py:76] # GPU blocks: 138231, # CPU blocks: 8192
INFO 12-30 13:46:50 gpu_executor.py:80] Maximum concurrency for 131072 tokens per request: 16.87x
INFO 12-30 13:46:51 model_runner.py:1413] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-30 13:46:51 model_runner.py:1417] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 12-30 13:47:01 model_runner.py:1527] Graph capturing finished in 10 secs, took 0.21 GiB
INFO 12-30 13:47:01 llm_engine.py:446] init engine (profile, create kv cache, warmup model) took 12.39 seconds
Processed prompts: 100%|██████████| 100/100 [01:34<00:00, 1.06it/s, est. speed input: 81.68 toks/s, output: 104.59 toks/s]
Old outlines: vllm v0.6.1 + outlines 0.0.46
WARNING 12-30 13:48:29 arg_utils.py:930] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:29 config.py:1010] Chunked prefill is enabled with max_num_batched_tokens=512.
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:29 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/data/models/llama-v3.2-1b-it/', speculative_config=None, tokenizer='/data/models/llama-v3.2-1b-it/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/models/llama-v3.2-1b-it/, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:33 model_runner.py:1014] Starting to load model /data/models/llama-v3.2-1b-it/...
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.98it/s]
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:34 model_runner.py:1025] Loading model weights took 2.3185 GB
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:34 gpu_executor.py:122] # GPU blocks: 138166, # CPU blocks: 8192
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:36 model_runner.py:1329] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:36 model_runner.py:1333] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(asd pid=437701, ip=10.169.4.134) INFO 12-30 13:48:46 model_runner.py:1456] Graph capturing finished in 10 secs.
Compiling FSM index for all state transitions: 100%|██████████| 47/47 [00:02<00:00, 21.38it/s]
Processed prompts: 100%|██████████| 100/100 [00:01<00:00, 94.83it/s, est. speed input: 6828.35 toks/s, output: 3900.61 toks/s]
### Steps/code to reproduce the bug:

New outlines

```python
"""Example of integrating `outlines` with `vllm`."""
import vllm
from pydantic import BaseModel
from transformers import AutoTokenizer
from outlines.models.vllm import adapt_tokenizer
from outlines.processors import JSONLogitsProcessor

class Person(BaseModel):
    name: str
    description: str

MODEL_ID = "/data/models/llama-v3.2-1b-it/"
llm = vllm.LLM(model=MODEL_ID)
tokenizer = adapt_tokenizer(AutoTokenizer.from_pretrained(MODEL_ID))
logits_processor = JSONLogitsProcessor(schema=Person, tokenizer=tokenizer)

result = llm.generate(
    ["""<s>[INST] <<SYS>> You are a json text extractor. return the following json {"name": "the game name", "description": "description of the game in around 400 words"} <</SYS>> { CD Projekt Red is ramping up production on The Witcher 4, and of course it's looking into using AI } [/INST]"""] * 100,
    sampling_params=vllm.SamplingParams(
        temperature=0.6,
        max_tokens=1024,
        logits_processors=[logits_processor],
    ),
)
print(result)
```
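The speeds quoted above are vLLM's own tqdm estimates ("est. speed ... output toks/s"). If useful, here is a minimal sketch for cross-checking output-token throughput directly from the returned `RequestOutput` objects; `prompts` is just a stand-in name for the same list of 100 prompts used above:

```python
# Sketch only: time the generate call and count generated tokens directly,
# instead of reading tqdm's "est. speed" line. `prompts` is assumed to be the
# same ["<s>[INST] ..."] * 100 list from the snippet above.
import time

start = time.perf_counter()
result = llm.generate(
    prompts,
    sampling_params=vllm.SamplingParams(
        temperature=0.6,
        max_tokens=1024,
        logits_processors=[logits_processor],
    ),
)
elapsed = time.perf_counter() - start

# Each RequestOutput carries its generated sequences in `.outputs`;
# `token_ids` is the list of generated token ids for that sequence.
total_output_tokens = sum(len(r.outputs[0].token_ids) for r in result)
print(f"output speed: {total_output_tokens / elapsed:.1f} toks/s")
```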
Old outlines

```python
import ray
import vllm
from pydantic import BaseModel
from outlines.integrations.vllm import JSONLogitsProcessor

class Person(BaseModel):
    name: str
    description: str

MODEL_ID = "/data/models/llama-v3.2-1b-it/"
llm = vllm.LLM(model=MODEL_ID)
logits_processor = JSONLogitsProcessor(schema=Person, llm=llm)

result = llm.generate(
    ["""<s>[INST] <<SYS>> You are a json text extractor. return the following json {"name": "the game name", "description": "description of the game"} <</SYS>> { CD Projekt Red is ramping up production on The Witcher 4, and of course it's looking into using AI } [/INST]"""] * 100,
    sampling_params=vllm.SamplingParams(
        temperature=0.6,
        max_tokens=1024,
        logits_processors=[logits_processor],
    ),
)
print(result)
```
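The only intended difference between the two snippets is how the logits processor is built: the old `outlines.integrations.vllm.JSONLogitsProcessor` is constructed directly from the `llm` object, while the new `outlines.processors.JSONLogitsProcessor` takes a tokenizer wrapped with `adapt_tokenizer`. The model, sampling parameters, and prompts are otherwise the same (apart from the "in around 400 words" wording in the new-outlines prompt).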
### Expected result:

A similar speed for both versions, which is what I get if I remove the outlines call (`logits_processors=[logits_processor]`).
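For reference, that baseline run is just the same `generate` call without the outlines processor; a rough sketch (again with `prompts` standing in for the list of 100 prompts):

```python
# Baseline (no constrained decoding): same prompts and sampling settings,
# just without the outlines logits processor, so vLLM decodes unconstrained.
result = llm.generate(
    prompts,
    sampling_params=vllm.SamplingParams(
        temperature=0.6,
        max_tokens=1024,
    ),
)
```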
### Error message:

No response

### Outlines/Python version information:

vllm v0.6.5 with outlines 0.1.8, and vllm v0.6.1 with outlines 0.0.46; see the logs above.

### Context for the issue:

No response