
High GPU memory usage when loading model #1636

Open
jonathantainer opened this issue Jan 28, 2025 · 3 comments
@jonathantainer

OS: Windows 11 Pro
CPU: Intel Core Ultra 7 165U
Memory: 64 GB DDR5
openvino-genai: 2024.6.0.0

When loading a model with LLMPipeline, GPU memory usage climbs to roughly double the size of the model. Once the model finishes loading and token generation begins, usage drops back to the expected amount, slightly more than the size of the model.

I am using https://huggingface.co/Qwen/Qwen2-7B quantized to int4, which I converted with the following command:

optimum-cli export openvino --model "Qwen/Qwen2-7B" --weight-format int4 --trust-remote-code "Qwen/Qwen2-7B"
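For reference, the same int4 export can also be done from Python via the optimum-intel API. This is only a sketch: it assumes a recent optimum-intel release, the output directory mirrors the CLI command above, and depending on the version the OpenVINO tokenizer files that LLMPipeline needs may still have to be converted separately.

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "Qwen/Qwen2-7B"
out_dir = "./Qwen/Qwen2-7B"  # same output directory as the CLI command above

# Export to OpenVINO IR with int4 weight compression
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    trust_remote_code=True,
    quantization_config=OVWeightQuantizationConfig(bits=4),
)
model.save_pretrained(out_dir)

# Save the Hugging Face tokenizer alongside the IR
AutoTokenizer.from_pretrained(model_id, trust_remote_code=True).save_pretrained(out_dir)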

And the following code reproduces this behavior:

import openvino_genai as ov_genai

# Directory containing the OpenVINO IR exported by optimum-cli
model_path = "./Qwen/Qwen2-7B"

# Compiling the model for the GPU is where the memory spike occurs
pipe = ov_genai.LLMPipeline(model_path, "GPU")

# Stream generated tokens to stdout as they are produced
streamer = lambda x: print(x, end='', flush=True)
pipe.generate("The Sun is yellow because", streamer=streamer, max_new_tokens=100)

As pictured below, GPU memory usage rises to roughly 16 GB while the model is loading, then drops to 5.7 GB while generating text.
[Image: GPU memory usage during model loading and during text generation]

@Aznie-Intel

Hi @jonathantainer, the GPU memory spike you observe while the model loads in OpenVINO's LLMPipeline is expected: temporary buffer allocations, weight conversions, and graph optimizations increase memory usage during compilation, and usage stabilizes once inference starts. To reduce it, you can:

- use "AUTO" mode to balance CPU and GPU,
- enable model caching (CACHE_DIR),
- limit GPU memory (GPU_MEMORY_LIMIT),
- precompile the model, and
- reduce KV cache growth by limiting the number of generated tokens.
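As a minimal sketch of the caching and token-limit suggestions (assuming the Python LLMPipeline forwards extra keyword arguments as compile-time properties, as recent releases do, and using a hypothetical cache directory):

import openvino_genai as ov_genai

model_path = "./Qwen/Qwen2-7B"

# Cache compiled blobs on disk so subsequent loads skip recompilation.
# CACHE_DIR is a standard OpenVINO property; "./model_cache" is a placeholder.
pipe = ov_genai.LLMPipeline(model_path, "GPU", CACHE_DIR="./model_cache")

# Alternatively, let OpenVINO balance devices automatically:
# pipe = ov_genai.LLMPipeline(model_path, "AUTO")

# Keep the KV cache small by capping the number of generated tokens.
config = ov_genai.GenerationConfig()
config.max_new_tokens = 50
print(pipe.generate("The Sun is yellow because", config))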

@jonathantainer
Author

How would I import precompiled models into openvino.genai? As far as I can tell, the API only supports loading models as OpenVINO IR.

@Aznie-Intel

Currently, openvino.genai.LLMPipeline only supports models exported using optimum-cli export openvino. If you have a precompiled OpenVINO IR model (.xml and .bin), it won’t work with LLMPipeline unless it includes the necessary metadata and tokenizer files.
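For reference, a rough sketch of producing the tokenizer files that LLMPipeline looks for next to an existing IR, using the openvino-tokenizers package (the directory path is a placeholder, and the model's config.json and generation metadata still need to be present alongside the IR):

from openvino import save_model
from openvino_tokenizers import convert_tokenizer
from transformers import AutoTokenizer

ir_dir = "./Qwen/Qwen2-7B"  # directory already containing openvino_model.xml/.bin

# Convert the Hugging Face tokenizer into the OpenVINO tokenizer/detokenizer
# models that LLMPipeline expects next to the model IR.
hf_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B", trust_remote_code=True)
ov_tokenizer, ov_detokenizer = convert_tokenizer(hf_tokenizer, with_detokenizer=True)

save_model(ov_tokenizer, f"{ir_dir}/openvino_tokenizer.xml")
save_model(ov_detokenizer, f"{ir_dir}/openvino_detokenizer.xml")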
