
High GPU memory usage when loading model #1636

Open
jonathantainer opened this issue Jan 28, 2025 · 3 comments
@jonathantainer

OS: Windows 11 Pro
CPU: Intel Core Ultra 7 165U
Memory: 64 GB DDR5
openvino-genai: 2024.6.0.0

When loading a model with LLMPipeline, GPU memory usage climbs to roughly double the size of the model. Once the model finishes loading and token generation begins, usage drops back to the expected amount, slightly more than the size of the model.

I am using https://huggingface.co/Qwen/Qwen2-7B quantized to int4, which I converted with the following command:

optimum-cli export openvino --model "Qwen/Qwen2-7B" --weight-format int4 --trust-remote-code "Qwen/Qwen2-7B"
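For reference, the same int4 export can also be done from Python via the optimum-intel API. This is only a sketch: it assumes a recent optimum-intel release, the output directory mirrors the CLI command above, and depending on the version the OpenVINO tokenizer files that LLMPipeline needs may still have to be converted separately.

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "Qwen/Qwen2-7B"
out_dir = "./Qwen/Qwen2-7B"  # same output directory as the CLI command above

# Export to OpenVINO IR with int4 weight compression
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    trust_remote_code=True,
    quantization_config=OVWeightQuantizationConfig(bits=4),
)
model.save_pretrained(out_dir)

# Save the Hugging Face tokenizer alongside the IR
AutoTokenizer.from_pretrained(model_id, trust_remote_code=True).save_pretrained(out_dir)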

And the following code reproduces this behavior:

import openvino_genai as ov_genai

# Directory containing the OpenVINO IR exported by optimum-cli
model_path = "./Qwen/Qwen2-7B"

# Compiling the model for the GPU is where the memory spike occurs
pipe = ov_genai.LLMPipeline(model_path, "GPU")

# Stream generated tokens to stdout as they are produced
streamer = lambda x: print(x, end='', flush=True)
pipe.generate("The Sun is yellow because", streamer=streamer, max_new_tokens=100)

As pictured below, GPU memory usage rises to roughly 16 GB while the model is loading, then drops to 5.7 GB while generating text.
[Image: GPU memory usage during model loading and during text generation]

@Aznie-Intel

Hi @jonathantainer, the GPU memory spike you observe while the model loads in OpenVINO's LLMPipeline is expected: temporary buffer allocations, weight conversions, and graph optimizations increase memory usage during compilation, and usage stabilizes once inference starts. To reduce it, you can:

- use "AUTO" mode to balance CPU and GPU,
- enable model caching (CACHE_DIR),
- limit GPU memory (GPU_MEMORY_LIMIT),
- precompile the model, and
- reduce KV cache growth by limiting the number of generated tokens.
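As a minimal sketch of the caching and token-limit suggestions (assuming the Python LLMPipeline forwards extra keyword arguments as compile-time properties, as recent releases do, and using a hypothetical cache directory):

import openvino_genai as ov_genai

model_path = "./Qwen/Qwen2-7B"

# Cache compiled blobs on disk so subsequent loads skip recompilation.
# CACHE_DIR is a standard OpenVINO property; "./model_cache" is a placeholder.
pipe = ov_genai.LLMPipeline(model_path, "GPU", CACHE_DIR="./model_cache")

# Alternatively, let OpenVINO balance devices automatically:
# pipe = ov_genai.LLMPipeline(model_path, "AUTO")

# Keep the KV cache small by capping the number of generated tokens.
config = ov_genai.GenerationConfig()
config.max_new_tokens = 50
print(pipe.generate("The Sun is yellow because", config))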

@jonathantainer
Author

How would I import precompiled models into openvino.genai? As far as I can tell, the API only supports loading models as OpenVINO IR.

@Aznie-Intel

Currently, openvino.genai.LLMPipeline only supports models exported using optimum-cli export openvino. If you have a precompiled OpenVINO IR model (.xml and .bin), it won’t work with LLMPipeline unless it includes the necessary metadata and tokenizer files.
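For reference, a rough sketch of producing the tokenizer files that LLMPipeline looks for next to an existing IR, using the openvino-tokenizers package (the directory path is a placeholder, and the model's config.json and generation metadata still need to be present alongside the IR):

from openvino import save_model
from openvino_tokenizers import convert_tokenizer
from transformers import AutoTokenizer

ir_dir = "./Qwen/Qwen2-7B"  # directory already containing openvino_model.xml/.bin

# Convert the Hugging Face tokenizer into the OpenVINO tokenizer/detokenizer
# models that LLMPipeline expects next to the model IR.
hf_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B", trust_remote_code=True)
ov_tokenizer, ov_detokenizer = convert_tokenizer(hf_tokenizer, with_detokenizer=True)

save_model(ov_tokenizer, f"{ir_dir}/openvino_tokenizer.xml")
save_model(ov_detokenizer, f"{ir_dir}/openvino_detokenizer.xml")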
