OS: Windows 11 Pro
CPU: Intel Core Ultra 7 165U
Memory: 64 GB DDR5
openvino-genai: 2024.6.0.0
When loading a model with `LLMPipeline`, GPU memory usage increases to roughly double the size of the model. Once the model finishes loading and begins generating tokens, GPU memory usage decreases to the expected amount, slightly greater than the size of the model.
I am using https://huggingface.co/Qwen/Qwen2-7B at int4 quantization, which I converted using the following command:
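(The exact command did not survive extraction; a representative `optimum-cli` invocation for an int4 export would look like the sketch below, with an illustrative output directory name.)

```sh
# Export Qwen2-7B to OpenVINO IR with int4 weight compression.
# "Qwen2-7B-int4-ov" is an illustrative output directory, not the
# reporter's actual path.
optimum-cli export openvino --model Qwen/Qwen2-7B --weight-format int4 Qwen2-7B-int4-ov
```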
And the following code reproduces this behavior:
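(The original snippet also did not survive; a minimal sketch of the load-then-generate pattern described, assuming the export directory from the command above:)

```python
import openvino_genai as ov_genai

# Load the int4 model onto the GPU; memory usage peaks during this call.
pipe = ov_genai.LLMPipeline("Qwen2-7B-int4-ov", "GPU")

# Once token generation starts, GPU memory settles near the model size.
print(pipe.generate("What is OpenVINO?", max_new_tokens=100))
```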
In the attached screenshot, GPU memory usage increases to roughly 16 GB while the model loads, then decreases to 5.7 GB once text generation begins.
Hi @jonathantainer, the GPU memory spike observed during model loading with `LLMPipeline` is expected: temporary buffer allocations, weight conversions, and graph optimizations briefly raise memory usage before it stabilizes during inference. To reduce the peak, you can use the "AUTO" device mode to balance CPU and GPU, enable model caching (`CACHE_DIR`), limit GPU memory (`GPU_MEMORY_LIMIT`), precompile the model, or limit token generation to bound KV cache growth.

Currently, `openvino.genai.LLMPipeline` only supports models exported with `optimum-cli export openvino`. If you have a precompiled OpenVINO IR model (`.xml` and `.bin`), it won't work with `LLMPipeline` unless it includes the necessary metadata and tokenizer files.
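(A hedged sketch of the caching and generation-cap suggestions above; the property name follows the reply, and forwarding it as a keyword argument is an assumption about the Python binding:)

```python
import openvino_genai as ov_genai

# Assumption: CACHE_DIR is forwarded as a pipeline property so the
# compiled blob is cached and subsequent loads skip recompilation.
pipe = ov_genai.LLMPipeline("Qwen2-7B-int4-ov", "GPU", CACHE_DIR="./ov_cache")

# Capping max_new_tokens bounds KV cache growth during generation.
print(pipe.generate("What is OpenVINO?", max_new_tokens=100))
```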