
[NPU][LNL] LLM inference on LNL NPU is very slow #1563

Open
johnysh opened this issue Jan 16, 2025 · 3 comments


johnysh commented Jan 16, 2025

[OS]: Windows 11
[Platform]: Intel(R) Core(TM) Ultra 7 258V @ 2.20 GHz
[RAM]: 32 GB
[NPU driver]: 32.0.100.3104
ENV:

https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html

pip install nncf==2.12 onnx==1.16.1 optimum-intel==1.19.0
pip install openvino==2024.6 openvino-tokenizers==2024.6 openvino-genai==2024.6

PIP LIST:
openvino 2024.6.0
openvino-genai 2024.6.0.0
openvino-telemetry 2024.5.0
openvino-tokenizers 2024.6.0.0
optimum 1.23.3
optimum-intel 1.19.0

Code:
https://github.com/openvinotoolkit/openvino.genai

CMD:
optimum-cli export openvino -m TheBloke/Llama-2-7B-Chat-GPTQ Llama-2-7B-Chat-GPTQ

python\benchmark_genai>python ./benchmark_genai.py -m Llama-2-7B-Chat-GPTQ -d NPU
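For reference, the benchmark script above boils down to a single `openvino_genai.LLMPipeline` call where the device string selects CPU, GPU, or NPU. A minimal sketch, assuming the openvino-genai 2024.6 Python API and the exported model directory from the `optimum-cli` command above (guarded so it degrades gracefully when the package is not installed):

```python
# Minimal sketch of the core call benchmark_genai.py makes.
# Assumes openvino-genai 2024.6 and the "Llama-2-7B-Chat-GPTQ"
# directory produced by the optimum-cli export above.
try:
    import openvino_genai as ov_genai

    # The device string is the only change needed to switch
    # between "CPU", "GPU" (iGPU), and "NPU".
    pipe = ov_genai.LLMPipeline("Llama-2-7B-Chat-GPTQ", "NPU")
    answer = pipe.generate("What is OpenVINO?", max_new_tokens=64)
    print(answer)
except ImportError:
    # openvino-genai not available in this environment
    answer = None
```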

Result:

(screenshot: benchmark_genai.py output on NPU showing the very slow generation; raw numbers not recoverable from the page)

@Wan-Intel

Could you please run the following command and share the result with us?
python\benchmark_genai>python ./benchmark_genai.py -m Llama-2-7B-Chat-GPTQ -d CPU


johnysh commented Jan 20, 2025

CPU Result:

Load time: 1260.00 ms
Generate time: 1786.82 ± 66.43 ms
Tokenization time: 0.48 ± 0.04 ms
Detokenization time: 0.43 ± 0.04 ms
TTFT: 288.04 ± 43.60 ms
TPOT: 78.86 ± 19.94 ms
Throughput: 12.68 ± 3.21 tokens/s

iGPU Result:

Load time: 77268.00 ms
Generate time: 917.43 ± 2.66 ms
Tokenization time: 0.54 ± 0.04 ms
Detokenization time: 0.56 ± 0.00 ms
TTFT: 52.42 ± 0.39 ms
TPOT: 45.47 ± 2.41 ms
Throughput: 21.99 ± 1.17 tokens/s
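As a sanity check, the reported throughput is consistent with the per-token latency: throughput (tokens/s) is roughly 1000 / TPOT (ms/token). A quick check in pure Python, using the numbers from the two result blocks above:

```python
# Verify reported throughput against TPOT for the CPU and iGPU runs above:
# throughput (tok/s) ~= 1000 / TPOT (ms per output token).
results = {
    "CPU":  {"tpot_ms": 78.86, "reported_tps": 12.68},
    "iGPU": {"tpot_ms": 45.47, "reported_tps": 21.99},
}
for device, r in results.items():
    derived_tps = 1000.0 / r["tpot_ms"]
    print(f"{device}: derived {derived_tps:.2f} tok/s "
          f"vs reported {r['reported_tps']} tok/s")
```

Both derived values match the reported throughput to two decimal places, so the NPU slowdown in the original report should show up as a proportionally large TPOT.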

@Wan-Intel

Thanks for sharing the information. I'll escalate the case to the relevant team, and we'll provide an update as soon as possible.

@Wan-Intel Wan-Intel added the PSE label Jan 23, 2025
@avitial avitial self-assigned this Jan 31, 2025