[Performance] #21635
Labels
model:transformer
issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.
performance
issues related to performance regressions
quantization
issues related to quantization
Describe the issue
We converted a 7B translation LLM to ONNX format using Hugging Face Optimum and then quantized it to 8-bit using dynamic quantization. However, when the quantized model is loaded with ONNX Runtime (ORT), the translation output is identical to the input text; no translation is performed.
To reproduce
Convert the 7B translation LLM to ONNX format using Hugging Face Optimum.
Quantize the ONNX model to 8-bit (Q8) using dynamic quantization.
Load the quantized model using ONNX Runtime.
Perform a translation task with the loaded model.
Urgency
Yes, it is urgent.
Expected Behavior:
The model should translate the input text to the target language as expected.
Actual Behavior:
The translation output is the same as the input, indicating that the model is not performing the translation task.
Platform
Linux
OS Version
20.04
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.18.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.6
Model File
Is this a quantized model?
Yes