
[Performance] #21635

Open
chakka12345677 opened this issue Aug 6, 2024 · 1 comment
Labels
model:transformer (issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.)
performance (issues related to performance regressions)
quantization (issues related to quantization)

Comments

chakka12345677 commented Aug 6, 2024

Describe the issue

We converted a 7B translation LLM to ONNX format using Hugging Face Optimum and then quantized it to 8-bit with dynamic quantization. However, when the quantized model is loaded with ONNX Runtime (ORT), the translation output is identical to the input; the text is not translated at all.

To reproduce

1. Convert the 7B translation LLM to ONNX format using Hugging Face Optimum.
2. Quantize the ONNX model to 8-bit (Q8) with dynamic quantization.
3. Load the quantized model using ONNX Runtime.
4. Perform a translation task with the loaded model.
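The steps above can be sketched roughly as follows. This is a minimal illustration, not the reporter's actual script: the file paths and the helper name are placeholders, and the Optimum export step is shown as a CLI comment.

```python
# Hypothetical sketch of steps 1-2 above; paths and names are placeholders.
# Step 1 (export with Hugging Face Optimum) is typically done from the CLI:
#   optimum-cli export onnx --model <hf-model-id> onnx_model/
# Step 2: 8-bit dynamic quantization with ONNX Runtime's quantization tool,
# which quantizes weights to int8 ahead of time and activations at runtime.
try:
    from onnxruntime.quantization import QuantType, quantize_dynamic

    def quantize_to_int8(src_model: str, dst_model: str) -> None:
        # Dynamic quantization needs no calibration data, which is why it
        # is a common first attempt for large decoder models.
        quantize_dynamic(src_model, dst_model, weight_type=QuantType.QInt8)
except ImportError:
    quantize_to_int8 = None  # onnxruntime is not installed in this environment
```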

Urgency

Yes, it is urgent.
Expected Behavior:
The model should translate the input text to the target language as expected.

Actual Behavior:
The translation output is the same as the input, indicating that the model is not performing the translation task.

Platform

Linux

OS Version

20.04

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.18.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

cuda 11.6

Model File

dynamic_quantization

Is this a quantized model?

Yes

@chakka12345677 added the performance label Aug 6, 2024
@sophies927 added the quantization and model:transformer labels Aug 8, 2024
@yufenglee
Member

Are you able to get correct results without quantization? Naive int8 quantization usually doesn't work well. Could you please try creating and running the model with ONNX Runtime GenAI?

And here is the example to create the model:
https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/README.md#get-the-model
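A rough sketch of the suggested model-builder step, assuming the `onnxruntime-genai` package is installed; the model id, output directory, and the `build_genai_model` helper are placeholders, not values from this issue (see the linked README for the authoritative options).

```python
# Hypothetical wrapper around the onnxruntime-genai model builder CLI.
# Model id and output directory below are placeholders.
import importlib.util
import subprocess
import sys


def build_genai_model(model_id: str, out_dir: str) -> None:
    # Equivalent CLI, per the onnxruntime-genai README:
    #   python -m onnxruntime_genai.models.builder \
    #       -m <hf-model-id> -o <output-dir> -p int4 -e cuda
    subprocess.run(
        [sys.executable, "-m", "onnxruntime_genai.models.builder",
         "-m", model_id, "-o", out_dir, "-p", "int4", "-e", "cuda"],
        check=True,
    )


# Only attempt the build if the package is actually available.
if importlib.util.find_spec("onnxruntime_genai") is None:
    print("onnxruntime-genai is not installed; skipping the build step")
```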
