[Performance] #21635
Labels
model:transformer
issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.
performance
issues related to performance regressions
quantization
issues related to quantization
Describe the issue
We converted a 7B translation LLM to ONNX format using Hugging Face Optimum and then quantized it to 8-bit using dynamic quantization. However, when the quantized model is loaded with ONNX Runtime (ORT), the translation output is identical to the input text; no translation is performed.
To reproduce
Convert the 7B translation LLM to ONNX format using Hugging Face Optimum.
Quantize the ONNX model to 8-bit (Q8) using dynamic quantization.
Load the quantized model using ONNX Runtime.
Perform a translation task with the loaded model.
Urgency
Yes, it is urgent.
Expected Behavior:
The model should translate the input text to the target language as expected.
Actual Behavior:
The translation output is the same as the input, indicating that the model is not performing the translation task.
Platform
Linux
OS Version
20.04
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.18.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.6
Model File
Is this a quantized model?
Yes