I am referring to https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/llama/smooth_quant to quantize a Llama chat model and then run inference on it. I have successfully created a quantized version; however, the responses from the model are not satisfying. Below is the code snippet I am using for inference.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import pipeline, LlamaTokenizer
import torch

# Load the SmoothQuant-quantized ONNX export and move it to the GPU.
onnx_path = "./onnx_q/"
opt_model = ORTModelForCausalLM.from_pretrained(onnx_path, file_name="model.onnx").to('cuda')
tokenizer = LlamaTokenizer.from_pretrained(onnx_path)

# Wrap the ONNX model in a standard transformers text-generation pipeline.
opt_optimum_generator = pipeline("text-generation", model=opt_model, tokenizer=tokenizer, device='cuda')

prompt = "what is ai ?"
generated_text = opt_optimum_generator(prompt, max_length=254, num_return_sequences=1, truncation=True)
print(generated_text[0]['generated_text'])
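One thing I am not sure about: since this is a chat model, the bare prompt may not match the format the model was fine-tuned on. Below is a minimal sketch of what I mean, reusing opt_optimum_generator from the snippet above (this assumes the checkpoint is a Llama-2 chat variant; the helper function is only illustrative and not part of the example repo):

def build_llama2_chat_prompt(user_message, system_message="You are a helpful assistant."):
    # Llama-2 chat fine-tunes expect [INST] ... [/INST] with an optional <<SYS>> block;
    # the BOS token is added by the tokenizer, so it is not included in the string.
    return f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n{user_message} [/INST]"

chat_prompt = build_llama2_chat_prompt("what is ai ?")
generated = opt_optimum_generator(chat_prompt, max_new_tokens=200, num_return_sequences=1)
print(generated[0]["generated_text"])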
Am I doing something wrong?
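To narrow down whether the quality drop comes from the quantization step itself, here is a minimal sketch of an A/B check against the unquantized checkpoint (meta-llama/Llama-2-7b-chat-hf is only a placeholder, since the exact source checkpoint is not stated above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

baseline_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; substitute the checkpoint that was quantized
baseline_tokenizer = AutoTokenizer.from_pretrained(baseline_id)
baseline_model = AutoModelForCausalLM.from_pretrained(baseline_id, torch_dtype=torch.float16).to("cuda")
baseline_generator = pipeline("text-generation", model=baseline_model, tokenizer=baseline_tokenizer)

prompt = "what is ai ?"
print("FP16 baseline:", baseline_generator(prompt, max_new_tokens=200)[0]["generated_text"])
print("INT8 ONNX    :", opt_optimum_generator(prompt, max_new_tokens=200)[0]["generated_text"])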