I am referring to https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/llama/smooth_quant to quantize a Llama chat model and then run inference on it. I have successfully created a quantized version; however, the responses from the model are not satisfying. Below is the code snippet I am using for inference.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import pipeline, LlamaTokenizer
import torch

# Load the SmoothQuant-quantized ONNX export and move it to the GPU.
onnx_path = "./onnx_q/"
opt_model = ORTModelForCausalLM.from_pretrained(onnx_path, file_name="model.onnx").to('cuda')
tokenizer = LlamaTokenizer.from_pretrained(onnx_path)

# Wrap the ONNX model in a standard transformers text-generation pipeline.
opt_optimum_generator = pipeline("text-generation", model=opt_model, tokenizer=tokenizer, device='cuda')

prompt = "what is ai ?"
generated_text = opt_optimum_generator(prompt, max_length=254, num_return_sequences=1, truncation=True)
print(generated_text[0]['generated_text'])
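One thing I am not sure about: since this is a chat model, the bare prompt may not match the format the model was fine-tuned on. Below is a minimal sketch of what I mean, reusing opt_optimum_generator from the snippet above (this assumes the checkpoint is a Llama-2 chat variant; the helper function is only illustrative and not part of the example repo):

def build_llama2_chat_prompt(user_message, system_message="You are a helpful assistant."):
    # Llama-2 chat fine-tunes expect [INST] ... [/INST] with an optional <<SYS>> block;
    # the BOS token is added by the tokenizer, so it is not included in the string.
    return f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n{user_message} [/INST]"

chat_prompt = build_llama2_chat_prompt("what is ai ?")
generated = opt_optimum_generator(chat_prompt, max_new_tokens=200, num_return_sequences=1)
print(generated[0]["generated_text"])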
Am I doing something wrong?
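To narrow down whether the quality drop comes from the quantization step itself, here is a minimal sketch of an A/B check against the unquantized checkpoint (meta-llama/Llama-2-7b-chat-hf is only a placeholder, since the exact source checkpoint is not stated above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

baseline_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; substitute the checkpoint that was quantized
baseline_tokenizer = AutoTokenizer.from_pretrained(baseline_id)
baseline_model = AutoModelForCausalLM.from_pretrained(baseline_id, torch_dtype=torch.float16).to("cuda")
baseline_generator = pipeline("text-generation", model=baseline_model, tokenizer=baseline_tokenizer)

prompt = "what is ai ?"
print("FP16 baseline:", baseline_generator(prompt, max_new_tokens=200)[0]["generated_text"])
print("INT8 ONNX    :", opt_optimum_generator(prompt, max_new_tokens=200)[0]["generated_text"])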