Describe the bug
Running the same TinyLlama 1.1B Chat model with Python Transformers on a Tesla P100 (CUDA) takes 4.59 seconds to load, 2.46 seconds to generate, and produces 50.47 tokens per second. With Burn, the same model takes about 9 seconds to load, about 10 seconds to generate, and produces only around 6.4 tokens per second. What is the point of "Fast Rust" if it is almost 8 times slower (50.47 / 6.4 ≈ 7.9), or is there a mistake somewhere?
To Reproduce
Here is the Python code:
import time

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

user_prompt = "How many helicopters can a human eat in one sitting?"
system_prompt = (
    "<|system|>\nYou are a friendly chatbot who always responds in the style of a pirate\n"
    "<|user|>\n{prompt}\n<|assistant|>\n"
).format(prompt=user_prompt)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

start_time = time.time()
print("Loading tokenizer and model...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': tokenizer.eos_token})
    model.resize_token_embeddings(len(tokenizer))
load_time = time.time() - start_time
print(f"Model loaded in {load_time:.2f} seconds.")

print("Tokenizing input...")
inputs = tokenizer(
    system_prompt,
    return_tensors="pt",
    padding=True,
    truncation=True,
).to(device)

print("Generating response...")
start_time = time.time()
outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_length=200,   # Limit total length (prompt + response)
    temperature=0.7,  # Adjusts randomness in output
    top_p=0.9,        # Nucleus sampling
    do_sample=True,
)
generation_time = time.time() - start_time

num_tokens = outputs.shape[1]  # Total tokens in the output (prompt + generated)
tokens_per_second = num_tokens / generation_time
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Response:\n", response)
print(f"Generation completed in {generation_time:.2f} seconds.")
print(f"Tokens generated per second: {tokens_per_second:.2f}")
And here is the Burn solution:
Download the llama-burn code from https://github.com/tracel-ai/models.git and, from the llama-burn directory, build and run the chat example:
cargo build --release --features tiny,cuda --example chat
./target/release/examples/chat --top-p 0.9 --temperature=0.7 --max-seq-len=200
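For a more apples-to-apples comparison, it can help to time the Burn side the same way the Python script does: measure model load and generation separately and divide the token count by the generation time alone. Below is a minimal Rust timing sketch; generate_tokens is a hypothetical placeholder, so in practice you would wrap the actual load and generation calls of the llama-burn chat example (not reproduced here) with the same Instant-based timers.

use std::time::Instant;

// Hypothetical stand-in for the llama-burn generation call; replace the body
// with the actual model loading / generation code from the chat example.
fn generate_tokens(_prompt: &str, max_new_tokens: usize) -> Vec<u32> {
    vec![0; max_new_tokens]
}

fn main() {
    let load_start = Instant::now();
    // ... model loading would go here ...
    println!("Model loaded in {:.2} s", load_start.elapsed().as_secs_f64());

    let gen_start = Instant::now();
    let tokens = generate_tokens("How many helicopters can a human eat in one sitting?", 200);
    let gen_secs = gen_start.elapsed().as_secs_f64();
    println!(
        "Generated {} tokens in {:.2} s ({:.2} tokens/s)",
        tokens.len(),
        gen_secs,
        tokens.len() as f64 / gen_secs
    );
}

Note also that the Python script's throughput figure divides outputs.shape[1], which includes the prompt tokens, by the generation time, so the two tools may not be counting exactly the same thing.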
Expected behavior
Rust should be many times faster, not 8 times slower.
Desktop (please complete the following information):
OS: Ubuntu 24.04