Describe the bug
Running the same TinyLlama 1.1B Chat model with Python Transformers on a Tesla P100 (CUDA) takes 4.59 seconds to load, 2.46 seconds to generate, and produces 50.47 tokens per second. With Burn, the same model takes about 9 seconds to load, about 10 seconds to generate, and produces only around 6.4 tokens per second. What is the point of "Fast Rust" if it is almost 8 times slower (50.47 / 6.4 ≈ 7.9), or is there a mistake somewhere?
To Reproduce
Here is the Python code:
import time

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

user_prompt = "How many helicopters can a human eat in one sitting?"
system_prompt = (
    "<|system|>\nYou are a friendly chatbot who always responds in the style of a pirate\n"
    "<|user|>\n{prompt}\n<|assistant|>\n"
).format(prompt=user_prompt)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

start_time = time.time()
print("Loading tokenizer and model...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': tokenizer.eos_token})
    model.resize_token_embeddings(len(tokenizer))
load_time = time.time() - start_time
print(f"Model loaded in {load_time:.2f} seconds.")

print("Tokenizing input...")
inputs = tokenizer(
    system_prompt,
    return_tensors="pt",
    padding=True,
    truncation=True,
).to(device)

print("Generating response...")
start_time = time.time()
outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_length=200,   # Limit total length (prompt + response)
    temperature=0.7,  # Adjusts randomness in output
    top_p=0.9,        # Nucleus sampling
    do_sample=True,
)
generation_time = time.time() - start_time

num_tokens = outputs.shape[1]  # Total tokens in the output (prompt + generated)
tokens_per_second = num_tokens / generation_time
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Response:\n", response)
print(f"Generation completed in {generation_time:.2f} seconds.")
print(f"Tokens generated per second: {tokens_per_second:.2f}")
And here is the Burn solution:
Download the llama-burn code from https://github.com/tracel-ai/models.git and, from the llama-burn directory, build and run the chat example:
cargo build --release --features tiny,cuda --example chat
./target/release/examples/chat --top-p 0.9 --temperature=0.7 --max-seq-len=200
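For a more apples-to-apples comparison, it can help to time the Burn side the same way the Python script does: measure model load and generation separately and divide the token count by the generation time alone. Below is a minimal Rust timing sketch; generate_tokens is a hypothetical placeholder, so in practice you would wrap the actual load and generation calls of the llama-burn chat example (not reproduced here) with the same Instant-based timers.

use std::time::Instant;

// Hypothetical stand-in for the llama-burn generation call; replace the body
// with the actual model loading / generation code from the chat example.
fn generate_tokens(_prompt: &str, max_new_tokens: usize) -> Vec<u32> {
    vec![0; max_new_tokens]
}

fn main() {
    let load_start = Instant::now();
    // ... model loading would go here ...
    println!("Model loaded in {:.2} s", load_start.elapsed().as_secs_f64());

    let gen_start = Instant::now();
    let tokens = generate_tokens("How many helicopters can a human eat in one sitting?", 200);
    let gen_secs = gen_start.elapsed().as_secs_f64();
    println!(
        "Generated {} tokens in {:.2} s ({:.2} tokens/s)",
        tokens.len(),
        gen_secs,
        tokens.len() as f64 / gen_secs
    );
}

Note also that the Python script's throughput figure divides outputs.shape[1], which includes the prompt tokens, by the generation time, so the two tools may not be counting exactly the same thing.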
Expected behavior
Rust should be many times faster, not 8 times slower.
Desktop (please complete the following information):
OS: Ubuntu 24.04