Shouldn't Burn be a lot faster than Python Transformers??? Not 8 times slower... #48

Open
profitgrowinginnovator opened this issue Nov 21, 2024 · 2 comments


@profitgrowinginnovator

profitgrowinginnovator commented Nov 21, 2024

Describe the bug
Running the same TinyLlama 1.1B Chat model with Python Transformers on a CUDA Tesla P100 takes 4.59 seconds to load and 2.46 seconds to generate, producing 50.47 tokens per second. With Burn, however, it takes 9 seconds to load the model and 10 seconds to generate, and only around 6.4 tokens are generated per second. What is the point of "fast Rust" if it is almost 8 times slower, or is there a mistake?

To Reproduce
Here is the Python code:
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
user_prompt = "How many helicopters can a human eat in one sitting?"
system_prompt = (
    "<|system|>\nYou are a friendly chatbot who always responds in the style of a pirate\n"
    "<|user|>\n{prompt}\n<|assistant|>\n"
).format(prompt=user_prompt)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

start_time = time.time()
print("Loading tokenizer and model...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': tokenizer.eos_token})
    model.resize_token_embeddings(len(tokenizer))

load_time = time.time() - start_time
print(f"Model loaded in {load_time:.2f} seconds.")

print("Tokenizing input...")
inputs = tokenizer(
    system_prompt,
    return_tensors="pt",
    padding=True,
    truncation=True,
).to(device)

print("Generating response...")
start_time = time.time()

outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_length=200,   # Limit response length
    temperature=0.7,  # Adjusts randomness in output
    top_p=0.9,        # Nucleus sampling
    do_sample=True,
)

generation_time = time.time() - start_time
num_tokens = outputs.shape[1]  # Number of tokens generated
tokens_per_second = num_tokens / generation_time
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Response:\n", response)
print(f"Generation completed in {generation_time:.2f} seconds.")
print(f"Tokens generated per second: {tokens_per_second:.2f}")

And here is the Burn solution:
Download the llama-burn code from https://github.com/tracel-ai/models.git, then run:
cargo build --release --features tiny,cuda --example chat
./target/release/examples/chat --top-p 0.9 --temperature=0.7 --max-seq-len=200

Expected behavior
Rust should be many times faster, not 8 times slower.

Screenshots
[Screenshot: python-transformers output]

[Screenshot: burn-llama-tiny output]

Desktop (please complete the following information):

  • OS: Ubuntu 24.04
@laggui laggui transferred this issue from tracel-ai/burn Nov 21, 2024
@laggui
Member

laggui commented Nov 21, 2024

Moved this to the models repo since this is an implementation issue.

The current implementation is not fully optimized 😅 (the cache, sampling, and flash attention come to mind).

We should be working on that pretty soon 👀
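
For context, the KV cache mentioned above avoids recomputing keys and values for the whole sequence at every decoding step, which is one reason naive generation slows down as the output grows. Below is a minimal single-head sketch in Python, purely illustrative and not Burn's implementation; it uses stand-in random tensors instead of real projection weights:

import torch

d = 64
k_cache = torch.randn(10, d)  # keys for the 10 tokens decoded so far
v_cache = torch.randn(10, d)  # values for the 10 tokens decoded so far

q_new = torch.randn(1, d)     # query for the token being decoded
k_new = torch.randn(1, d)     # stand-in for the new token's key projection
v_new = torch.randn(1, d)     # stand-in for the new token's value projection

# With a cache, each step only appends one key/value pair and attends
# against the cache, instead of re-projecting the entire prefix.
k_cache = torch.cat([k_cache, k_new], dim=0)
v_cache = torch.cat([v_cache, v_new], dim=0)

scores = q_new @ k_cache.T / d ** 0.5          # shape: (1, seq_len + 1)
out = torch.softmax(scores, dim=-1) @ v_cache  # shape: (1, d)
print(out.shape)  # torch.Size([1, 64])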

@profitgrowinginnovator
Author

That is good to know, because I very much like how Burn works in general. Please let me know when the performance is optimized so I can update my blog post: https://mectors.medium.com/ai-explained-llm-performance-slow-python-transformers-fast-golang-rust-but-not-always-e3895f03c760
