On an A100 card, the speed-up does not show up. #51

Open

leocnj opened this issue Nov 30, 2023 · 2 comments

Comments


leocnj commented Nov 30, 2023

First, thanks very much for creating this cool technology.

On a single A100 GPU with 80 GB VRAM, I benchmarked sq-vicuna-7b-v1.3-w3-s0 against its FP16 base. It is a bit strange that the median running time is barely reduced, which differs from the speed-up results reported in your paper. Would you mind helping me trace a possible reason? Could it be related to my experiment running on a more powerful GPU?

Model   Median time (s)       PPL                  Max memory (MiB)
w3-s0   0.025365471839904785  16.07021141052246    3602.3271484375
FP16    0.02616262435913086   14.921088218688965   25906.5771484375
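
For reference, the implied per-token speed-up from these numbers is only 0.02616 / 0.02537 ≈ 1.03x, even though peak memory drops from ~25.9 GB to ~3.6 GB (≈ 7.2x); the memory saving shows up, but the latency saving does not.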

Script:

#!/bin/bash

# vicuna v1.3 Benchmarking
CUDA_VISIBLE_DEVICES=0 python llama.py models/sq-vicuna-7b-v1.3-w3-s0 c4 --wbits 3 --load models/sq-vicuna-7b-v1.3-w3-s0/sq-vicuna-7b-v1.3-w3-s0.pt --benchmark 128 --check 

# vicuna v1.3 base
# using the HF model name lets transformers reuse the local cache
CUDA_VISIBLE_DEVICES=0 python llama.py lmsys/vicuna-7b-v1.3 c4 --wbits 16 --benchmark 128 --check

@shiqingzhangCSU

Maybe the kernel is under-optimized.
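
One way to check is to time a single quantized linear layer against its FP16 counterpart in isolation, so the Python-level decode loop is excluded. A rough sketch, where quant_layer and dense_layer are placeholders for the corresponding modules pulled out of the two loaded checkpoints (not names from this repo):

import torch

def time_layer(layer, x, iters=200, warmup=20):
    # CUDA events measure GPU time only, excluding host-side launch overhead.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        for _ in range(warmup):
            layer(x)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            layer(x)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

# single-token decode shape for a 7B model (hidden size 4096)
x = torch.randn(1, 1, 4096, dtype=torch.float16, device="cuda")
print("w3 layer  :", time_layer(quant_layer, x), "ms")
print("fp16 layer:", time_layer(dense_layer, x), "ms")

If the quantized layer alone is clearly faster than the dense one, the kernel is fine and the overhead is elsewhere in the decode loop.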

@Qubitium

Also keep in mind that inference with dequantization is much more CPU-dependent than running native bf16/fp16 models. We have seen a 2.5x improvement when running quantized models on the same GPU but with a different CPU/memory.
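
A quick way to see whether the host side is the bottleneck on a given machine is to profile one decode-style forward pass. A minimal sketch, where model and input_ids are placeholders for the objects llama.py sets up for its benchmark:

import torch
from torch.profiler import profile, ProfilerActivity

with torch.no_grad(), profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(input_ids)  # one forward pass
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
# If CPU self time rivals CUDA time, the run is launch/dequant-bound on the host,
# and a faster CPU or fused kernels will matter more than a faster GPU.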
