SageAttention 1.0 runs slower than FA2 on A100 #70
Comments
Thank you for reaching out. You need to use SageAttention 2.0 on A100 GPUs.
Also, SageAttention 2.0 requires CUDA >= 12.4.
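For reference, a quick way to check which CUDA toolkit your PyTorch build ships with (the 12.4 threshold mentioned above) is the minimal sketch below; note that `torch.version.cuda` is `None` on CPU-only builds.

```python
import torch

# CUDA version PyTorch was built against, e.g. "12.4"; None on CPU-only builds.
cuda = torch.version.cuda
assert cuda is not None, "This PyTorch build has no CUDA support"
major, minor = (int(x) for x in cuda.split("."))
assert (major, minor) >= (12, 4), f"SageAttention 2.0 needs CUDA >= 12.4, found {cuda}"
print(f"CUDA {cuda} is new enough for SageAttention 2.0")
```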
Got it.
@jt-zhang I still don't see a significant speedup on A100 with CUDA 12.4.
@nighting0le01 Could you please elaborate on the details?
Hi, I am trying to reproduce the speed improvements of SageAttention 2 on A100.
Details of the input and the result: (attachments not reproduced here)
@SJTU-yys the sequence length in your test case is 1024, which is quite small. Can you try longer sequence lengths, such as 8k or 16k?
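For anyone wanting to repeat this comparison, here is a minimal sketch of a sequence-length sweep. It assumes the `sageattn` entry point from the `sageattention` pip package with tensors in (batch, heads, seq, head_dim) layout, PyTorch 2.3+ for `torch.nn.attention.sdpa_kernel`, and FlashAttention 2 reached through the SDPA flash backend; the shapes are illustrative, not the poster's original benchmark.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel
from sageattention import sageattn  # assumed entry point of the `sageattention` package


def bench_ms(fn, *args, iters=50, warmup=10):
    # Average latency in milliseconds, measured with CUDA events after a warmup.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters


batch, heads, head_dim = 4, 32, 128  # illustrative shapes, not the original test case
for seq_len in (1024, 8192, 16384):
    q, k, v = (torch.randn(batch, heads, seq_len, head_dim,
                           device="cuda", dtype=torch.float16) for _ in range(3))
    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        fa2_ms = bench_ms(F.scaled_dot_product_attention, q, k, v)
    sage_ms = bench_ms(sageattn, q, k, v)
    print(f"seq_len={seq_len}: FA2 {fa2_ms:.3f} ms, SageAttention {sage_ms:.3f} ms")
```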
I tried both 8192 and 16384 sequence lengths and see some improvement now. Do these results meet your expectations?
@SJTU-yys can you measure the result in TFLOPS?
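A common way to convert a measured latency into TFLOPS is the standard attention FLOP count, 4 * B * H * S^2 * D for a non-causal forward pass (QK^T and PV each contribute 2 * B * H * S^2 * D); a rough sketch:

```python
def attention_tflops(batch, heads, seq_len, head_dim, latency_ms, causal=False):
    # QK^T and PV each cost 2 * S^2 * D multiply-adds per head and batch element.
    flops = 4 * batch * heads * seq_len ** 2 * head_dim
    if causal:
        flops //= 2  # causal masking skips roughly half the work
    return flops / (latency_ms * 1e-3) / 1e12


# Example: turn the latencies from the sweep above into TFLOPS.
# print(attention_tflops(4, 32, 8192, 128, fa2_ms))
```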
@jason-huang03 @SJTU-yys I'm curious about SageAttention's performance under large batch sizes and variable-length (varlen) conditions. Do you have any relevant benchmarks for this?
Hi @jason-huang03, I was wondering whether SageAttention 2 is consistently faster than FlashAttention 2. We have a scenario involving a 0.5B model running on an A100 MIG instance with a relatively small sequence length and batch size (approximately batch size 8 and sequence length 384). Would it be possible to use SageAttention to achieve better performance in this case? Thank you for your assistance!
@lauthu |
Benchmark code: (not reproduced here)
Result: (not reproduced here)
Version: (not reproduced here)
cc @jt-zhang @jason-huang03
Is there anything wrong with my code, or anything I could improve?