2.6.3 is faster than 2.7.0 for flash-attn v2 CUDA fwd/bwd #1335
Comments
I'm guessing it's because we moved some of the checks and padding (i.e. checking whether headdim is a multiple of 8) from C++ to Python for compatibility with torch.compile. This might add a bit more Python overhead, so it's noticeable for small batches and short sequences (since the kernel will be very fast there).
What would be helpful is a profiler result (e.g. from the PyTorch profiler or Nsight Systems) showing the kernel time. If the kernel time stays the same, then we can say it's because of Python overhead. If the kernel time is very different, then we'll need to investigate.
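For readers following along, here is a rough sketch of the kind of head-dimension arithmetic such a Python-side check involves. The helper name `pad_to_multiple` is hypothetical and this is not the library's actual code; it only illustrates that a head dimension already divisible by 8 needs no padding, while the check itself still runs on every call.

```python
def pad_to_multiple(headdim: int, multiple: int = 8) -> int:
    """Return how many trailing elements must be added so that
    headdim becomes a multiple of `multiple` (0 if it already is)."""
    if headdim <= 0 or multiple <= 0:
        raise ValueError("headdim and multiple must be positive")
    return (-headdim) % multiple


if __name__ == "__main__":
    # Illustrative head dimensions only; not taken from the issue.
    for d in (64, 80, 100):
        pad = pad_to_multiple(d)
        print(f"headdim={d}: pad by {pad}, padded size {d + pad}")
```

The arithmetic is trivial, but any per-call Python work (shape checks, conditional padding, slicing the output back) adds a fixed cost that does not shrink with the problem size.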
Hey, thanks for the quick reply! I’ve been profiling with nsys on the A100 and can conclude that it’s likely Python overhead, as the kernel times appear identical for both versions 2.6.3 and 2.7.0.post2. I’m checking forward/backward passes for the same dimensions as mentioned earlier. Unfortunately, it seems that Python overhead becomes quite significant, especially when targeting smaller Q/K lengths and/or batch sizes.
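A simple additive model shows why identical kernel times can still produce a large end-to-end gap for small inputs. This is only a sketch: it assumes each call's wall time is roughly the fixed host-side overhead plus the kernel time, the function names are hypothetical, and the numbers are made up for illustration rather than measured in this thread.

```python
def overhead_share(wall_us: float, kernel_us: float) -> float:
    """Fraction of wall-clock time that is not spent inside the GPU kernel."""
    if wall_us <= 0:
        raise ValueError("wall_us must be positive")
    return max(wall_us - kernel_us, 0.0) / wall_us


def slowdown(kernel_us: float, overhead_old_us: float, overhead_new_us: float) -> float:
    """Ratio of new to old per-call time when only the host overhead changes."""
    return (kernel_us + overhead_new_us) / (kernel_us + overhead_old_us)


if __name__ == "__main__":
    # Hypothetical numbers: same overhead change, two different kernel durations.
    for kernel_us in (20.0, 2000.0):
        ratio = slowdown(kernel_us, overhead_old_us=30.0, overhead_new_us=80.0)
        print(f"kernel={kernel_us:.0f} us: new/old = {ratio:.3f}")
```

Under this model the same extra overhead is a large relative slowdown when the kernel is short and nearly invisible when the kernel is long, which matches the pattern described above.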
Yeah, we should introduce it as a baseline I guess. Will test it soon. ATM, this thread can be closed :) Thanks!
Hey, I have observed in my timing tests that version 2.6.3 is faster than some later commits (including 2.7.0.post2) for the input sizes below. For example, for small batch sizes (== 2) and relatively short sequences, 2.6.3 is even 2x faster for me in the forward pass.
My setup: 4070 Laptop (CUDA 12) and A100 (CUDA 11), Torch 2.4. Both flash-attn versions were installed via pip install directly from PyPI. Below are results measured with a custom Python script with proper CUDA synchronization.
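The timing script itself is not shown here, so the following is only a minimal sketch of what synchronized timing typically looks like. The helper `time_fn` and its parameters are hypothetical; the `sync` argument defaults to a no-op so the sketch runs anywhere, and on a GPU one would pass a real synchronization function such as `torch.cuda.synchronize`.

```python
import time
from typing import Callable


def time_fn(
    fn: Callable[[], object],
    sync: Callable[[], None] = lambda: None,
    warmup: int = 10,
    iters: int = 100,
) -> float:
    """Return the mean wall-clock time per call of fn, in milliseconds.

    sync is called before starting and before stopping the clock so that
    asynchronously launched work is included in the measurement.
    """
    for _ in range(warmup):  # warm-up calls are excluded from the timing
        fn()
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()
    return (time.perf_counter() - start) * 1e3 / iters


if __name__ == "__main__":
    # CPU-only stand-in workload; replace with the attention call under test.
    ms = time_fn(lambda: sum(range(1000)), warmup=5, iters=50)
    print(f"mean time per call: {ms:.4f} ms")
```

Note that a wall-clock measurement like this includes host-side Python overhead as well as kernel time, so it cannot by itself tell the two apart; a profiler trace is needed for that.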
Minimal instructions to replicate:
# set up environment for flash attention, install torch
pip install loguru
pip install pytest
pip install flash-attn==2.6.3 --no-build-isolation
pytest -s test_min_example.py
pip uninstall flash-attn==2.6.3
pip install flash-attn==2.7.0.post2 --no-build-isolation
pytest -s test_min_example.py
test_min_example.py
Could you please help me understand what might be the source of these timing differences? When going through the source code, it seems to me that the kernel code is the same, the CUTLASS submodule repo pointer is the same, and the only changes are in the API in C++/Python, which relate to head, head_size_og, and padding. Also, my embedding sizes and head numbers are divisible by 8.