flash-attention-3 #33522
Conversation
Force-pushed from b6afd63 to 0976545.
Force-pushed from 5aa58ab to 7ae105e.
All models supporting FAv2 should now have FAv3 classes.
Force-pushed from 7ae105e to bd6e9e7.
All occurrences of … Documentation and tests will be done next.
Force-pushed from bd6e9e7 to fbf9bec.
Some documentation and all tests are updated for FAv3. I'll run the tests on an H100 instance, then mark this as ready for (initial) review.
Force-pushed from fbf9bec to 27edb62.
The FAv3 tests generally fail due to the small model configurations used. Instead, I've tested the majority of models via their examples, with a few exceptions like Gemma and Mistral, which I need to request access to, and particularly large models such as Jamba, which my instance doesn't have space to download. All of the tested models with examples are OK, with the exception of …
However, this error also occurs with … StableLM models are currently not supported due to `num_attention_heads`/… I've attached test reports; the numerical accuracy failures may need special care as per …
Wowowo super nice initiative, thanks! 🔥
IMO since we already abstracted the flash attention API, let's try to keep it in `flashAttentionLlama`, but maybe support `flash_attention_3` in the `attn_implementation`, for example! WDYT?
```python
value_states = value_states.to(target_dtype)

# TODO: get `use_fp8` to here, add attention_kwargs or something
attn_output = _flash_attention_3_forward(
```
Hey! As far as I can tell, the only diff is the forward function, right?
Yeah, the difference between the FlashAttention2 classes and FlashAttention3 is just the forward function, plus the lack of dropout/sliding window/softcap in FAv3. As you suggest, we could instead support v3 in the existing classes, using `config.attn_implementation` to select the appropriate function; happy to make this change if you think that's better.
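A rough sketch of that dispatch idea (hypothetical names and stand-in functions, not the actual transformers code):

```python
# Hypothetical sketch: keep a single flash-attention class and pick the
# kernel wrapper from config._attn_implementation, as discussed above.

def _flash_attention_forward(*args, **kwargs):
    # Stand-in for the real FAv2 forward wrapper.
    return "fa2"

def _flash_attention_3_forward(*args, **kwargs):
    # Stand-in for the real FAv3 forward wrapper.
    return "fa3"

class DummyConfig:
    def __init__(self, attn_implementation):
        self._attn_implementation = attn_implementation

def select_flash_forward(config):
    """Return the forward callable matching the configured implementation."""
    if getattr(config, "_attn_implementation", "") == "flash_attention_3":
        return _flash_attention_3_forward
    return _flash_attention_forward
```

This keeps a single attention class per model and confines the v2/v3 difference to one branch.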
```python
return (query, key, value, indices_q, (cu_seq_lens, cu_seq_lens), (max_length, max_length))


def _flash_attention_3_forward(
```
Let's maybe replace `flash_attention_forward` with this one when Flash Attention 3 is available, WDYT?
AFAIK FAv3 will be for Hopper GPUs only.
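A capability gate along those lines could be sketched as follows (hypothetical helper name; Hopper corresponds to CUDA compute capability 9.x):

```python
def is_fa3_capable(major: int, minor: int) -> bool:
    """Hypothetical capability gate: FAv3 targets Hopper GPUs, whose CUDA
    compute capability is 9.x. In a real check the (major, minor) pair
    would come from torch.cuda.get_device_capability()."""
    return major == 9
```

So a check like this would admit an H100 (9.0) but reject an A100 (8.0).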
Force-pushed from 27edb62 to ba268ef.
I've replaced the … I've renamed the … Note that while I was checking all … In … we could simplify the changes to … Checks like …
Force-pushed from 4473129 to 4a34da8.
Sliding window is now supported.
Very great work 🚀 just a passerby who looked into the code :)
I'd be very pro this. It kinda looks misleading now with …
Personal preference: I'd go a step further and move the …
Seems reasonable to me. Makes the code less verbose too. Lastly, …
Edit: Maybe raising a ValueError / warning if dropout or similar values are passed would also be nice, since right now they're just silently ignored.
Force-pushed from 4a34da8 to 9bcbe3f.
I've removed … I'll wait for input from a maintainer on changing the checks (…).
I assume this won't be merged until FAv3 is out of beta, at which point dropout and softcap should hopefully be supported; if not, then I agree we should add an error/warning if they're used with FAv3.
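If such a guard were added, a minimal sketch might look like this (hypothetical function name; the real hook would sit wherever FAv3 kwargs are dropped):

```python
import warnings

def check_fa3_kwargs(dropout: float = 0.0, softcap=None):
    """Hypothetical guard: warn about arguments the FAv3 beta does not
    support, instead of silently ignoring them."""
    unsupported = []
    if dropout != 0.0:
        unsupported.append("dropout")
    if softcap is not None:
        unsupported.append("softcap")
    if unsupported:
        warnings.warn(
            "Flash Attention 3 does not support "
            + ", ".join(unsupported)
            + "; these arguments will be ignored."
        )
    return unsupported
```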
Force-pushed from 9bcbe3f to 2146a74.
What does this PR do?
This PR adds preliminary support for Flash Attention 3.
- `is_flash_attn_3_available` required a workaround in `_is_package_available`, as `package_version = importlib.metadata.version(pkg_name)` fails with `importlib.metadata.PackageNotFoundError: No package metadata was found for flash_attn_interface`.
- `_supports_flash_attn_3` and `_check_and_enable_flash_attn_3` added to `modeling_utils.py`, a near duplicate of `_check_and_enable_flash_attn_2`.
- `_flash_attention_3_forward` implemented in `modeling_flash_attention_3_utils.py`. `_flash_attention_forward` is now a unified interface for FAv2 and FAv3, controlled by `use_flash_attn_3`, which is passed from FlashAttention classes based on `config._attn_implementation == "flash_attention_3"`.
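The `_is_package_available` workaround mentioned above could look roughly like this (a sketch, not the actual transformers implementation; the premise is that `flash_attn_interface` is importable but ships no distribution metadata):

```python
import importlib.metadata
import importlib.util

def is_package_available_sketch(pkg_name: str):
    """Sketch of the described workaround: when a package is importable but
    has no distribution metadata, importlib.metadata.version() raises
    PackageNotFoundError, so fall back to "N/A" instead of failing."""
    exists = importlib.util.find_spec(pkg_name) is not None
    version = "N/A"
    if exists:
        try:
            version = importlib.metadata.version(pkg_name)
        except importlib.metadata.PackageNotFoundError:
            pass  # importable, but no metadata: keep "N/A"
    return exists, version
```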
- FAv3 lacks support for dropout, sliding window (edit: sliding window is now supported) and softcap, and in FAv3 `flash_attn_func`/`flash_attn_varlen_func` return a tuple.
- The `attention_mask is not None` and `position_ids is not None` paths depend on `_upad_input` and `prepare_fa2_from_position_ids` respectively; these are duplicated from `modeling_flash_attention_utils.py` and are not included in the FAv3 package, so FAv3 depends on `flash_attn`. This is reflected in `is_flash_attn_3_available`, which checks `is_flash_attn_2_available`.
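Since, as noted above, the FAv3 functions return a tuple where FAv2 returns the output tensor directly, a small shim can keep downstream code uniform (a hypothetical sketch, not the PR's actual code):

```python
def normalize_fa_output(result):
    """FAv2's flash_attn_func returns the attention output directly, while
    (per the description above) FAv3's flash_attn_func/flash_attn_varlen_func
    return a tuple whose first element is the output. This shim hides the
    difference from callers."""
    return result[0] if isinstance(result, tuple) else result
```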
- … `FLASH_ATTENTION_3_FP8` for this purpose; we can probably add something like `attention_kwargs` to model forwards to control this, or maybe another `_attn_implementation` type, `flash_attention_3_fp8`; best to get reviews and consensus on the best way to do it first.[1]
- `flash_attention_3` is added to Llama with `LlamaFlashAttention3`, similar to `LlamaFlashAttention2` but with unsupported options like dropout and sliding window removed. See comment below.
- `_update_causal_mask` is updated in various models due to `utils/check_copies.py`, and `_supports_flash_attn_3` is already added to some other models for the same reason.

[1] Edit: added to other models, see comment below.

Fixes #33373
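As an aside, the `FLASH_ATTENTION_3_FP8` environment-variable control mentioned in the description might be read like this (the accepted truthy values here are an assumption, not the PR's actual parsing):

```python
import os

def fa3_fp8_enabled() -> bool:
    """Hypothetical reading of the FLASH_ATTENTION_3_FP8 environment
    variable; which values count as truthy is an assumption."""
    return os.environ.get("FLASH_ATTENTION_3_FP8", "0").lower() in ("1", "true", "yes")
```

An `attention_kwargs` argument or a dedicated `flash_attention_3_fp8` implementation string would avoid this kind of out-of-band toggle.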
Todo

- Test `attention_mask is not None` and `position_ids is not None` paths
- Implement FlashAttention3 classes for other models. Done.
- Documentation. Partly done.

Notes
Llama tested on H100 SXM with:
(shortened) responses
FP16:
FP8:
All other models will be tested after I've finished adding FlashAttention3 classes. Edit: other models have been tested, see comment below.

Who can review?
cc @ArthurZucker