
Fix hardcoded float dtypes in DeBERTa model, which caused multiple RuntimeErrors in bfloat16 #35336

Open · wants to merge 1 commit into base: main
Conversation

bauwenst

What does this PR do?

Fix #35332 by removing any hardcoded float dtypes.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker

@Rocketknight1
Member

Hi @bauwenst, I think the bias terms in several models, as well as sensitive computations like attention and RoPE, are intended to be kept in float32, even when the overall model is bfloat16.

I agree that there's a bug here: the hidden states are forced to float32 by the bias addition. But I think the right solution is simply to cast the output back to self.dtype afterwards, not to reduce the bias precision, and definitely not to reduce the precision in RoPE/attention!
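
Roughly this pattern, as a standalone sketch (not the exact DeBERTa code, just the dtype handling I have in mind):

```python
import torch

hidden_states = torch.randn(2, 4, dtype=torch.bfloat16)  # model running in bfloat16
bias = torch.zeros(4, dtype=torch.float32)                # bias kept in float32

out = hidden_states + bias           # the addition promotes the result to float32
out = out.to(hidden_states.dtype)    # cast back to the model dtype afterwards
print(out.dtype)                     # torch.bfloat16
```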

@bauwenst
Author

I think the right solution is just to cast the output back to self.dtype afterwards, and not reduce the bias precision, and definitely not reduce the precision in RoPE/Attention!

@Rocketknight1 Isn't the point of training in bfloat16 that operations are done with 16 bits of precision? What you have laid out is a case for not using bfloat16 at all, because it does indeed reduce precision in computations and in weights, but that's expected, no? Do you have a source showing that it is particularly important for the bias to have more bits of precision than the rest of the model, rather than e.g. the dense weights? To take it to an extreme: you could keep your whole model in float32 and only cast down to bfloat16 when it is time to output hidden states. I don't see why the biases are special.

and sensitive computations like Attention and RoPE, are intended to be kept in float32, even when the overall model is bfloat16.

I doubt that this is intended in DeBERTa. When you let the bias follow the dtype of the rest of the model, everything stays bfloat16 in the attention computation. If it were intended to do attention at higher precision, surely an explicit cast would have been put in, rather than accidentally making the query vectors (but not the key vectors) float32 as a byproduct of a simple addition?
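
To make that "byproduct of a simple addition" concrete, here is a standalone toy example (made-up tensor names mirroring the bias addition, not the actual DeBERTa code):

```python
import torch

query_states = torch.randn(2, 4, dtype=torch.bfloat16)
key_states = torch.randn(2, 4, dtype=torch.bfloat16)
q_bias = torch.zeros(4, dtype=torch.float32)  # hardcoded float32 bias

query_states = query_states + q_bias  # PyTorch type promotion upcasts the result
print(query_states.dtype, key_states.dtype)  # torch.float32 torch.bfloat16
```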

@Rocketknight1
Member

Rocketknight1 commented Dec 19, 2024

Hi @bauwenst, float32 biases are common for two reasons:

  1. They are very small and do not take part in any multiplication operations, so leaving them in float32 does not affect performance and only has a tiny effect on memory usage
  2. Precision errors in biases have a much larger impact than precision errors in weights. Precision errors in weights tend to "cancel out" during matrix multiplication, and each individual weight is multiplied by a (usually small) input activation, which means its effect on the output value is also small. Biases are directly added to the final total, so their impact can be very large.

If you use quantization libraries like llama.cpp you will see this in action - weight matrices for feedforward layers lose the most precision, matrices in attention layers get higher precision, and small tensors that directly modify activations (like biases and LayerNorm weight vectors) are kept in full float32 precision. It is also common for many models in transformers to upcast to float32 for LayerNorm or attention computations.
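
As a toy illustration of point 2 (standalone PyTorch, not model code): bfloat16 has only an 8-bit mantissa, so a small bias added to a large activation in bfloat16 can be rounded away entirely, while a float32 bias survives.

```python
import torch

activation = torch.tensor(256.0)  # float32 by default
bias = torch.tensor(0.25)

# Near 256, representable bfloat16 values are 2.0 apart, so the bias is lost:
print((activation.bfloat16() + bias.bfloat16()).item())  # 256.0
# In float32 the bias contribution is preserved:
print((activation + bias).item())                        # 256.25
```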

@bauwenst
Author

@Rocketknight1 Okay, that makes some sense; do note, though, that DeBERTa does not seem to be one of the models that upcast to higher precision for attention.

I have two questions in that case:

  1. Is there a model in transformers that has fixed-dtype biases but does not suffer from the type errors fixed in my commit? That would then be the reference implementation for what to do in DeBERTa.
  2. Should DeBERTa explicitly upcast in its attention computation, given that this is a big deal?

@Rocketknight1
Member

Hi @bauwenst, regarding 1, I think you can just add the bias term and then cast the float32 answer to self.dtype. I don't know if there's a good reference model for this. ModernBERT is the most up-to-date MLM model in transformers, but it just uses the built-in bias in nn.Linear.

As for question 2, I'm afraid I don't have an answer! In general, models are added to transformers with the goal of matching their original implementation exactly. In the case of DeBERTa, the model was written four years ago, before bfloat16 training was common, and it uses an unusual attention mechanism, so the "right" way to handle it at reduced precision is probably an open research question, lol. If you check the codebase, most modern models do explicitly upcast to float32 for the softmax computation, because softmax can be numerically unstable, but there is some variation in whether they upcast other parts of attention.
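
For reference, the softmax upcast pattern looks roughly like this (a generic sketch, not copied from any particular model):

```python
import torch

def attention_probs(scores: torch.Tensor) -> torch.Tensor:
    # Run the numerically sensitive softmax in float32, then cast back
    # to the dtype of the incoming attention scores.
    return torch.nn.functional.softmax(scores.float(), dim=-1).to(scores.dtype)

scores = torch.randn(1, 8, 16, 16, dtype=torch.bfloat16)
print(attention_probs(scores).dtype)  # torch.bfloat16
```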

Development

Successfully merging this pull request may close these issues.

DeBERTa's DisentangledSelfAttention hardcodes float dtype, which causes bfloat16 overflow error