
Add gguf support for bloom #33473

Merged

Conversation

VladOS95-cyber (Contributor)

What does this PR do?

Add Bloom GGUF loading support

Fixes # (issue)

Before submitting

Who can review?

Regarding the task: @SunMarc @LysandreJik @ArthurZucker.

Review threads (resolved): src/transformers/convert_slow_tokenizer.py (3), src/transformers/integrations/ggml.py (1)
@VladOS95-cyber (Contributor, Author) commented Sep 17, 2024

Hi @SunMarc @LysandreJik @ArthurZucker! This PR is ready for review. There is one thing that looks odd to me. After dequantization and loading the model, it generates a wrong sequence, not what is expected when using a normal pretrained model. Instead of tensor([[59414, 15, 473, 3370, 4026, 427, 5894, 861, 473, 912, 5636]]), it generates something like [[59414, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]]. I cannot find the root cause of this problem; I've already checked the mapping and so on several times, and it should be correct. It looks like the weights are not correct, but I am not sure...
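(For context, a minimal sketch of the kind of load-and-generate check described above, using transformers' gguf_file loading path; the repo id, GGUF file name, and prompt are illustrative placeholders, not the exact ones used in this PR.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative placeholders: any Bloom repo that ships a GGUF file would do.
model_id = "some-org/bloom-560m-gguf"
gguf_file = "bloom-560m.q8_0.gguf"

# Passing gguf_file makes transformers dequantize the GGUF tensors into a regular torch model.
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file, torch_dtype=torch.float16)

inputs = tokenizer("Hello, I just want to say", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8)
print(out)  # token ids; the repeated-token pattern described above showed up at this point
print(tokenizer.decode(out[0], skip_special_tokens=True))
```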

@SunMarc (Member) commented Sep 17, 2024

> This PR is ready for review. There is one thing that looks odd to me. After dequantization and loading the model, it generates a wrong sequence, not what is expected when using a normal pretrained model. Instead of tensor([[59414, 15, 473, 3370, 4026, 427, 5894, 861, 473, 912, 5636]]), it generates something like [[59414, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]]. I cannot find the root cause of this problem; I've already checked the mapping and so on several times, and it should be correct. It looks like the weights are not correct, but I am not sure...

Since the model was quantized, it is normal that it does not behave exactly like the normal pretrained model. Dequantization doesn't recover the precision of the original model. Could you check that it behaves similarly to the original model that was converted to GGUF in fp16 or full precision? That way we have a baseline to compare the model loaded from the GGUF file against.

@VladOS95-cyber (Contributor, Author) commented Sep 18, 2024

> This PR is ready for review. There is one thing that looks odd to me. After dequantization and loading the model, it generates a wrong sequence, not what is expected when using a normal pretrained model. Instead of tensor([[59414, 15, 473, 3370, 4026, 427, 5894, 861, 473, 912, 5636]]), it generates something like [[59414, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]]. I cannot find the root cause of this problem; I've already checked the mapping and so on several times, and it should be correct. It looks like the weights are not correct, but I am not sure...

> Since the model was quantized, it is normal that it does not behave exactly like the normal pretrained model. Dequantization doesn't recover the precision of the original model. Could you check that it behaves similarly to the original model that was converted to GGUF in fp16 or full precision? That way we have a baseline to compare the model loaded from the GGUF file against.

Hi @SunMarc. Still odd behaviour, no matter which format I take. I debugged the model_state loading and parsing logic several times and compared it with a working one (like llama). Everything looks good from this perspective, and it seems that the weights, params, config and so on are loaded correctly into the model. I'm just worried about the quantized data itself.

@VladOS95-cyber (Contributor, Author) commented Sep 23, 2024

Hi @SunMarc! I finally found the issue and the reason for the bloom model's strange behaviour. It is caused by https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py (L972-998): that algorithm reshapes the qkv data, which is why we ended up with completely different weights. I implemented a reverse reshaping algorithm in modeling_gguf_pytorch_utils.py to put them back, and after that everything is OK and the model outputs the expected values. Please take a look at my changes; everything should be correct now.
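(For context, a rough sketch of the kind of reverse reshape described here, for Bloom's fused query_key_value weight. llama.cpp's converter regroups the per-head interleaved q/k/v rows into three contiguous Q, K, V blocks when writing the GGUF file, so loading back into transformers' Bloom layout means undoing that regrouping. The helper name and exact array handling below are illustrative assumptions, not necessarily the code added to modeling_gguf_pytorch_utils.py.)

```python
import numpy as np

def reverse_reshape_qkv_weight(weights: np.ndarray, n_head: int, n_embed: int) -> np.ndarray:
    """Undo llama.cpp's Bloom qkv regrouping (illustrative sketch).

    The GGUF tensor stores the fused weight as three contiguous blocks [Q; K; V]
    of shape (3 * n_embed, n_embed), while transformers' Bloom expects per-head
    interleaving [q_0; k_0; v_0; q_1; k_1; v_1; ...].
    """
    head_dim = n_embed // n_head
    q, k, v = np.array_split(weights, 3, axis=0)        # each (n_head * head_dim, n_embed)
    q = q.reshape(n_head, head_dim, n_embed)
    k = k.reshape(n_head, head_dim, n_embed)
    v = v.reshape(n_head, head_dim, n_embed)
    qkv = np.stack([q, k, v], axis=1)                    # (n_head, 3, head_dim, n_embed)
    return qkv.reshape(n_head * 3 * head_dim, n_embed)   # Bloom's fused per-head layout
```

The fused qkv bias would need the analogous treatment along its single dimension.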

@SunMarc (Member) left a comment

Nice job finding the issue! Could you add a final test to check that the fp16 model from transformers and the fp16 model from GGUF share the same weights? That would be a nice test to check that the conversion was done correctly! Can you also check that the quantized model doesn't return gibberish?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@VladOS95-cyber (Contributor, Author) commented Sep 23, 2024

> Nice job finding the issue! Could you add a final test to check that the fp16 model from transformers and the fp16 model from GGUF share the same weights? That would be a nice test to check that the conversion was done correctly! Can you also check that the quantized model doesn't return gibberish?

Yes, sure, I added one more test to compare weights. But what about gibberish? Should I implement an additional test to check a particular generation?
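(For context, a rough sketch of the kind of fp16 weight-comparison test being discussed; the GGUF repo id, file name, and dtype handling are illustrative assumptions rather than the exact test added in the PR.)

```python
import torch
from transformers import AutoModelForCausalLM

def test_bloom_fp16_weights_match_gguf():
    # The original checkpoint and a hypothetical fp16 GGUF export of it.
    original_id = "bigscience/bloom-560m"
    gguf_id = "some-org/bloom-560m-gguf"     # hypothetical repo hosting the GGUF file
    gguf_file = "bloom-560m.fp16.gguf"       # hypothetical file name

    original_model = AutoModelForCausalLM.from_pretrained(original_id, torch_dtype=torch.float16)
    converted_model = AutoModelForCausalLM.from_pretrained(gguf_id, gguf_file=gguf_file)

    converted_state_dict = converted_model.state_dict()
    for name, original_params in original_model.state_dict().items():
        assert name in converted_state_dict, f"missing tensor: {name}"
        # Same shape and, up to fp16 rounding, the same values.
        torch.testing.assert_close(
            original_params, converted_state_dict[name].to(original_params.dtype)
        )
```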

@SunMarc (Member) commented Sep 24, 2024

> Yes, sure, I added one more test to compare weights. But what about gibberish? Should I implement an additional test to check a particular generation?

Just add a simple generation check for a q4 quant, for example, just like what was done for other models!

@VladOS95-cyber (Contributor, Author)

> Yes, sure, I added one more test to compare weights. But what about gibberish? Should I implement an additional test to check a particular generation?

> Just add a simple generation check for a q4 quant, for example, just like what was done for other models!

But this kind of test is already added; it is called test_bloom_f16. Or is it not enough?

@SunMarc (Member) commented Sep 25, 2024

> But this kind of test is already added; it is called test_bloom_f16. Or is it not enough?

But we need one for a quantized model, since most users use these models. With fp16, we can't be sure that the dequantization step was done correctly.

@VladOS95-cyber (Contributor, Author)

> But this kind of test is already added; it is called test_bloom_f16. Or is it not enough?

> But we need one for a quantized model, since most users use these models. With fp16, we can't be sure that the dequantization step was done correctly.

Hello! OK, got it, I just added a q8_0 test as well. Would that be sufficient?
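(For context, a rough sketch of the kind of q8_0 generation check being discussed; the repo id, GGUF file name, and prompt are illustrative placeholders, and a real test would compare the decoded text against a reference continuation recorded from a known-good run.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def test_bloom_q8_0_generation():
    # Illustrative placeholders for a Bloom checkpoint quantized to q8_0 GGUF.
    gguf_id = "some-org/bloom-560m-gguf"
    gguf_file = "bloom-560m.q8_0.gguf"

    tokenizer = AutoTokenizer.from_pretrained(gguf_id, gguf_file=gguf_file)
    model = AutoModelForCausalLM.from_pretrained(gguf_id, gguf_file=gguf_file, torch_dtype=torch.float16)

    inputs = tokenizer("Hello", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=10)
    text = tokenizer.decode(out[0], skip_special_tokens=True)

    # A real test would assert equality against a recorded reference string
    # (e.g. assert text == EXPECTED_TEXT); here we only check the output isn't degenerate.
    assert text.startswith("Hello") and len(set(out[0].tolist())) > 2
```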

@LysandreJik (Member) left a comment

Awesome, thank you @VladOS95-cyber!

@LysandreJik merged commit 9d200cf into huggingface:main Sep 27, 2024
22 of 24 checks passed
BenjaminBossan pushed a commit to BenjaminBossan/transformers that referenced this pull request Sep 30, 2024
* add bloom arch support for gguf

* apply format

* small refactoring, bug fix in GGUF_TENSOR_MAPPING naming

* optimize bloom GGUF_TENSOR_MAPPING

* implement reverse reshaping for bloom gguf

* add qkv weights test

* add q_8 test for bloom
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Oct 2, 2024
* add bloom arch support for gguf

* apply format

* small refactoring, bug fix in GGUF_TENSOR_MAPPING naming

* optimize bloom GGUF_TENSOR_MAPPING

* implement reverse reshaping for bloom gguf

* add qkv weights test

* add q_8 test for bloom