
Application to T5 / UL2 family #8

Open
iiLaurens opened this issue Mar 8, 2023 · 7 comments

Comments

@iiLaurens

Do you expect this to work for T5 architecture (and consequently, the very similar UL2) family? And if not, what do you suspect would be an issue and do you expect that some adjustments need to be made?

Google recently released Flan-UL2 which is 20B parameters in size. GPTQ could be a real life-saver here.

@efrantar
Member

Hi,

in principle, we would expect GPTQ to work on most models. However, applying it to T5 models will require some additional implementation work, since these are, I think, encoder-decoder models, which means that a memory- and compute-efficient GPTQ implementation (similar to the current one in the repository) would probably require sequentially traversing both the encoder and the decoder branch in parallel. See opt_sequential() or bloom_sequential() in opt.py and bloom.py for how we have implemented this sequential pass for decoder-only models.
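For readers who have not opened opt.py or bloom.py, here is a minimal sketch of what such a sequential, layer-by-layer pass looks like. This is not the repository's actual code: GPTQ below stands in for the per-layer quantizer, find_linear_layers and sequential_quantize are hypothetical helpers, and each block is assumed to map a single tensor to a single tensor (real transformer blocks take attention masks and return tuples).

```python
import torch
import torch.nn as nn


def find_linear_layers(module):
    # All nn.Linear submodules of a block (attention and MLP projections).
    return {name: m for name, m in module.named_modules() if isinstance(m, nn.Linear)}


@torch.no_grad()
def sequential_quantize(layers, layer_inputs, wbits=4):
    # `layers`: the stack of transformer blocks (e.g. an encoder or decoder branch).
    # `layer_inputs`: calibration activations captured at the entry of that stack.
    # `GPTQ` is a placeholder for the repository's per-layer quantizer; the exact
    # constructor and method names may differ.
    inps = layer_inputs
    for layer in layers:
        linears = find_linear_layers(layer)
        quantizers = {name: GPTQ(mod) for name, mod in linears.items()}

        # 1) Run the calibration inputs through the layer once, letting hooks
        #    accumulate the per-linear second-order statistics (Hessians).
        handles = [
            mod.register_forward_hook(
                lambda m, inp, out, n=name: quantizers[n].add_batch(inp[0], out)
            )
            for name, mod in linears.items()
        ]
        for x in inps:
            layer(x)
        for h in handles:
            h.remove()

        # 2) Quantize each linear weight against its accumulated Hessian.
        for q in quantizers.values():
            q.quantize(wbits=wbits)

        # 3) The (now quantized) layer's outputs become the next layer's inputs,
        #    so quantization error propagates realistically down the stack.
        inps = [layer(x) for x in inps]
    return layers
```

The key point is that only one layer's statistics live in memory at a time, which is what keeps the pass memory- and compute-efficient.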

@qwopqwop200

> Do you expect this to work for T5 architecture (and consequently, the very similar UL2) family? And if not, what do you suspect would be an issue and do you expect that some adjustments need to be made?
>
> Google recently released Flan-UL2 which is 20B parameters in size. GPTQ could be a real life-saver here.

I tried GPTQ quantization of FLAN-T5 and it seems to work successfully.
https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/t5
Additionally, I confirmed that FLAN-UL2 also works. I haven't done exact benchmarks, but it works pretty impressively.

@iiLaurens
Author

> Do you expect this to work for T5 architecture (and consequently, the very similar UL2) family? And if not, what do you suspect would be an issue and do you expect that some adjustments need to be made?
>
> Google recently released Flan-UL2 which is 20B parameters in size. GPTQ could be a real life-saver here.
>
> I tried GPTQ quantization of FLAN-T5 and it seems to work successfully. https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/t5 Additionally, I confirmed that FLAN-UL2 also works. I haven't done exact benchmarks, but it works pretty impressively.

That's great to hear! Did you need to do anything in particular to get this to work? Did you just run GPTQ on the encoder and decoder separately, as @efrantar seemed to suggest?

@johnrobinsn

@qwopqwop200 this is great! Thanks much. I was able to quantize flan-t5-small, but ran into this error when trying to quantize flan-ul2:

torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 2985 is not positive-definite).

This is the command I used:

python t5.py google/flan-ul2 wikitext2 --wbits 4 --act-order --groupsize 128 --save ul2-4bit-128g.pt

Any ideas?

@efrantar
Member

efrantar commented Apr 3, 2023

This is a numerics error due to a layer Hessian not being positive-definite. You could try applying higher dampening (--percdamp) or using more calibration data (--nsamples) to make the Hessian more clearly positive-definite.
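As an illustration (not from the thread), and assuming the t5 branch's t5.py exposes the same --percdamp and --nsamples flags as the other scripts in the repository, the retry could be as simple as re-running the original command with both raised; the exact values here are guesses that may need tuning:

python t5.py google/flan-ul2 wikitext2 --wbits 4 --act-order --groupsize 128 --percdamp 0.05 --nsamples 256 --save ul2-4bit-128g.pt

Intuitively, the dampening adds a small multiple of the average Hessian diagonal back onto the diagonal before the Cholesky factorization, which pushes near-singular layer Hessians back into positive-definite territory, while more calibration samples make the accumulated Hessian better conditioned in the first place.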

@johnrobinsn

@efrantar, thanks for your feedback earlier in the thread.

I see a fairly significant drop-off in performance when attempting 4-bit quantization of t5* models with the https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/t5 branch, compared to int8 quantization.

Some details of the performance gap are captured here: qwopqwop200/GPTQ-for-LLaMa#157 (comment)

I am trying to understand your earlier comment in this thread and the quantization code in the referenced repo, in the hope of improving the int4 quantization performance on these encoder-decoder models.

This repo appears to quantize the encoder layers (layer by layer) and then the decoder layers (layer by layer), one after the other. You mentioned needing to do the encoder and decoder quantization in parallel (maybe I'm misunderstanding); can you help me understand this point a bit more, and why they might need to be done in parallel?

Also, any other insight would be appreciated.
Thanks!

@efrantar
Member

I had a slightly different encoder-decoder architecture in mind when I suggested the parallel processing of branches; for T5 specifically, quantizing first the encoder and then the decoder should be correct.

We are also currently looking at some encoder-decoder models in the context of ongoing research projects; if we find anything that could be relevant for quantizing T5 with GPTQ, I will post an update here.
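To make the confirmed order concrete, here is a minimal sketch assuming the Hugging Face T5 module layout (model.encoder.block, model.decoder.block, embed_tokens, and stack outputs exposing .last_hidden_state). sequential_quantize is the hypothetical helper sketched earlier in this thread, and the calibration tensors are illustrative names rather than anything from the repo.

```python
import torch


@torch.no_grad()
def quantize_t5(model, calib_input_ids, calib_decoder_input_ids):
    # 1) Quantize the encoder stack first, layer by layer, using the token
    #    embeddings of the calibration inputs as the stack's entry activations.
    enc_inps = [model.encoder.embed_tokens(ids) for ids in calib_input_ids]
    sequential_quantize(model.encoder.block, enc_inps)

    # 2) Run the now-quantized encoder to obtain the hidden states the decoder
    #    will actually see at inference time.
    enc_hidden = [model.encoder(input_ids=ids).last_hidden_state
                  for ids in calib_input_ids]

    # 3) Quantize the decoder stack next. Its cross-attention consumes
    #    enc_hidden, so a real implementation must pass these states into each
    #    decoder block's forward call during calibration (omitted from the
    #    simplified sequential_quantize sketch above).
    dec_inps = [model.decoder.embed_tokens(ids) for ids in calib_decoder_input_ids]
    sequential_quantize(model.decoder.block, dec_inps)

    return model
```

Calibrating the decoder against the already-quantized encoder's outputs, rather than the full-precision ones, keeps the calibration distribution consistent with what the quantized model will see at inference time.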
