
Application to T5 / UL2 family #8

Open
iiLaurens opened this issue Mar 8, 2023 · 7 comments

Comments

@iiLaurens

Do you expect this to work for T5 architecture (and consequently, the very similar UL2) family? And if not, what do you suspect would be an issue and do you expect that some adjustments need to be made?

Google recently released Flan-UL2 which is 20B parameters in size. GPTQ could be a real life-saver here.

@efrantar
Member

Hi,

in principle, we would expect GPTQ to work on most models. However, applying it to T5 models will require some additional implementation work, since these are, I think, encoder-decoder models, which means that a memory- and compute-efficient GPTQ implementation (similar to the current one in the repository) would probably require sequentially traversing both the encoder and the decoder branch in parallel. See opt_sequential() or bloom_sequential() in opt.py and bloom.py for how we have implemented this sequential pass for decoder-only models.
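For readers who have not opened opt.py or bloom.py, here is a minimal sketch of what such a sequential, layer-by-layer pass looks like. This is not the repository's actual code: GPTQ below stands in for the per-layer quantizer, find_linear_layers and sequential_quantize are hypothetical helpers, and each block is assumed to map a single tensor to a single tensor (real transformer blocks take attention masks and return tuples).

```python
import torch
import torch.nn as nn


def find_linear_layers(module):
    # All nn.Linear submodules of a block (attention and MLP projections).
    return {name: m for name, m in module.named_modules() if isinstance(m, nn.Linear)}


@torch.no_grad()
def sequential_quantize(layers, layer_inputs, wbits=4):
    # `layers`: the stack of transformer blocks (e.g. an encoder or decoder branch).
    # `layer_inputs`: calibration activations captured at the entry of that stack.
    # `GPTQ` is a placeholder for the repository's per-layer quantizer; the exact
    # constructor and method names may differ.
    inps = layer_inputs
    for layer in layers:
        linears = find_linear_layers(layer)
        quantizers = {name: GPTQ(mod) for name, mod in linears.items()}

        # 1) Run the calibration inputs through the layer once, letting hooks
        #    accumulate the per-linear second-order statistics (Hessians).
        handles = [
            mod.register_forward_hook(
                lambda m, inp, out, n=name: quantizers[n].add_batch(inp[0], out)
            )
            for name, mod in linears.items()
        ]
        for x in inps:
            layer(x)
        for h in handles:
            h.remove()

        # 2) Quantize each linear weight against its accumulated Hessian.
        for q in quantizers.values():
            q.quantize(wbits=wbits)

        # 3) The (now quantized) layer's outputs become the next layer's inputs,
        #    so quantization error propagates realistically down the stack.
        inps = [layer(x) for x in inps]
    return layers
```

The key point is that only one layer's statistics live in memory at a time, which is what keeps the pass memory- and compute-efficient.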

@qwopqwop200

> Do you expect this to work for T5 architecture (and consequently, the very similar UL2) family? And if not, what do you suspect would be an issue and do you expect that some adjustments need to be made?
>
> Google recently released Flan-UL2 which is 20B parameters in size. GPTQ could be a real life-saver here.

I tried GPTQ quantization of FLAN-T5 and it seems to work successfully.
https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/t5
Additionally, I confirmed that FLAN-UL2 also works. I haven't done exact benchmarks, but it works pretty impressively.

@iiLaurens
Author

> Do you expect this to work for T5 architecture (and consequently, the very similar UL2) family? And if not, what do you suspect would be an issue and do you expect that some adjustments need to be made?
>
> Google recently released Flan-UL2 which is 20B parameters in size. GPTQ could be a real life-saver here.
>
> I tried GPTQ quantization of FLAN-T5 and it seems to work successfully. https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/t5 Additionally, I confirmed that FLAN-UL2 also works. I haven't done exact benchmarks, but it works pretty impressively.

That's great to hear! Did you need to do anything in particular to get this to work? Did you just run GPTQ on the encoder and decoder separately, as @efrantar seemed to suggest?

@johnrobinsn

@qwopqwop200 this is great! Thanks much. I was able to quantize flan-t5-small, but ran into this error when trying to quantize flan-ul2:

torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 2985 is not positive-definite).

This is the command I used:

python t5.py google/flan-ul2 wikitext2 --wbits 4 --act-order --groupsize 128 --save ul2-4bit-128g.pt

Any ideas?

@efrantar
Member

efrantar commented Apr 3, 2023

This is a numerics error due to a layer Hessian not being positive-definite. You could try applying higher dampening (--percdamp) or using more calibration data (--nsamples) to make the Hessian more clearly positive-definite.
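As an illustration (not from the thread), and assuming the t5 branch's t5.py exposes the same --percdamp and --nsamples flags as the other scripts in the repository, the retry could be as simple as re-running the original command with both raised; the exact values here are guesses that may need tuning:

python t5.py google/flan-ul2 wikitext2 --wbits 4 --act-order --groupsize 128 --percdamp 0.05 --nsamples 256 --save ul2-4bit-128g.pt

Intuitively, the dampening adds a small multiple of the average Hessian diagonal back onto the diagonal before the Cholesky factorization, which pushes near-singular layer Hessians back into positive-definite territory, while more calibration samples make the accumulated Hessian better conditioned in the first place.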

@johnrobinsn

@efrantar, thanks for your feedback earlier in the thread.

I see a fairly significant drop-off in performance when attempting 4-bit quantization of t5* models with the https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/t5 branch, compared to int8 quantization.

Some details of the performance gap are captured here: qwopqwop200/GPTQ-for-LLaMa#157 (comment)

I am trying to understand your earlier comment in this thread and the quantization code in the referenced repo, in the hope of improving the int4 quantization performance on these encoder-decoder models.

This repo appears to quantize the encoder layers (layer by layer) and then the decoder layers (layer by layer), one after the other. You mentioned needing to do the encoder and decoder quantization in parallel (maybe I'm misunderstanding); can you help me understand this point a bit more, and why they might need to be done in parallel?

Also, any other insight would be appreciated.
Thanks!

@efrantar
Member

I had a slightly different encoder-decoder architecture in mind when I suggested the parallel processing of branches; for T5 specifically, quantizing first the encoder and then the decoder should be correct.

We are also currently looking at some encoder-decoder models in the context of ongoing research projects; if we find anything that could be relevant for quantizing T5 with GPTQ, I will post an update here.
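To make the confirmed order concrete, here is a minimal sketch assuming the Hugging Face T5 module layout (model.encoder.block, model.decoder.block, embed_tokens, and stack outputs exposing .last_hidden_state). sequential_quantize is the hypothetical helper sketched earlier in this thread, and the calibration tensors are illustrative names rather than anything from the repo.

```python
import torch


@torch.no_grad()
def quantize_t5(model, calib_input_ids, calib_decoder_input_ids):
    # 1) Quantize the encoder stack first, layer by layer, using the token
    #    embeddings of the calibration inputs as the stack's entry activations.
    enc_inps = [model.encoder.embed_tokens(ids) for ids in calib_input_ids]
    sequential_quantize(model.encoder.block, enc_inps)

    # 2) Run the now-quantized encoder to obtain the hidden states the decoder
    #    will actually see at inference time.
    enc_hidden = [model.encoder(input_ids=ids).last_hidden_state
                  for ids in calib_input_ids]

    # 3) Quantize the decoder stack next. Its cross-attention consumes
    #    enc_hidden, so a real implementation must pass these states into each
    #    decoder block's forward call during calibration (omitted from the
    #    simplified sequential_quantize sketch above).
    dec_inps = [model.decoder.embed_tokens(ids) for ids in calib_decoder_input_ids]
    sequential_quantize(model.decoder.block, dec_inps)

    return model
```

Calibrating the decoder against the already-quantized encoder's outputs, rather than the full-precision ones, keeps the calibration distribution consistent with what the quantized model will see at inference time.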
