Application to T5 / UL2 family #8
Do you expect this to work for the T5 architecture (and, consequently, the very similar UL2 family)? And if not, what do you suspect the issue would be, and do you expect that some adjustments would need to be made?
Google recently released Flan-UL2, which is 20B parameters in size. GPTQ could be a real life-saver here.
Comments
Hi, in principle we would expect GPTQ to work on most models. However, applying it to T5 models will require some additional implementation work, since these are encoder-decoder models, which means that a memory- and compute-efficient GPTQ implementation (similar to the current one in this repository) would probably require sequentially traversing both the encoder and the decoder branch in parallel.
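For context, the memory- and compute-efficient pattern referred to here quantizes one transformer block at a time, propagating the calibration activations forward so that Hessian statistics only ever need to be held for a single block. A minimal sketch of that pattern, where `gptq_quantize_block` is a hypothetical callable (not part of this repository) that runs GPTQ on one block given its inputs and rewrites its weights in place:

```python
import torch

@torch.no_grad()
def quantize_stack_sequentially(blocks, hidden, gptq_quantize_block):
    # Quantize one block at a time so only a single block's Hessian
    # statistics need to be kept in memory at any point.
    for block in blocks:
        # Collect input statistics for this block and quantize its
        # weights in place (gptq_quantize_block is hypothetical).
        gptq_quantize_block(block, hidden)
        # Propagate outputs through the now-quantized block so the next
        # block is calibrated on the activations it will actually see.
        hidden = block(hidden)
    return hidden
```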
I tried GPTQ quantization of FLAN-T5 and it seems to work successfully.
That's great to hear! Did you need to do anything in particular to get this to work? Did you just run GPTQ on the encoder and decoder separately, as @efrantar seemed to suggest?
@qwopqwop200 this is great! Thanks much... I was able to quantize flan-t5-small... but ran into this error when trying to quantize flan-ul2...
This is the command I used...
Any ideas?
This is a numerics error due to a layer Hessian not being positive-definite; you could try applying higher dampening.
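For reference, dampening in GPTQ-style implementations is usually applied by adding a fraction of the mean Hessian diagonal back onto the diagonal before the Cholesky factorization; increasing that fraction (often exposed as a `percdamp`-style argument, e.g. raising it from 0.01 to 0.05 or 0.1) is the fix being suggested here. A minimal sketch of the idea, assuming `H` is the accumulated layer Hessian:

```python
import torch

def dampen_hessian(H: torch.Tensor, percdamp: float = 0.01) -> torch.Tensor:
    """Add `percdamp` times the mean diagonal to H's diagonal.

    Increasing percdamp helps when the Cholesky factorization fails
    because H is not positive-definite, at some cost in accuracy.
    """
    damp = percdamp * torch.mean(torch.diag(H))
    idx = torch.arange(H.shape[0], device=H.device)
    H[idx, idx] += damp
    return H
```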
@efrantar, thanks for your feedback earlier in the thread. I see a fairly significant drop-off in performance with an attempt at 4-bit quantization of t5* models using the https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/t5 branch, as compared to int8 quantization. Some details of the perf gap are captured here. I am trying to understand your earlier comment in this thread and the quantization code in the referenced repo, in the hope of improving the int4 quantization performance on these encoder-decoder models.
That repo appears to quantize the encoder layers (layer by layer) and then the decoder layers (layer by layer), one after the other. You mentioned needing to do the encoder and decoder quantization in parallel (maybe I'm misunderstanding), but can you help me understand this point a bit more, and why they might need to be done in parallel? Any other insight would also be appreciated.
I had a slightly different encoder-decoder architecture in mind when I suggested processing the branches in parallel; for T5 specifically, quantizing the encoder first and then the decoder should be correct. We are also currently looking at some encoder-decoder models in the context of ongoing research projects; if we find anything relevant to quantizing T5 with GPTQ, I will post an update here.
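For what it's worth, a rough sketch of that encoder-then-decoder ordering for a Hugging Face T5 model might look like the following. The `gptq_quantize_block` helper is hypothetical (standing in for a per-block GPTQ pass, as in the sketch above), and attention masks and other details are omitted for brevity:

```python
import torch

@torch.no_grad()
def quantize_t5_sequentially(model, input_ids, decoder_input_ids,
                             gptq_quantize_block):
    # 1) Encoder first: quantize block by block while propagating the
    #    calibration activations, then keep the final encoder states.
    hidden = model.encoder.embed_tokens(input_ids)
    for block in model.encoder.block:
        gptq_quantize_block(block, hidden)  # hypothetical per-block GPTQ
        hidden = block(hidden)[0]  # HF T5 blocks return tuples
    encoder_states = hidden

    # 2) Decoder second: its cross-attention consumes the already
    #    quantized encoder outputs, which is why the order matters.
    hidden = model.decoder.embed_tokens(decoder_input_ids)
    for block in model.decoder.block:
        gptq_quantize_block(block, hidden,
                            encoder_hidden_states=encoder_states)
        hidden = block(hidden, encoder_hidden_states=encoder_states)[0]
```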