IBM's Granite 20B Code Instruct goes off the rails when using any of: FP8 cache, Q4 cache, or speculative decoding (n-gram). #478
-
I'll investigate. What are you using as a draft model for SD?
-
Okay, so the Q4 problem is due to the fact that it's an MQA model. That means it has a single key/value head with a dimension of 128, while the Q4 kernels use a block size of 512, so the cache ends up being quantized in lengths that don't line up with the head dimension. It's somewhat unimportant, though: because the model uses MQA, the cache size is already extremely small, and because it also uses learned positional embeddings you can't extend the context beyond the native 8k tokens, so there's not much to gain from quantizing the cache in the first place. Still, that part should be okay now.

The other two issues I've been unable to reproduce. n-gram decoding works fine here as far as I can tell. Likewise, FP8 cache uses a block size of 64, which isn't causing issues.

It might be possible that the gibberish is caused by an incorrect prompt format. I never got around to adding the correct prompt template for Granite to ExUI. It's somewhat unusual, and it's hard to say how the model would behave if you use ChatML instead. It seems okay, but it might depend on your prompts. Anyway, I've added the correct format to ExUI just now, so you can try it.

FP16: [screenshot]
FP8: [screenshot]
Q4 (with exllamav2 dev branch): [screenshot]

These are all with n-gram decoding enabled. It also works with smaller Granite models as drafts.
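To make the block-size mismatch concrete, here's a back-of-the-envelope sketch using only the numbers above (single 128-dim K/V head, Q4 block of 512, FP8 block of 64). It's just arithmetic, not exllamav2's actual cache layout: a 512-element Q4 block has to straddle several token positions of the single head, while 64-element FP8 blocks divide the head evenly.

```python
# Back-of-the-envelope only; the real cache layout in exllamav2 may differ.
HEAD_DIM  = 128   # single K/V head (MQA), as stated above
Q4_BLOCK  = 512   # block size used by the Q4 cache kernels
FP8_BLOCK = 64    # block size used by the FP8 cache kernels

tokens_per_q4_block = Q4_BLOCK // HEAD_DIM    # = 4
fp8_blocks_per_head = HEAD_DIM // FP8_BLOCK   # = 2

print(tokens_per_q4_block)  # 4 -> one Q4 scale/offset would span several positions' worth of cache
print(fp8_blocks_per_head)  # 2 -> FP8 blocks fit evenly inside a single head
```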
-
I tested in EXUI using the ChatML prompt template and the 6.0BPW quant from @turboderp (https://huggingface.co/turboderp/granite-20b-code-instruct-exl2/tree/6.0bpw).
With the FP16 cache and no SD it gives nice, coherent answers, but turning on any of the mentioned options makes it spew gibberish.
Just wondering whether that's expected or a bug in exllamav2's support for that model?
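For anyone wanting to reproduce this outside ExUI, here's a minimal sketch of swapping cache types with exllamav2's Python API. Class and constructor names (ExLlamaV2Cache, ExLlamaV2Cache_8bit, ExLlamaV2Cache_Q4, etc.) are as I understand them and may differ between releases; the model path is a placeholder for the 6.0bpw quant linked above, and the Question/Answer prompt is my guess at the Granite template turboderp mentions (check the model's chat template). Speculative decoding isn't set up here.

```python
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache,        # FP16 cache
    ExLlamaV2Cache_8bit,   # FP8 cache
    ExLlamaV2Cache_Q4,     # Q4 cache
    ExLlamaV2Tokenizer,
)
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

model_dir = "/path/to/granite-20b-code-instruct-exl2-6.0bpw"  # placeholder path to the quant

config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)

# Swap this one line between ExLlamaV2Cache / ExLlamaV2Cache_8bit / ExLlamaV2Cache_Q4
# to compare FP16, FP8, and Q4 cache behavior.
cache = ExLlamaV2Cache_Q4(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()

# Assumed Granite-style prompt; ChatML is what was tested in ExUI above.
prompt = "Question:\nWrite a Python function that reverses a string.\n\nAnswer:\n"
print(generator.generate_simple(prompt, settings, 200))
```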