IBM's Granite 20B Code Instruct goes off the rails when using any of: FP8 cache, Q4 cache, or speculative decoding (n-gram). #478
-
I'll investigate. What are you using as a draft model for SD?
-
Okay, so the Q4 problem is due to the fact that it's an MQA model. That means it has a single key/value head with a dimension of 128, while the Q4 kernels use a block size of 512, so the cache ends up being quantized in lengths that don't line up with the head dimension. It's somewhat unimportant, though: because the model uses MQA, the cache size is already extremely small, and because it also uses learned positional embeddings you can't extend the context beyond the native 8k tokens, so there's not much to gain from quantizing the cache in the first place. Still, that part should be okay now.

The other two issues I've been unable to reproduce. n-gram decoding works fine here as far as I can tell. Likewise, FP8 cache uses a block size of 64, which isn't causing issues.

It might be possible that the gibberish is caused by an incorrect prompt format. I never got around to adding the correct prompt template for Granite to ExUI. It's somewhat unusual, and it's hard to say how the model would behave if you use ChatML instead. It seems okay, but it might depend on your prompts. Anyway, I've added the correct format to ExUI just now, so you can try it.

FP16: [screenshot]
FP8: [screenshot]
Q4 (with exllamav2 dev branch): [screenshot]

These are all with n-gram decoding enabled. It also works with smaller Granite models as drafts.
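To make the block-size mismatch concrete, here's a back-of-the-envelope sketch using only the numbers above (single 128-dim K/V head, Q4 block of 512, FP8 block of 64). It's just arithmetic, not exllamav2's actual cache layout: a 512-element Q4 block has to straddle several token positions of the single head, while 64-element FP8 blocks divide the head evenly.

```python
# Back-of-the-envelope only; the real cache layout in exllamav2 may differ.
HEAD_DIM  = 128   # single K/V head (MQA), as stated above
Q4_BLOCK  = 512   # block size used by the Q4 cache kernels
FP8_BLOCK = 64    # block size used by the FP8 cache kernels

tokens_per_q4_block = Q4_BLOCK // HEAD_DIM    # = 4
fp8_blocks_per_head = HEAD_DIM // FP8_BLOCK   # = 2

print(tokens_per_q4_block)  # 4 -> one Q4 scale/offset would span several positions' worth of cache
print(fp8_blocks_per_head)  # 2 -> FP8 blocks fit evenly inside a single head
```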
-
I tested in EXUI using the ChatML prompt template and the 6.0BPW quant from @turboderp (https://huggingface.co/turboderp/granite-20b-code-instruct-exl2/tree/6.0bpw).
With the FP16 cache and no SD it gives nice, coherent answers, but turning on any of the mentioned options makes it spew gibberish.
Just wondering whether that's expected or a bug in exllamav2's support for that model?
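For anyone wanting to reproduce this outside ExUI, here's a minimal sketch of swapping cache types with exllamav2's Python API. Class and constructor names (ExLlamaV2Cache, ExLlamaV2Cache_8bit, ExLlamaV2Cache_Q4, etc.) are as I understand them and may differ between releases; the model path is a placeholder for the 6.0bpw quant linked above, and the Question/Answer prompt is my guess at the Granite template turboderp mentions (check the model's chat template). Speculative decoding isn't set up here.

```python
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache,        # FP16 cache
    ExLlamaV2Cache_8bit,   # FP8 cache
    ExLlamaV2Cache_Q4,     # Q4 cache
    ExLlamaV2Tokenizer,
)
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

model_dir = "/path/to/granite-20b-code-instruct-exl2-6.0bpw"  # placeholder path to the quant

config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)

# Swap this one line between ExLlamaV2Cache / ExLlamaV2Cache_8bit / ExLlamaV2Cache_Q4
# to compare FP16, FP8, and Q4 cache behavior.
cache = ExLlamaV2Cache_Q4(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()

# Assumed Granite-style prompt; ChatML is what was tested in ExUI above.
prompt = "Question:\nWrite a Python function that reverses a string.\n\nAnswer:\n"
print(generator.generate_simple(prompt, settings, 200))
```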