Add Exclude Top Choice (XTC) sampler #625

Open · wants to merge 1 commit into base: dev

Conversation

@Cyrus-Hei

A crude Python implementation of the XTC sampler introduced in oobabooga/text-generation-webui PR #6335.

This is the version in which EOS and newline tokens are excluded from the sampler. From my brief testing the sampler seems to be working, but it appears to slow down generation heavily when activated; I'm not sure how to fix that at the moment.
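
For reference, a minimal sketch of the sampler logic as described here and in the upstream PR: with probability `xtc_probability`, remove every token at or above `xtc_threshold` except the least probable one of that set, never removing EOS/newline. The parameter names and defaults follow the upstream PR; this is an illustration, not the code in this PR's diff.

```python
import random
import torch

def xtc_filter(probs: torch.Tensor,
               xtc_threshold: float = 0.1,
               xtc_probability: float = 0.5,
               protected_ids: frozenset = frozenset()) -> torch.Tensor:
    """With probability xtc_probability, zero out every token whose probability
    is >= xtc_threshold except the least probable one of that set; token ids in
    protected_ids (e.g. EOS, newline) are never removed."""
    if random.random() >= xtc_probability:
        return probs                                   # sampler not triggered this step

    above = (probs >= xtc_threshold).nonzero().squeeze(-1)
    if above.numel() < 2:
        return probs                                   # nothing to exclude

    survivor = above[probs[above].argmin()].item()     # least probable of the set survives
    out = probs.clone()
    for tid in above.tolist():
        if tid != survivor and tid not in protected_ids:
            out[tid] = 0.0
    return out / out.sum()                             # renormalize
```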

@baronrabban

I am curious what type of slowdown you are seeing. I am running tensor parallel with multiple GPUs.

Cold evaluation with your XTC change:

(Queue: 0.0 s, Process: 0 cached tokens and 14038 new tokens at 618.25 T/s, Generate: 13.02 T/s, Context: 14038 tokens)

Two cold evaluations without your change:

(Queue: 0.0 s, Process: 0 cached tokens and 14038 new tokens at 623.05 T/s, Generate: 13.18 T/s, Context: 14038 tokens)

(Queue: 0.0 s, Process: 0 cached tokens and 14038 new tokens at 622.48 T/s, Generate: 13.22 T/s, Context: 14038 tokens)

So the slowdown was either 0.16 or 0.20 T/s, which I would not describe as heavy. Also, I think your XTC change is working and producing results similar to those I saw when testing XTC in kobold.

@turboderp
Owner

You're not going to see much of a slowdown when the baseline is 13.2 t/s. It's an extra 1.2 ms/token of latency, which would definitely be felt with smaller models. The right place to apply this would be right before the multinomial, just by scaling the sampling interval from 0..1 to x..1. Here x would be either a constant or adjusted based on what the top token is, if (as seems to be the trend with these methods) you have to make exceptions to avoid completely breaking the model.
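
For illustration, a minimal sketch of that suggestion, assuming the candidate tokens are sorted by descending probability and sampled by inverting the cumulative distribution; `x` is the constant from the comment above, and everything else here is an assumption, not the library's actual sampling code.

```python
import torch

def sample_from_scaled_interval(sorted_probs: torch.Tensor, x: float) -> int:
    """Multinomial sampling over probabilities sorted in descending order,
    with the uniform draw taken from [x, 1) instead of [0, 1). Tokens whose
    entire slice of the cumulative distribution lies below x can never be
    selected, which removes the top tokens without an extra filtering pass."""
    u = x + (1.0 - x) * torch.rand(()).item()      # uniform draw on [x, 1)
    cdf = torch.cumsum(sorted_probs, dim=-1)
    idx = int((cdf < u).sum().item())              # first index with cdf >= u
    return min(idx, sorted_probs.numel() - 1)      # guard against float rounding at the top end
```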

Why not use skew sampling, though? It's a very similar idea, only it's a smoother function.

XTC:

[image]

Skew:

[image]
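
To make the comparison concrete: XTC cuts the top of the sorted distribution off entirely, while a skew-style approach can get a similar effect by smoothly warping the uniform draw toward the tail. The transform below is only an illustrative example of such a smooth function, not necessarily the skew formula actually used here.

```python
import random

def skewed_draw(skew: float) -> float:
    """Return a draw in [0, 1) biased toward 1, i.e. toward the tail of a
    distribution sorted in descending probability. skew = 0 leaves sampling
    unchanged; larger values push the draw away from the top tokens smoothly
    rather than with a hard cutoff."""
    u = random.random()
    return u ** (1.0 / (1.0 + skew))
```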

@Cyrus-Hei
Author

Different models react differently as far as I can tell, probably related to vocab size and how the implementation checks the logits. At xtc probability 0.8, I noticed a drop from 31 T/s to 24 T/s for a Gemma 2 27B finetune at 5 bpw, and a drop from 38 T/s to 31 T/s for a Mistral Nemo upscale finetune (Theia 21B) at 6 bpw. On the other hand, I noticed only a drop from 36 T/s to 35 T/s on Mistral Small Instruct 2409 at 6.5 bpw. These tests were all carried out on a single RTX 3090 at around 5k/16k context.

As a side note, to be honest I can't really tell whether XTC is actually working; I can't tell whether the differences in generations come from temperature or from XTC (or whether my settings are simply off for the models I am testing). And I am very skeptical about using a probability to decide whether the sampler should activate or not.

As for the part about adding exceptions, I am also not sure that is required, since the idea is to remove the set of tokens above the threshold except the least probable one in that set. I would expect that when the reply should end, the EOS token has a very high probability, which likely makes it the only token above the threshold (and thus keeps it). The exception was added only to prevent breaking larger models (70B+), as mentioned in the original PR, and I suspect the discussion is still ongoing on their side. My experience is only with smaller models (30B and below), and they have worked completely fine without handling EOS as an exception, in both this implementation and KoboldCPP's.

For the skew sampler, I want to suggest adding more documentation for it; I am as lost as I could be trying to figure out what values to use. If I had to say what makes XTC better (or worse), it would be XTC's easier-to-understand parameters and its more rigorous attempt to completely eliminate the top tokens when it activates, which pushes the LLM harder toward more "creative" tokens.

@baronrabban

At xtc probability 0.8

How did you select this? I believe the default is 0.5 which means XTC kicks in 50% of the time. At 0.8 I believe you're using XTC 80% of the time which perhaps leads to more of a slowdown than using it 50% of the time.

I am using a variant of Mistral Large taking up 100 GB of VRAM, with 0.8 temperature, 0.02 min P, and all other samplers disabled. The results are quite good and there are no quirks. I put prints on the EOS/newline section and it is definitely kicking in at the end of generation.

Beyond the print statements, I can tell XTC is working from the way it changes the story. I have some scenarios where I know how the story is supposed to go, and it goes that way pretty much every time; with XTC it definitely changes things up, in a good way.
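
For anyone wanting to reproduce a setup like the one described above, a rough sketch using exllamav2's sampler settings object. The `xtc_probability` / `xtc_threshold` attribute names are assumed from this PR and the upstream one, and the values used to disable the other samplers are assumptions about their semantics, so treat this as a sketch rather than the definitive API.

```python
from exllamav2.generator import ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.min_p = 0.02
settings.top_k = 0                 # assumption: 0 disables top-k truncation
settings.top_p = 1.0               # assumption: 1.0 disables top-p truncation
# attribute names assumed to match the XTC parameters from the upstream PR
settings.xtc_probability = 0.5
settings.xtc_threshold = 0.1
```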

@Cyrus-Hei
Author

At 0.8 I believe you're using XTC 80% of the time which perhaps leads to more of a slowdown than using it 50% of the time.

It is more of an experimental value for testing purposes, and you are absolutely right that 0.8 slows down generation more than 0.5. The point I want to make is that this implementation can drop generation speed significantly for some models and settings: for example, if a model suffers a 20% slowdown at 0.8 XTC probability, it is reasonable to expect a slowdown of around 10% at 0.5, since the overhead should scale roughly with how often the sampler activates.

I would say this implementation is more for testing purposes. If more data supports the effectiveness of XTC, I might look into doing it in C++ if I have time, or someone else could open a new PR.

@aarongerber commented Sep 19, 2024

Why not use skew sampling, though? It's a very similar idea, only it's a smoother function.
@turboderp Did you ask p-e-w about this on the original implementation, or get an answer on it? My first thought is that it might allow too many of the top choices through. Still, I would love to see you explore this! :) If I had half your brains/knowledge of coding or math, I would. The visual looks compelling. I am assuming it wouldn't eliminate the top choices but decrease their probability? It would be interesting if you could set the width/spread/range of the curve, and its positioning. This would let you raise or lower the probability of top tokens in two ways. Perhaps unneeded control? Anyway, thanks for all you do!
