ContextShift sometimes degrades output #550
Comments
I usually run into this sort of thing in ST + KCPP when using functions like "Continue" twice on the same message, or editing semi-recent messages to take an alternate plot path, or something like that. It's happened far less after the recent KV cache changes though, and it never seems to happen under normal use, only when I start getting ballsy with mass edits and weird third-party functions.
It may be worth keeping an eye on the number of tokens ContextShift has evicted. You should also compare against cases where ContextShift is disabled. Are you running 1.51.1?
Yes, sorry I just wrote the commit ref. That is 1.51.1. I pull like once per week. Don't want to miss out on the nice stuff you implement ;-)
Sure, I need a more methodical approach to dig down further. Why do you say I should keep an eye on the numbers? What am I looking for? Does it matter if I let it evict 100 tokens or 400? And is there a difference between evicting 200 tokens once and evicting 50 tokens four times in a row?

I suppose I use that feature like most people would, to "stream" longer output and re-use the KV cache. I've read the StreamingLLM paper; I suppose "ContextShift" plays out a bit differently for that use case? How does this feature compare to their findings? Judging by KoboldCpp's debug output, I'd say you don't keep 4 "attention sinks" around? It tells me it evicts starting at token 2.

Edit: I had a quick look at the code. It seems the implementation just shifts around the sequence numbers in the cache and doesn't touch any values. Doesn't that practically boil down to window attention (without re-computation) from the StreamingLLM paper? ( mit-han-lab/streaming-llm#33 )
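(If it helps to picture that observation: here is a toy sketch, in plain Python, of what a shift that only remaps cached positions amounts to. It is an illustration of the idea, not KoboldCpp's or llama.cpp's actual code, and the function and parameter names are made up.)

```python
# Toy model of a position-only context shift: the cached key/value entries are
# kept as-is, only their positions are renumbered after a block is evicted.
# Illustrative sketch only - not KoboldCpp's or llama.cpp's actual code.

def context_shift(cache, n_keep, n_discard):
    """cache: list of (position, token) pairs, oldest first.
    Drop n_discard entries after the first n_keep, then move the positions
    of everything that followed back by n_discard. Nothing is recomputed."""
    head = cache[:n_keep]
    tail = [(pos - n_discard, tok) for pos, tok in cache[n_keep + n_discard:]]
    return head + tail

# Example: keep 1 token and evict the next 3.
cache = [(i, f"tok{i}") for i in range(8)]
print(context_shift(cache, n_keep=1, n_discard=3))
# [(0, 'tok0'), (1, 'tok4'), (2, 'tok5'), (3, 'tok6'), (4, 'tok7')]
```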
This is the default tokenizer used in llama.cpp being shite and broken. Something about the implementation affects things outside of just tokenization. Using GPT-2 or NAI through ST resolves this, but often breaks context shifting. I have brought this up many times privately with lostruins, but pinpointing the exact issue is a bit hard. Try using a different tokenizer and it should resolve the issues. In my case, it starts removing the word "the" and shortening sentences a lot, becoming very blunt. TL;DR: don't use Lite, don't use the auto/API tokenizer.
@LostRuins, can Lite (optionally?) not trim the text itself, and instead always send the whole history to the backend?
That has also happened to me.
What kind of magic does the tokenizer do? I thought tokenizing was a very straightforward operation?
I don't think that works. The context window has a fixed size; you can't fit more than that into it.
The KoboldCpp backend should do the trimming (token-wise), not the Lite client (by words). More discussion: #445 (comment) If the problem persists, we would know that it's not Lite's fault, but either a bug in KoboldCpp or in the upstream llama.cpp library.
@aleksusklim there is nothing wrong with the tokenizer, as far as I can tell. You can view the tokens in context with …
I'm sure this is not the frontend. I found a similar bug report in llama.cpp: ggerganov#4097
Found this thread earlier while looking at the same issue in llama.cpp. If I do any long chat conversation with context shifting, it does indeed end up repeating endlessly. However, it doesn't seem to be entirely inherent to the context shifting itself (although maybe partially, since there's more garbage looping into the calculations already). If I re-run the same prompt tokens as-is in a blank slot after they were shifted, the output is already quite degraded as well. To me it seems to be mostly a matter of excessive repeated use of tokens from the output going into a feedback loop.

I'm not sure about increasing token penalties, since that affects other important tokens as well. I solved it in my application by just tracking the last 18 sentences or lines that were written by the model. Then, whenever a new complete sentence is generated, I check whether it has at least 50% fresh tokens compared to each of those 18 lines individually. (So if any of the last 18 sentences is more than 50% the same as the new one, the new line gets rejected.) This seems to work well so far in my testing, without affecting prompt-related tokens. Output after many context shifts stays very coherent with that filter in place. It even works well for long generation sequences with no user input.
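(For reference, a rough Python sketch of a filter like the one described above. The names, the whitespace tokenization, and the exact overlap measure are my assumptions, not the commenter's actual code.)

```python
# Rough sketch of the rejection filter described above: remember the last 18
# model-written sentences and reject a new sentence unless at least 50% of its
# tokens are "fresh" compared to every one of those sentences individually.
# Names and the whitespace tokenization are assumptions for illustration.

from collections import deque

RECENT_LIMIT = 18   # how many past sentences/lines to remember
FRESH_RATIO = 0.5   # required share of tokens not seen in any single past sentence

recent_sentences = deque(maxlen=RECENT_LIMIT)

def is_fresh_enough(sentence: str) -> bool:
    new_tokens = sentence.lower().split()
    if not new_tokens:
        return False
    for old in recent_sentences:
        old_tokens = set(old.lower().split())
        fresh = sum(1 for t in new_tokens if t not in old_tokens)
        # Reject if any single recent sentence already contains more than
        # half of the new sentence's tokens.
        if fresh / len(new_tokens) < FRESH_RATIO:
            return False
    return True

def accept_sentence(sentence: str) -> bool:
    """Call on each newly completed sentence; returns False if it should be
    rejected (e.g. regenerated) because it mostly repeats a recent sentence."""
    if not is_fresh_enough(sentence):
        return False
    recent_sentences.append(sentence)
    return True
```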
Not all windowing methods are able to prevent degradation of perplexity, but the results from the attention sinks (StreamingLLM) method look pretty nice - it looks worth implementing. I've found an open issue for implementing this in llama.cpp here: ggerganov#3440.
llama.cpp has implemented this now. I suppose it just leaves the first n tokens (4) in place when shifting the rest of the context.
I am not sure if it would work exactly as in the paper (it would require digging into the original and transformers implementations, then comparing them with how it's done in llama.cpp, then running something like a 1M-token test), but yeah, using 4 or 8 initial tokens is part of this method. To work as described in the paper, setting up the attention sink(s) (n_discard) is also required - but I don't see this option available as a launch parameter for main or server.
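(As a rough illustration of that eviction policy - a toy sketch only, not how llama.cpp actually implements it, and the parameter names are mine:)

```python
# Toy sketch of StreamingLLM-style eviction with attention sinks: the first
# n_sink tokens are always kept, and when the cache is full, a block of
# n_discard tokens right after the sinks is dropped so recent tokens slide in.
# Parameter names are assumptions; this is not llama.cpp's implementation.

N_SINK = 4        # pinned "attention sink" tokens at the start
N_CTX = 4096      # total cache budget
N_DISCARD = 256   # block size to evict when the cache is full

def evict_if_needed(tokens):
    """tokens: list of cached token ids, oldest first."""
    if len(tokens) < N_CTX:
        return tokens
    return tokens[:N_SINK] + tokens[N_SINK + N_DISCARD:]
```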
Isn't the dedicated Memory field doing that work? Just put four newlines there and you're safe. In case you have actual memory content (which you already should, because otherwise how do you think the model would behave regarding its system prompt?), everything will work as-is in KoboldCpp.
Sure, I've had disagreements about that paper before. I think just the four tokens will solve this problem, though. The rest of StreamingLLM is probably never going to get implemented; they closed the issue in llama.cpp last week, and there doesn't seem to be much interest in this method in general.
Thanks. That's a good idea and it somehow skipped my mind. However, it'd probably be nice to have this as a setting that is enabled by default if there isn't anything pinned to the first tokens, since this is a rather non-obvious workaround. And messing with the first tokens probably never works well anyway?!
This is much more straightforward than pinning random empty tokens at the beginning of the context!
Or, for roleplay, you get:
Do you really think just "keeping 4 tokens from the start" would not cause degradation of quality? I believe for question-answering you should pin either just the system prompt, or the system prompt plus a few on-topic question-answer pairs. (In case of an empty system prompt - then yes, you'll need at least 4 tokens of "something" there…) Personally, I did exactly this right when ContextShift came out, and it worked well (except for accidental re-evaluations). Now I use only miqu/mixtral, and the context at its maximum of 64k is more than enough for any possible application for me! (And it probably could be further extended to 128k right away by lifting that artificial limit.)
To verify whether context shifting works properly, it would be best to add a proper test case for it - like a 1M+ token conversation (StreamingLLM with attention sinks can do that - tested by MIT up to 4M+ tokens). If implemented right, models should be able to keep predicting tokens reasonably (low and steady perplexity, memory usage, and compute). It should work even with small contexts like 4k, a bit like how we talk as humans (we don't remember every single word, but rather focus our attention on the most relevant information - and continuously discard the rest as conversations get longer).
The last time I tried to make an external test case, it failed because of regenerations. Maybe it's worth checking again, but as I said, I don't see a reason to use context shifting anymore, because now we have models with very long contexts.
Those are not mutually exclusive things. IMO we should have both large contexts AND the ability to have infinite conversations - like we sometimes have with other humans, where a single turn of conversation could be a pretty long letter (like 10-20 pages of text). Currently it feels really bad when you have a long and interesting conversation and then the model collapses (predicts crap tokens, gets stuck and repeats, etc.). Please also mind that a lot of people use consumer-grade cards with just 16 GB of VRAM, or even less, which means all they can use are usually quantized 7B to 13B models with much smaller contexts (8k to 32k for finetunes of Mistral).
You mean because of broken shifting or without shifting at all?
I do not use GPU offloading anymore: #737 (comment) Given speeds like that (1-2 tokens per second at 64k with the largest Mixtral, slowing further as you fill the context), it is unfeasible to test context shifting for real. Without the ability to save and load contexts at will, it would be hard to pinpoint any bugs found, because you might not reproduce them again in the next run.
Yes, I think that's one of the main findings of the StreamingLLM paper. Sure, the context gets shifted out and dropped and becomes unavailable to the model. But it would be nice if the model continued to generate legible text.
I'd like to have a chatbot and keep talking to it, and do storywriting (novels). Even 16k (or 32k) is finite, and I hit that at some point; ContextShift is super useful for that. Also, I don't have an infinite amount of RAM for a super-large KV cache. I wouldn't want to return to the times when, every time I hit the context limit, I'd need to wait several minutes for each subsequent reply.
I'm going to close this now. I'm not sure if it's solved, but I haven't encountered this bug for quite some time. Either it's gotten better or my usage pattern has changed. Anyway, thanks for the great software.
I'm trying storywriting with KoboldCpp. At some point the story gets longer than the context, and KoboldCpp starts evicting tokens from the beginning with the (newer) ContextShift feature. Sometimes this degrades the output significantly: it gets into repetition loops, barely writes correct sentences, and forgets who is doing what. That happens after the KV cache has been messed with; the story had been fine for the first 4k tokens or so (depending on context size) before.
Does this also happen to other people? I'm not sure it does this every time; I'm fairly sure it doesn't always happen, and other times it keeps generating high-quality output. I've changed too many settings simultaneously and tried different models, so I can't really pin it down or make a definitive statement.
I'm not sure what I'm doing wrong or whether this is a bug. I have also set RoPE scaling, and I regularly edit the last paragraphs before (re)generating more output.
(I really like the speedup with ContextShift, so disabling it won't be an option.)
Environment and Context
Platform: Linux (Debian), CPU only
KoboldCpp: On branch concedo, commit 0ca814e
Steps to Reproduce
Not sure. I'd like to hear other people's experiences. Generate long output, past the (initial) context window. Keep generating more and more text.