Bug: Phi-3 mini 128k performance degradation with kv size > 8k (server) #8995
Single-line test prompt: llama-cli -m /data3hd/models/Phi-3-mini-128k-instruct.Q6_K.gguf --color -n -1 --log-disable -ngl 0 -c 4096 -ctk f16 -ctv f16 -b 128 -n 10 --keep 0 --temp 0.0 --dynatemp-range 0.0 --dynatemp-exp 1.0 --top-k 40 --top-p 0.95 --typical 1.0 --min-p 0.00 --repeat-last-n 64 --repeat-penalty 1.0 --presence-penalty 0.0 --frequency-penalty 0.0 --tfs 1.0 --mirostat 0 --mirostat-lr 0.1 --mirostat-ent 5.0 -p "in my palm is a clear stone , and inside it is a small ivory statuette . a guardian angel . `` figured if you 're going to be out at night getting hit by cars , you might as well have some backup . '' i look at him , feeling stunned . like this is some sort of sign . but as i stare at harlin , his mouth curved in a confident grin , i do n't care about"
Do you experience similar issues using other models?
Llama 3.1 (also 128k context): 3634/5153 in all cases
nkv=10240, semantically correct answer (though I graded it a mismatch; I need to update my grading logic in the bench): but the sign . i care about the fact that he
nkv=4096, same answer: the sign . i care about the fact that he
The phi-3 test is using fresh converts made after the recent sliding-window patch that broke all the phi-3 models.
I decided to reopen this since the problem is still showing up with Phi 3.5 mini, and someone else also posted a complaint about phi mini, so there is almost certainly a bug somewhere in the inference platform when running this model. The symptoms show KV cache > 8192 as the degradation trip point. Both Phi 3 mini 128k and Phi 3.5 mini can be used with a KV cache size up to 8192 with no performance degradation (my current workaround for the problem), but above 8192 performance takes a sharp dive. Phi 3.5 mini, b3609
Well, phi-3 medium does the same thing. glm-4, internlm, llama3.1: all the SOTA high-context models are OK. The difference, I believe, is LONGROPE, which only phi-3 has. From M$:
Notice the 8k in the blurb above. I don't think it's a coincidence that this is where performance gets munged, though I don't understand why just having nkv bigger than 8k would trigger it. All the LAMBADA prompts are in the range of ~100 tokens, and only 3 or 4 output tokens are generated in the test to create the next predicted word.
Probably related to the logic for choosing long/short rope factors: Lines 9393 to 9406 in 80d9d2a
There might be some issue there; need to compare with the reference code
Aha. That is definitely what is going on here. It's understandable that performance will take a hit with very long prompts which rely on rope scaling to work, but for prompts which fit inside the natural sequence length of the model (apparently targeting 8k here) performance should be unaffected. So it seems like some kind of logic is needed that dynamically selects the long/short freq factors based on the current number of tokens in the KV cache (not its max size). The current code will always penalize the model down to the performance of long rope just by configuring a KV cache greater than 8192 (apparently the value of hparams.n_ctx_orig_yarn). According to M$, all these original models will be affected:
The new Phi-3.5 mini, vision, and MoE models all use LONGROPE and will also be affected. The "original_max_token_embeddings" is 4096, so that might also explain some deviation in performance at the
Well, I dug into the code and don't see a quick fix for this. On the surface a simple hack could be used:

// if (n_ctx_pre_seq > hparams.n_ctx_orig_yarn) {

but this gives the aggregate KV use, not the use in the batch to be decoded, and it also does not account for the tokens which will be added to the KV cache during the decode.

The more top-down way to handle it would be to predict the final output KV tokens in llama_decode and filter that info down to llama_build_graph and then the llama_build_phi3 routine as a parameter somehow. However, the batching machinery throws a monkey wrench into the works: there doesn't seem to be a way to configure uniquely on a per-seq-id basis inside the batch, i.e. llama_build_phi3 is going to cover all seq ids in the batch no matter how big each individual one is. Hence, unless running a single slot, it seems impossible to adapt the config to the number of final KV tokens for a slot after decode, and it would be a complete mess to have performance varying as a function of how many other unrelated slots are running.

Unless anyone else has ideas, I think the only resolution to this problem is to document somewhere that for phi-3 models using LONGROPE, if the KV cache is sized >8192 it is going to use long rope scaling all the time and performance will be degraded. I am guessing the other models with fixed rope scaling are essentially running this
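Until something better lands, the workaround described above amounts to capping the server's context at the short-rope threshold. A sketch of the invocation (model path is illustrative; other flags as needed):

```shell
# Keep the KV cache at or below 8192 so the short rope factors stay selected.
llama-server -m Phi-3-mini-128k-instruct.Q6_K.gguf -c 8192
```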
What happened?
I ran some benches on Phi-3 mini 128k and noticed a large performance drop in LAMBADA, from 0.618 to 0.496 acc. I traced the problem to increasing the size of the KV cache above 8k with the server (at any value above 8k the acc drops to 0.494). Performance on other benches is also degraded when the KV cache size is above 8k.
Name and Version
b3565
What operating system are you seeing the problem on?
Linux
Relevant log output