I thought that surely by now #9298 would be fixed, so I foolishly decided to click 'reinstall' on my llama.cpp backend to get the latest and greatest. But:
- A slightly different first-token-duplicate bug is either still there or newly introduced in the gRPC wrapper - only in streaming mode, and only for Qwen models AFAICT.
- When `max_tokens` is set and the token budget is exhausted during streaming, the endpoint resets and restarts generation rather than terminating with `finish_reason: "length"`. Each restart re-emits a usage chunk with `completion_tokens == max_tokens` and `finish_reason: null`, then begins a fresh reasoning block from scratch. This repeats until the endpoint eventually emits `finish_reason: "stop"` - not a natural model stop from the backend, but apparently the wrapper giving up after several loops. Actual token consumption is therefore a multiple of `max_tokens`. Again, this is only in streaming mode.
Repro: `POST /v1/chat/completions` with `"stream": true` and `"max_tokens": 10` (the prompt doesn't matter, "count to 5" will do) against a thinking-capable model (e.g. Gemma4 MoE). For bonus points, a Qwen 3.5/3.6 model will also trigger the first bug.
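
For convenience, here is a minimal Python sketch of the repro (the endpoint URL and model name are placeholders for my local setup; adjust as needed). Watch the streamed chunks for repeated usage entries with `completion_tokens == max_tokens` and `finish_reason: null`:

```python
import json
import requests

# Placeholders - point these at your own LocalAI instance and model.
URL = "http://localhost:8080/v1/chat/completions"
payload = {
    "model": "your-thinking-capable-model",
    "stream": True,
    "max_tokens": 10,
    "messages": [{"role": "user", "content": "count to 5"}],
}

with requests.post(URL, json=payload, stream=True) as resp:
    for raw in resp.iter_lines():
        # SSE lines look like: b'data: {...}' or b'data: [DONE]'
        if not raw or not raw.startswith(b"data: "):
            continue
        data = raw[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        choice = chunk["choices"][0] if chunk.get("choices") else {}
        # A correct stream should end with finish_reason: "length" after ~10 tokens;
        # instead you see repeated usage chunks and restarted reasoning blocks.
        print(choice.get("finish_reason"), chunk.get("usage"))
```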
I tried a cursory search in the issues (the state of which is what it is - not surprising given the target crowd) and apparently I'm the only one hitting this, which truly baffles me, especially since it must've been going on for weeks. Also @mudler: are there no basic regression tests for the backend wrappers?