multiple regressions in llama.cpp backend: max_tokens loops in streaming mode, first token duplicated #9716

@lowne-inf

Description

I thought that surely by now #9298 would be fixed, so I foolishly decided to click 'reinstall' on my llama.cpp backend to get the latest and greatest. But:

  1. A slightly different first-token-duplicate bug is either still there or newly introduced in the gRPC wrapper - only in streaming mode, and only for Qwen models AFAICT.

  2. When max_tokens is set and the token budget is exhausted during streaming, the endpoint resets and restarts generation rather than terminating with finish_reason: "length". Each restart re-emits a usage chunk with completion_tokens == max_tokens and finish_reason: null, then begins a fresh reasoning block from scratch. This repeats until the endpoint eventually emits finish_reason: "stop", which is not a natural model stop from the backend but apparently the wrapper giving up after several loops. Actual token consumption is therefore a multiple of max_tokens. Again, this is only in streaming mode; see the sketch right after this list for the pattern I mean.
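
A minimal sketch of the pattern, assuming the SSE chunks have already been parsed into dicts in the usual OpenAI-compatible shape; the function name and arguments are mine, not anything in the backend:

```python
def looks_like_budget_restart(chunks, max_tokens):
    """Return True if the stream shows the looping described above: more than
    one usage chunk reporting completion_tokens == max_tokens before any
    finish_reason arrives (a correct stream should terminate with
    finish_reason "length" right after the first one)."""
    exhausted_reports = 0
    for chunk in chunks:
        usage = chunk.get("usage") or {}
        if usage.get("completion_tokens") == max_tokens:
            exhausted_reports += 1
        choices = chunk.get("choices") or [{}]
        if choices[0].get("finish_reason") is not None:
            break
    return exhausted_reports > 1
```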

Repro: POST /v1/chat/completions with "stream": true and "max_tokens": 10 (the prompt doesn't matter; "count to 5" will do), using a thinking-capable model (e.g. Gemma4 MoE). For bonus points, a Qwen 3.5/3.6 model will also trigger the first bug.
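
Roughly the request I'm sending, as a Python sketch; the base URL and model name are placeholders for your own setup:

```python
import json
import requests

URL = "http://localhost:8080/v1/chat/completions"  # placeholder endpoint
MODEL = "gemma4-moe"  # placeholder: any thinking-capable model

payload = {
    "model": MODEL,
    "stream": True,
    "max_tokens": 10,
    "messages": [{"role": "user", "content": "count to 5"}],
}

# Print finish_reason and usage for every chunk; with the bug you see several
# usage chunks with completion_tokens == 10 and finish_reason null before a
# final "stop", instead of a single "length".
with requests.post(URL, json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for raw in resp.iter_lines():
        if not raw or not raw.startswith(b"data: "):
            continue
        data = raw[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        choice = (chunk.get("choices") or [{}])[0]
        print(choice.get("finish_reason"), chunk.get("usage"))
```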

I tried a cursory search in the issues (the state of which is what it is - not surprising given the target crowd) and apparently I'm the only one hitting this, which truly baffles me, especially since it must've been going on for weeks. Also @mudler: are there no basic regression tests for the backend wrappers?
