multiple regressions in llama.cpp backend: max_tokens loops in streaming mode, first token duplicated #9716

@lowne-inf

Description

I thought that surely by now #9298 would be fixed, so I foolishly decided to click 'reinstall' on my llama.cpp backend to get the latest and greatest. But:

  1. A slightly different first-token-duplicate bug is either still there or newly introduced in the gRPC wrapper - only in streaming mode, and only for Qwen models AFAICT.

  2. When max_tokens is set and the token budget is exhausted during streaming, the endpoint resets and restarts generation rather than terminating with finish_reason: "length". Each restart re-emits a usage chunk with completion_tokens == max_tokens and finish_reason: null, then begins a fresh reasoning block from scratch. This repeats until the endpoint eventually emits finish_reason: "stop", which is not a natural model stop from the backend but apparently the wrapper giving up after several loops. Actual token consumption is therefore a multiple of max_tokens. Again, this is only in streaming mode; see the sketch right after this list for the pattern I mean.
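
A minimal sketch of the pattern, assuming the SSE chunks have already been parsed into dicts in the usual OpenAI-compatible shape; the function name and arguments are mine, not anything in the backend:

```python
def looks_like_budget_restart(chunks, max_tokens):
    """Return True if the stream shows the looping described above: more than
    one usage chunk reporting completion_tokens == max_tokens before any
    finish_reason arrives (a correct stream should terminate with
    finish_reason "length" right after the first one)."""
    exhausted_reports = 0
    for chunk in chunks:
        usage = chunk.get("usage") or {}
        if usage.get("completion_tokens") == max_tokens:
            exhausted_reports += 1
        choices = chunk.get("choices") or [{}]
        if choices[0].get("finish_reason") is not None:
            break
    return exhausted_reports > 1
```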

Repro: POST /v1/chat/completions with "stream": true and "max_tokens": 10 (the prompt doesn't matter; "count to 5" will do), using a thinking-capable model (e.g. Gemma4 MoE). For bonus points, a Qwen 3.5/3.6 model will also trigger the first bug.
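
Roughly the request I'm sending, as a Python sketch; the base URL and model name are placeholders for your own setup:

```python
import json
import requests

URL = "http://localhost:8080/v1/chat/completions"  # placeholder endpoint
MODEL = "gemma4-moe"  # placeholder: any thinking-capable model

payload = {
    "model": MODEL,
    "stream": True,
    "max_tokens": 10,
    "messages": [{"role": "user", "content": "count to 5"}],
}

# Print finish_reason and usage for every chunk; with the bug you see several
# usage chunks with completion_tokens == 10 and finish_reason null before a
# final "stop", instead of a single "length".
with requests.post(URL, json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for raw in resp.iter_lines():
        if not raw or not raw.startswith(b"data: "):
            continue
        data = raw[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        choice = (chunk.get("choices") or [{}])[0]
        print(choice.get("finish_reason"), chunk.get("usage"))
```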

I tried a cursory search in the issues (the state of which is what it is - not surprising given the target crowd) and apparently I'm the only one hitting this, which truly baffles me, especially since it must've been going on for weeks. Also @mudler: are there no basic regression tests for the backend wrappers?
