Skip to content

fix(streaming): skip chat deltas for role-init elements to prevent first token duplication#9299

Open
mudler wants to merge 1 commit intomasterfrom
fix/first-token-dup
Open

fix(streaming): skip chat deltas for role-init elements to prevent first token duplication#9299
mudler wants to merge 1 commit intomasterfrom
fix/first-token-dup

Conversation

@mudler
Copy link
Copy Markdown
Owner

@mudler mudler commented Apr 9, 2026

When TASK_RESPONSE_TYPE_OAI_CHAT is used, the first streaming token produces a JSON array with two elements: a role-init chunk and the actual content chunk. The grpc-server loop called attach_chat_deltas for both elements with the same raw_result pointer, stamping the first token's ChatDelta.Content on both replies. The Go side accumulated both, emitting the first content token twice to SSE clients.

Fix: in the array iteration loops in PredictStream, detect role-init elements (delta has "role" key) and skip attach_chat_deltas for them. Only content/reasoning elements get chat deltas attached.

Reasoning models are unaffected because their first token goes into reasoning_content, not content.

Fixes: #9298

@mudler mudler force-pushed the fix/first-token-dup branch from 3eafab5 to a8ad30d Compare April 9, 2026 20:12
@mudler mudler force-pushed the fix/first-token-dup branch from a8ad30d to bcd0d32 Compare April 9, 2026 20:20
@mudler mudler added the bug Something isn't working label Apr 9, 2026
…rst token duplication

When TASK_RESPONSE_TYPE_OAI_CHAT is used, the first streaming token
produces a JSON array with two elements: a role-init chunk and the
actual content chunk. The grpc-server loop called attach_chat_deltas
for both elements with the same raw_result pointer, stamping the first
token's ChatDelta.Content on both replies. The Go side accumulated both,
emitting the first content token twice to SSE clients.

Fix: in the array iteration loops in PredictStream, detect role-init
elements (delta has "role" key) and skip attach_chat_deltas for them.
Only content/reasoning elements get chat deltas attached.

Reasoning models are unaffected because their first token goes into
reasoning_content, not content.
@mudler mudler force-pushed the fix/first-token-dup branch from bcd0d32 to 95c0a5d Compare April 9, 2026 20:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Regression: first streaming token duplicated in /v1/chat/completions

1 participant