Streaming /v1/chat/completions emits the same tool_call at multiple index values #9722

@matthewhelmke

Description

Here's what I have. I hope it is helpful.

Repo: github.com/mudler/LocalAI
Tested build: quay.io/go-skynet/local-ai:latest image (running 2026-05-07)
Backend: llama-cpp on Orca-Agent-v0.1.Q4_K_M.gguf

Summary

When the Hermes/NousResearch-style assistant emits a single {"name": "...", "arguments": {...}} JSON tool call, the streaming response delivers that single logical call as multiple tool_calls deltas at different index values (typically 3 unique indices, 5 total deltas). Non-streaming requests return exactly one tool_call.

OpenAI streaming clients that key tool-call accumulation by index (such as mudler/cogito, used by LocalAGI) end up with N duplicate ToolChoice entries and dispatch the tool N times.
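To make the failure mode concrete, here is a minimal sketch of the index-keyed accumulation such clients perform. The types and function names are illustrative, not taken from the cogito source; the point is only that deltas sharing an index merge into one call, so extra indices become extra dispatches.

```go
// Sketch of an OpenAI-style streaming client accumulating tool_call
// deltas by index. Names are illustrative, not from cogito.
package main

import "fmt"

type ToolCallDelta struct {
	Index     int
	Name      string
	Arguments string
}

// accumulate merges streamed deltas into one logical call per index:
// deltas with the same index are concatenated, while each distinct
// index becomes a distinct call -- so duplicate indices for the same
// logical call mean duplicate dispatches.
func accumulate(deltas []ToolCallDelta) map[int]ToolCallDelta {
	calls := map[int]ToolCallDelta{}
	for _, d := range deltas {
		c := calls[d.Index]
		c.Index = d.Index
		if d.Name != "" {
			c.Name = d.Name
		}
		c.Arguments += d.Arguments
		calls[d.Index] = c
	}
	return calls
}

func main() {
	// The pattern observed in this issue: one logical call, two indices.
	deltas := []ToolCallDelta{
		{Index: 1, Name: "bash", Arguments: `{"script":"ls /tmp | wc -l"}`},
		{Index: 0, Name: "bash"},
		{Index: 0, Arguments: `{"script":"ls /tmp | wc -l"}`},
	}
	calls := accumulate(deltas)
	fmt.Printf("dispatching %d tool call(s)\n", len(calls)) // 2, not 1
}
```

Running this against the delta sequence shown below yields two accumulated calls where the model only made one.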

Repro

curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "orca-agent-v0.1",
    "messages": [
      {"role":"system","content":"You are a function-calling assistant. Call exactly one tool."},
      {"role":"user","content":"Run ls /tmp | wc -l and tell me the count."}
    ],
    "tools": [
      {"type":"function","function":{"name":"bash","description":"Run a bash script","parameters":{"type":"object","properties":{"script":{"type":"string"}},"required":["script"]}}}
    ],
    "stream": true
  }' 2>&1 | grep -oE '"index":[0-9]+' | sort -u

Output:

"index":0
"index":1
"index":2

Same call with "stream": false returns exactly one tool_call at index: 0.

Observed streaming pattern

delta: index=1, name=bash, arguments={"script":"..."}      ← emitted first (full)
delta: index=0, name=bash, arguments=                      ← emitted second (empty args)
delta: index=0, arguments={"script":"..."}                 ← emitted third (continuation of index 0)

Three deltas, two of which are clearly the same logical call streamed in pieces at index 0, plus a redundant full emission at index 1. With longer prompt context, a third index 2 also appears.

Suspected cause

core/http/endpoints/openai/chat.go runs two tool-call parsers concurrently in the streaming loop:

  • The C++ chat-template autoparser delivers chatDeltas with pre-parsed tool_calls (this looks like the index-0 stream).
  • The Go iterative JSON parser (functions.ParseJSONIterative) also fires on cleanedResult — when content tokens accumulate enough JSON to parse, it emits another tool_call delta starting at Index: lastEmittedCount.

lastEmittedCount only guards the Go-parser path — it doesn't track what the C++ autoparser already streamed via chatDeltas. The two paths therefore double-emit the same logical call at different indices.

The deferred flush in chat_emit.go (buildDeferredToolCallChunks) does check lastEmittedCount, but functionResults at that point is sourced from EITHER ToolCallsFromChatDeltas(chatDeltas) OR ParseFunctionCall(cleanedResult, ...) — not deduped against what the streaming Go path already emitted. So a final flush can land on yet another index.

Knobs tried (all no-ops for this issue)

In function: block of model YAML:

  • disable_peg_parser: true — same 3 indices / 5 deltas (or worse, with the legacy iterative parser finding more matches).
  • grammar.no_mixed_free_string: true — no change.
  • parallel_tool_calls: false in the request body — silently ignored (/v1/chat/completions doesn't honor it; only /v1/responses reads parallel_tool_calls).

Suggested fixes

Pick whichever fits the architecture best:

  1. Track C++ autoparser emissions in lastEmittedCount so the Go iterative parser doesn't double up.
  2. Pick one parser per request based on whether chatDeltas is non-empty — if the C++ autoparser is producing tool calls, skip the Go iterative parser entirely (and vice versa). This is already the pattern used in the deferred flush at chat.go:366; it just needs to apply to the streaming loop too.
  3. Dedupe functionResults by (name, arguments) before the deferred flush, as a defensive net.
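For option 3, a minimal sketch of the defensive dedupe, assuming a simple (name, arguments) value type — the actual functionResults type in chat_emit.go may differ:

```go
// Sketch of suggested fix 3: dedupe function results by
// (name, arguments) before the deferred flush. Types are
// illustrative, not taken from chat_emit.go.
package main

import "fmt"

type FunctionCall struct {
	Name      string
	Arguments string
}

// dedupeCalls drops any call repeating an earlier (name, arguments)
// pair, preserving the order of first occurrence.
func dedupeCalls(calls []FunctionCall) []FunctionCall {
	seen := map[[2]string]bool{}
	out := make([]FunctionCall, 0, len(calls))
	for _, c := range calls {
		key := [2]string{c.Name, c.Arguments}
		if seen[key] {
			continue
		}
		seen[key] = true
		out = append(out, c)
	}
	return out
}

func main() {
	calls := []FunctionCall{
		{"bash", `{"script":"ls /tmp | wc -l"}`},
		// duplicate emission from the second parser path
		{"bash", `{"script":"ls /tmp | wc -l"}`},
	}
	fmt.Println(len(dedupeCalls(calls))) // prints 1
}
```

Note this only nets out exact duplicates; if the two parser paths produce semantically equal but byte-different argument JSON, a structural comparison would be needed, which is another reason options 1 or 2 are preferable as the primary fix.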

Happy to send a PR if a maintainer can confirm the preferred direction.

Workaround

Disable streaming on the client side. cogito falls back to non-streaming when streamCallback is nil and the response is clean.
