# Streaming `/v1/chat/completions` emits the same `tool_call` at multiple `index` values
Repo: github.com/mudler/LocalAI
Tested build: `quay.io/go-skynet/local-ai:latest` image (running 2026-05-07)
Backend: llama-cpp on `Orca-Agent-v0.1.Q4_K_M.gguf`
## Summary

When the Hermes/NousResearch-style assistant emits a single `{"name": "...", "arguments": {...}}` JSON tool call, the streaming response delivers that one logical call as multiple `tool_calls` deltas at different `index` values (typically 3 unique indices, 5 total deltas). Non-streaming requests return exactly one `tool_call`.

OpenAI streaming clients that key tool-call accumulation by `index` (such as `mudler/cogito`, used by LocalAGI) end up with N duplicate `ToolChoice` entries and dispatch the tool N times.
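To make the client-side impact concrete, here is a minimal sketch of the index-keyed accumulation that the OpenAI streaming format implies, fed with the delta pattern observed in this report. The `toolDelta` type and both helpers are hypothetical simplifications, not cogito's actual code:

```go
package main

import "fmt"

// toolDelta mirrors one streamed tool_calls delta
// (hypothetical struct; cogito's actual types differ).
type toolDelta struct {
	Index int
	Name  string
	Args  string
}

// observedDeltas reproduces the streaming pattern from this report:
// a full emission at index 1, then the same call in pieces at index 0.
func observedDeltas() []toolDelta {
	return []toolDelta{
		{Index: 1, Name: "bash", Args: `{"script":"..."}`},
		{Index: 0, Name: "bash", Args: ""},
		{Index: 0, Args: `{"script":"..."}`},
	}
}

// accumulate does what a spec-compliant OpenAI streaming client does:
// key by Index, concatenating argument fragments per index.
func accumulate(deltas []toolDelta) map[int]*toolDelta {
	calls := make(map[int]*toolDelta)
	for _, d := range deltas {
		if c, ok := calls[d.Index]; ok {
			c.Args += d.Args // continuation chunk for this index
			if d.Name != "" {
				c.Name = d.Name
			}
		} else {
			dd := d
			calls[d.Index] = &dd
		}
	}
	return calls
}

func main() {
	// One logical call in, two dispatched calls out.
	fmt.Printf("logical tool calls seen by the client: %d\n",
		len(accumulate(observedDeltas())))
}
```

Running this prints `logical tool calls seen by the client: 2` for what the model intended as a single call, which is exactly the N-times dispatch described above.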
## Repro

```shell
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "orca-agent-v0.1",
    "messages": [
      {"role":"system","content":"You are a function-calling assistant. Call exactly one tool."},
      {"role":"user","content":"Run ls /tmp | wc -l and tell me the count."}
    ],
    "tools": [
      {"type":"function","function":{"name":"bash","description":"Run a bash script","parameters":{"type":"object","properties":{"script":{"type":"string"}},"required":["script"]}}}
    ],
    "stream": true
  }' 2>&1 | grep -oE '"index":[0-9]+' | sort -u
```
Output:

```
"index":0
"index":1
"index":2
```

The same call with `"stream": false` returns exactly one `tool_call` at `index: 0`.
## Observed streaming pattern

```
delta: index=1, name=bash, arguments={"script":"..."}  ← emitted first (full)
delta: index=0, name=bash, arguments=                  ← emitted second (empty args)
delta: index=0, arguments={"script":"..."}             ← emitted third (continuation of index 0)
```

Three deltas: two are clearly the same logical call streamed in pieces at index 0, plus a redundant full emission at index 1. With longer prompt context, a third `index 2` also appears.
## Suspected cause

`core/http/endpoints/openai/chat.go` runs two tool-call parsers concurrently in the streaming loop:

- The C++ chat-template autoparser delivers `chatDeltas` with pre-parsed `tool_calls` (this looks like the index-0 stream).
- The Go iterative JSON parser (`functions.ParseJSONIterative`) also fires on `cleanedResult`: when content tokens accumulate enough JSON to parse, it emits another `tool_call` delta starting at `Index: lastEmittedCount`.

`lastEmittedCount` only guards the Go-parser path; it doesn't track what the C++ autoparser already streamed via `chatDeltas`. The two paths therefore double-emit the same logical call at different indices.

The deferred flush in `chat_emit.go` (`buildDeferredToolCallChunks`) does check `lastEmittedCount`, but `functionResults` at that point is sourced from *either* `ToolCallsFromChatDeltas(chatDeltas)` *or* `ParseFunctionCall(cleanedResult, ...)`, not deduped against what the streaming Go path already emitted. So a final flush can land on yet another index.
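The double emission can be illustrated in isolation. The sketch below is a loose, hypothetical simplification of the streaming loop's bookkeeping (the `state` type and both methods are invented for illustration; the real code lives in `chat.go`): a counter that only one of two emitting paths advances lets the other path re-emit the same call.

```go
package main

import "fmt"

type call struct {
	index int
	name  string
}

// state loosely mirrors the streaming loop's bookkeeping:
// lastEmittedCount guards only the Go-parser path.
type state struct {
	lastEmittedCount int
	emitted          []call
}

// emitFromAutoparser models the C++ chat-template path: it streams
// pre-parsed chatDeltas but never touches lastEmittedCount.
func (s *state) emitFromAutoparser(name string, idx int) {
	s.emitted = append(s.emitted, call{index: idx, name: name})
}

// emitFromGoParser models the iterative JSON parser: it emits every
// parsed call at position >= lastEmittedCount, then advances the counter.
func (s *state) emitFromGoParser(parsed []string) {
	for i := s.lastEmittedCount; i < len(parsed); i++ {
		s.emitted = append(s.emitted, call{index: i, name: parsed[i]})
	}
	s.lastEmittedCount = len(parsed)
}

func main() {
	s := &state{}

	// The C++ autoparser streams the single logical call first...
	s.emitFromAutoparser("bash", 0)

	// ...then the Go parser re-parses the same accumulated JSON.
	// lastEmittedCount is still 0, so the same call goes out again.
	s.emitFromGoParser([]string{"bash"})

	fmt.Printf("deltas emitted for one logical call: %d\n", len(s.emitted))
}
```

This prints `deltas emitted for one logical call: 2`; bumping `lastEmittedCount` inside `emitFromAutoparser` as well would make the second loop a no-op.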
## Knobs tried (all no-ops for this issue)

In the `function:` block of the model YAML:

- `disable_peg_parser: true`: same 3 indices / 5 deltas (or worse, with the legacy iterative parser finding more matches).
- `grammar.no_mixed_free_string: true`: no change.

In the request body:

- `parallel_tool_calls: false`: silently ignored (`/v1/chat/completions` doesn't honor it; only `/v1/responses` reads `parallel_tool_calls`).
## Suggested fixes

Pick whichever fits the architecture best:

- Track C++ autoparser emissions in `lastEmittedCount` so the Go iterative parser doesn't double up.
- Pick one parser per request based on whether `chatDeltas` is non-empty: if the C++ autoparser is producing tool calls, skip the Go iterative parser entirely (and vice versa). This is already the pattern used in the deferred flush at `chat.go:366`; it just needs to apply to the streaming loop too.
- Dedupe `functionResults` by `(name, arguments)` before the deferred flush, as a defensive net.
Happy to send a PR if a maintainer can confirm the preferred direction.
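For the defensive-net option, the dedupe step could look like the following sketch (the `toolCall` struct and `dedupeCalls` helper are hypothetical, not LocalAI's actual types):

```go
package main

import "fmt"

// toolCall stands in for an entry in functionResults
// (hypothetical struct; LocalAI's actual type differs).
type toolCall struct {
	Name      string
	Arguments string
}

// dedupeCalls keeps the first occurrence of each (name, arguments)
// pair, preserving order, as a last line of defense before the
// deferred flush.
func dedupeCalls(calls []toolCall) []toolCall {
	seen := make(map[toolCall]bool)
	var out []toolCall
	for _, c := range calls {
		if !seen[c] {
			seen[c] = true
			out = append(out, c)
		}
	}
	return out
}

func main() {
	// The duplicated emissions from this report collapse to one call.
	calls := []toolCall{
		{Name: "bash", Arguments: `{"script":"..."}`},
		{Name: "bash", Arguments: `{"script":"..."}`},
	}
	fmt.Println(len(dedupeCalls(calls))) // 1
}
```

Because the struct holds only comparable fields, it can serve directly as a map key; genuinely distinct parallel calls (different arguments) are untouched.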
## Workaround

Disable streaming on the client side. cogito falls back to non-streaming when `streamCallback` is `nil`, and the resulting response is clean.