Streaming /v1/chat/completions emits the same tool_call at multiple index values #9722

@matthewhelmke

Description

Here's what I have. I hope it is helpful.

Repo: github.com/mudler/LocalAI
Tested build: quay.io/go-skynet/local-ai:latest image (running 2026-05-07)
Backend: llama-cpp on Orca-Agent-v0.1.Q4_K_M.gguf

Summary

When the Hermes/NousResearch-style assistant emits a single {"name": "...", "arguments": {...}} JSON tool call, the streaming response delivers that single logical call as multiple tool_calls deltas at different index values (typically 3 unique indices, 5 total deltas). Non-streaming requests return exactly one tool_call.

OpenAI streaming clients that key tool-call accumulation by index (such as mudler/cogito, used by LocalAGI) end up with N duplicate ToolChoice entries and dispatch the tool N times.
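To make the failure mode concrete, here is a minimal sketch of the index-keyed accumulation such clients perform. The types and function names are illustrative, not taken from the cogito source; the point is only that deltas sharing an index merge into one call, so extra indices become extra dispatches.

```go
// Sketch of an OpenAI-style streaming client accumulating tool_call
// deltas by index. Names are illustrative, not from cogito.
package main

import "fmt"

type ToolCallDelta struct {
	Index     int
	Name      string
	Arguments string
}

// accumulate merges streamed deltas into one logical call per index:
// deltas with the same index are concatenated, while each distinct
// index becomes a distinct call -- so duplicate indices for the same
// logical call mean duplicate dispatches.
func accumulate(deltas []ToolCallDelta) map[int]ToolCallDelta {
	calls := map[int]ToolCallDelta{}
	for _, d := range deltas {
		c := calls[d.Index]
		c.Index = d.Index
		if d.Name != "" {
			c.Name = d.Name
		}
		c.Arguments += d.Arguments
		calls[d.Index] = c
	}
	return calls
}

func main() {
	// The pattern observed in this issue: one logical call, two indices.
	deltas := []ToolCallDelta{
		{Index: 1, Name: "bash", Arguments: `{"script":"ls /tmp | wc -l"}`},
		{Index: 0, Name: "bash"},
		{Index: 0, Arguments: `{"script":"ls /tmp | wc -l"}`},
	}
	calls := accumulate(deltas)
	fmt.Printf("dispatching %d tool call(s)\n", len(calls)) // 2, not 1
}
```

Running this against the delta sequence shown below yields two accumulated calls where the model only made one.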

Repro

curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "orca-agent-v0.1",
    "messages": [
      {"role":"system","content":"You are a function-calling assistant. Call exactly one tool."},
      {"role":"user","content":"Run ls /tmp | wc -l and tell me the count."}
    ],
    "tools": [
      {"type":"function","function":{"name":"bash","description":"Run a bash script","parameters":{"type":"object","properties":{"script":{"type":"string"}},"required":["script"]}}}
    ],
    "stream": true
  }' 2>&1 | grep -oE '"index":[0-9]+' | sort -u

Output:

"index":0
"index":1
"index":2

Same call with "stream": false returns exactly one tool_call at index: 0.

Observed streaming pattern

delta: index=1, name=bash, arguments={"script":"..."}      ← emitted first (full)
delta: index=0, name=bash, arguments=                      ← emitted second (empty args)
delta: index=0, arguments={"script":"..."}                 ← emitted third (continuation of index 0)

Three deltas, two of which are clearly the same logical call streamed in pieces at index 0, plus a redundant full emission at index 1. With longer prompt context, a third index 2 also appears.

Suspected cause

core/http/endpoints/openai/chat.go runs two tool-call parsers concurrently in the streaming loop:

  • The C++ chat-template autoparser delivers chatDeltas with pre-parsed tool_calls (this looks like the index-0 stream).
  • The Go iterative JSON parser (functions.ParseJSONIterative) also fires on cleanedResult — when content tokens accumulate enough JSON to parse, it emits another tool_call delta starting at Index: lastEmittedCount.

lastEmittedCount only guards the Go-parser path — it doesn't track what the C++ autoparser already streamed via chatDeltas. The two paths therefore double-emit the same logical call at different indices.

The deferred flush in chat_emit.go (buildDeferredToolCallChunks) does check lastEmittedCount, but functionResults at that point is sourced from EITHER ToolCallsFromChatDeltas(chatDeltas) OR ParseFunctionCall(cleanedResult, ...) — not deduped against what the streaming Go path already emitted. So a final flush can land on yet another index.

Knobs tried (all no-ops for this issue)

In function: block of model YAML:

  • disable_peg_parser: true — same 3 indices / 5 deltas (or worse, with the legacy iterative parser finding more matches).
  • grammar.no_mixed_free_string: true — no change.
  • parallel_tool_calls: false in the request body — silently ignored (/v1/chat/completions doesn't honor it; only /v1/responses reads parallel_tool_calls).

Suggested fixes

Pick whichever fits the architecture best:

  1. Track C++ autoparser emissions in lastEmittedCount so the Go iterative parser doesn't double up.
  2. Pick one parser per request based on whether chatDeltas is non-empty — if the C++ autoparser is producing tool calls, skip the Go iterative parser entirely (and vice versa). This is already the pattern used in the deferred flush at chat.go:366; it just needs to apply to the streaming loop too.
  3. Dedupe functionResults by (name, arguments) before the deferred flush, as a defensive net.
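For option 3, a minimal sketch of the defensive dedupe, assuming a simple (name, arguments) value type — the actual functionResults type in chat_emit.go may differ:

```go
// Sketch of suggested fix 3: dedupe function results by
// (name, arguments) before the deferred flush. Types are
// illustrative, not taken from chat_emit.go.
package main

import "fmt"

type FunctionCall struct {
	Name      string
	Arguments string
}

// dedupeCalls drops any call repeating an earlier (name, arguments)
// pair, preserving the order of first occurrence.
func dedupeCalls(calls []FunctionCall) []FunctionCall {
	seen := map[[2]string]bool{}
	out := make([]FunctionCall, 0, len(calls))
	for _, c := range calls {
		key := [2]string{c.Name, c.Arguments}
		if seen[key] {
			continue
		}
		seen[key] = true
		out = append(out, c)
	}
	return out
}

func main() {
	calls := []FunctionCall{
		{"bash", `{"script":"ls /tmp | wc -l"}`},
		// duplicate emission from the second parser path
		{"bash", `{"script":"ls /tmp | wc -l"}`},
	}
	fmt.Println(len(dedupeCalls(calls))) // prints 1
}
```

Note this only nets out exact duplicates; if the two parser paths produce semantically equal but byte-different argument JSON, a structural comparison would be needed, which is another reason options 1 or 2 are preferable as the primary fix.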

Happy to send a PR if a maintainer can confirm the preferred direction.

Workaround

Disable streaming on the client side. cogito falls back to non-streaming when streamCallback is nil and the response is clean.
