- Capture reasoning/thinking content in generation output, matching what
the TUI displays between tool calls
- Fix generation input/output: step 0 gets the user message, subsequent
steps get tool results from the previous step as input
- Structure generation output as { thinking, text, toolCalls } so each
LLM call is fully inspectable in Langfuse
- Also fix kimi-k2p5 TODO in transform.ts (resolved upstream)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This PR doesn't fully meet our contributing guidelines and PR template. What needs to be fixed:
Please edit this PR description to address the above within 2 hours, or it will be automatically closed. If you believe this was flagged incorrectly, please let a maintainer know.
Hey! Your PR title doesn't follow our naming convention. Please update it to start with one of the allowed prefixes. See CONTRIBUTING.md for details.
The following comment was made by an LLM, it may be inaccurate: Based on the search results, I found a potentially related PR: PR #6629: "feat(telemetry): add OpenTelemetry instrumentation with Aspire Dashboard support". This PR appears to be related to instrumentation, specifically for telemetry and OpenTelemetry. Since PR #20311 has the title "instrument" but lacks a detailed description, this existing PR on instrumentation could be a duplicate or related work. However, the PR description for #20311 is incomplete (just the template with no actual details filled in), so I cannot confirm if they are truly duplicates without more context about what #20311 is trying to accomplish.
Exactly — here's the description:
Langfuse Tracing for the OpenCode Agentic Loop
Why this exists
OpenCode's core loop is not a simple request/response — it's a multi-step agent that thinks, calls tools, gets results, thinks again, and repeats until it decides it's done. When you type a message in the TUI, a lot happens that's invisible: multiple LLM calls, chains of tool executions, reasoning traces, token accumulation. This instrumentation makes all of that visible in Langfuse as a structured trace, so you can read a session the same way you'd read source code.
The core loop, explained
The entry point is `prompt()` in `packages/opencode/src/session/prompt.ts`, which creates your user message and calls `loop()`. The loop is a `while (true)` that keeps running until the model decides it's done.
The key insight: one `while` iteration ≠ one LLM call. The AI SDK's `streamText()` handles an internal sub-loop: if the model calls tools, it sends results back and calls the model again, all within one `processor.process()` call. So one loop step can contain multiple LLM calls chained together.
Why we instrumented where we did
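To make this concrete, here is a hypothetical, heavily simplified sketch of the loop shape, with comments marking where the trace, step spans, and generations attach. The real code in `packages/opencode/src/session/prompt.ts` is more involved, and all names here (`processStep`, `LlmResult`) are illustrative, not real APIs:

```typescript
type ToolCall = { name: string; args: unknown };
type LlmResult = { thinking: string; text: string; toolCalls: ToolCall[] };

// Stand-in for processor.process() / the AI SDK's streamText(): one call here
// may itself chain several LLM calls when the model uses tools.
async function processStep(step: number): Promise<{ result: LlmResult; done: boolean }> {
  if (step === 0) {
    // First iteration: the model thinks, then asks for a tool.
    return {
      result: {
        thinking: "need the file",
        text: "",
        toolCalls: [{ name: "read", args: { path: "a.ts" } }],
      },
      done: false,
    };
  }
  // Later iteration: tool results are in context, the model finishes.
  return { result: { thinking: "enough context now", text: "done", toolCalls: [] }, done: true };
}

// One loop() invocation = one Langfuse trace.
async function loop(): Promise<LlmResult[]> {
  const results: LlmResult[] = [];
  for (let step = 0; ; step++) {
    // Each iteration = one loop.step-N span; the LLM calls inside
    // processStep() become llm-call-N generations under it.
    const { result, done } = await processStep(step);
    results.push(result);
    if (done) break; // the model decided it's done
  }
  return results;
}
```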
- Trace = one `loop()` invocation. This is the natural unit of a "coding session turn": one user message through to final response.
- `loop.step-N` span = one `while` iteration. Each iteration is one attempt to make progress: resolve the agent, call the LLM (possibly multiple times with tool use), and land on a result. Seeing steps lets you understand how many times the agent had to "go back" (e.g. compaction, subtasks, or continuing after tools).
- `llm-call-N` generation = one internal LLM call within a step. This is where the actual model activity is. Each generation captures `{ thinking, text, toolCalls }`: exactly what you see in the TUI, now queryable. This is the most granular and important level. It lets you answer: how much of the token budget is going to reasoning vs. output? Which tool results are being fed back in? Is the model thinking before or after tool calls?
- `tool.*` spans = individual tool executions. These are children of the iteration span, not the generation, because they happen between LLM calls: the model requests them, they execute, and their results become the input to the next LLM call. Seeing them as spans with timing lets you identify which tools are slow or returning large outputs.
The data flow as a trace
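As a rough picture, one turn might produce a tree like the following. This is a plain in-memory structure showing the parent/child relationships, not the Langfuse SDK:

```typescript
// Hypothetical trace tree; span names follow the loop()/loop.step-N/llm-call-N
// convention described above, but the builder itself is illustrative.
type Node = { name: string; children: Node[] };
const span = (name: string, children: Node[] = []): Node => ({ name, children });

// One user message → one trace. Step 0 needs a tool, so it contains two
// LLM calls with the tool execution between them, all children of the step.
const trace = span("loop()", [
  span("loop.step-0", [
    span("llm-call-0"), // model reasons, requests the read tool
    span("tool.read"),  // runs between LLM calls → child of the step span
    span("llm-call-1"), // model consumes the tool result, answers
  ]),
]);
```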
The rising input token counts across `llm-call-N` within one step show context accumulation: the model is carrying more and more tool results forward. The reasoning tokens tell you how much of the output budget went to thinking vs. the actual response.
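For instance, the pattern looks like this with made-up usage numbers (every figure below is illustrative, not from a real trace):

```typescript
// Illustrative usage for three chained LLM calls within one step. Input
// tokens rise as each tool result is appended to the context.
type Usage = { inputTokens: number; outputTokens: number; reasoningTokens: number };

const llmCalls: Usage[] = [
  { inputTokens: 1200, outputTokens: 40, reasoningTokens: 300 },  // llm-call-0
  { inputTokens: 1900, outputTokens: 30, reasoningTokens: 250 },  // llm-call-1: + first tool result
  { inputTokens: 2600, outputTokens: 220, reasoningTokens: 400 }, // llm-call-2: final answer
];

// Context accumulation: strictly rising input counts across the step.
const accumulating = llmCalls.every(
  (u, i) => i === 0 || u.inputTokens > llmCalls[i - 1].inputTokens,
);

// Share of each call's output budget spent on reasoning vs. visible text.
const reasoningShare = llmCalls.map(
  (u) => u.reasoningTokens / (u.reasoningTokens + u.outputTokens),
);
```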