
instrument #20311

Closed
harrisonchu wants to merge 3 commits into anomalyco:dev from harrisonchu:claude/sharp-euler

Conversation


@harrisonchu harrisonchu commented Mar 31, 2026

Exactly. Here's the description:


Langfuse Tracing for the OpenCode Agentic Loop

Why this exists

OpenCode's core loop is not a simple request/response — it's a multi-step agent that thinks, calls tools, gets results, thinks again, and repeats until it decides it's done. When you type a message in the TUI, a lot happens that's invisible: multiple LLM calls, chains of tool executions, reasoning traces, token accumulation. This instrumentation makes all of that visible in Langfuse as a structured trace, so you can read a session the same way you'd read source code.


The core loop, explained

The entry point is prompt() in packages/opencode/src/session/prompt.ts, which creates your user message and calls loop(). The loop is a while(true) that keeps running until the model decides it's done:

loop():
  while (true):
    1. Load all messages for the session
    2. Check exit: did the model finish with a non-tool reason? → break
    3. Handle special cases: pending subtasks, context compaction
    4. NORMAL PATH:
       a. Resolve agent + available tools
       b. Create a SessionProcessor
       c. processor.process() → calls LLM.stream() → AI SDK's streamText()
       d. The stream emits events processed in order:
            start-step
              reasoning-start/delta/end   ← "thinking" blocks
              text-start/delta/end        ← response text
              tool-input-start/delta/end  ← tool args streaming in
              tool-call                   ← tool execution begins
              tool-result                 ← tool execution completes
            finish-step                   ← token usage captured here
            (repeat if finish reason = "tool-calls")
       e. result = "continue" | "stop" | "compact"
    5. If finish reason is "tool-calls" → loop again
       Otherwise → break

The key insight: one while iteration ≠ one LLM call. The AI SDK's streamText() handles an internal sub-loop: if the model calls tools, it sends results back and calls the model again — all within one processor.process() call. So one loop step can contain multiple LLM calls chained together.
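To make that sub-loop concrete, here is a minimal self-contained TypeScript sketch with a stubbed model standing in for the AI SDK's streamText(). Names like fakeModel, runTool, and processOnce are hypothetical illustrations, not OpenCode or AI SDK APIs:

```typescript
// One simulated step result from the model.
type StepResult = { finishReason: "tool-calls" | "stop"; toolCalls: string[] };

// Stubbed model: requests tools on the first two calls, then finishes with text.
function fakeModel(callIndex: number): StepResult {
  if (callIndex < 2) return { finishReason: "tool-calls", toolCalls: ["read"] };
  return { finishReason: "stop", toolCalls: [] };
}

// Placeholder tool execution.
function runTool(name: string): string {
  return `result of ${name}`;
}

// One "processor.process()": keeps calling the model until it stops
// requesting tools. This whole function is ONE while-iteration of loop().
function processOnce(): number {
  let llmCalls = 0;
  for (;;) {
    const step = fakeModel(llmCalls);
    llmCalls++; // each pass here corresponds to one llm-call-N generation
    if (step.finishReason !== "tool-calls") break;
    step.toolCalls.forEach(runTool); // results feed the next call's input
  }
  return llmCalls;
}

console.log(processOnce()); // 3 LLM calls inside a single process() invocation
```

Three model calls, two tool executions, one process() call: exactly the "one loop step can contain multiple LLM calls" shape described above.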


Why we instrumented where we did

Trace = one loop() invocation. This is the natural unit of a "coding session turn" — one user message through to final response.

loop.step-N span = one while iteration. Each iteration is one attempt to make progress: resolve the agent, call the LLM (possibly multiple times with tool use), and land on a result. Seeing steps lets you understand how many times the agent had to "go back" — e.g. compaction, subtasks, or continuing after tools.

llm-call-N generation = one internal LLM call within a step. This is where the actual model activity is. Each generation captures:

  • Input: what the model received — user message on call 0, tool results on subsequent calls
  • Output: { thinking, text, toolCalls } — exactly what you see in the TUI, now queryable
  • Usage: input/output/reasoning tokens, cache hits, cost per call

This is the most granular and important level. It lets you answer: how much of the token budget is going to reasoning vs. output? Which tool results are being fed back in? Is the model thinking before or after tool calls?
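A minimal sketch of how the { thinking, text, toolCalls } output could be folded together from the ordered stream events. The event names follow the list in the loop diagram above; the reducer itself is illustrative, not OpenCode's actual processor code:

```typescript
// Simplified shapes for the relevant stream events, in emission order.
type Event =
  | { type: "reasoning-delta"; text: string }
  | { type: "text-delta"; text: string }
  | { type: "tool-call"; name: string };

// Fold the event stream into the generation output recorded in Langfuse.
function toGenerationOutput(events: Event[]) {
  const out = { thinking: "", text: "", toolCalls: [] as string[] };
  for (const e of events) {
    if (e.type === "reasoning-delta") out.thinking += e.text;
    else if (e.type === "text-delta") out.text += e.text;
    else out.toolCalls.push(e.name);
  }
  return out;
}

console.log(
  toGenerationOutput([
    { type: "reasoning-delta", text: "Let me find TODOs..." },
    { type: "tool-call", name: "read" },
    { type: "tool-call", name: "read" },
  ]),
);
// { thinking: "Let me find TODOs...", text: "", toolCalls: [ "read", "read" ] }
```

Because the fold preserves event order, the recorded output mirrors exactly what the TUI rendered for that call.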

tool.* spans = individual tool executions. These are children of the iteration span, not the generation, because they happen between LLM calls — the model requests them, they execute, and their results become the input to the next LLM call. Seeing them as spans with timing lets you identify which tools are slow or returning large outputs.
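The parenting decision can be stated as a small self-contained model of the trace tree (plain objects, not the Langfuse SDK): tool spans are siblings of the llm-call-N generations under the iteration span, never children of a generation.

```typescript
// Minimal tree model of the intended span hierarchy.
type Node = { name: string; children: Node[] };

const node = (name: string, ...children: Node[]): Node => ({ name, children });

const trace = node(
  "opencode.loop",
  node(
    "loop.step-1",
    node("llm-call-0"),
    node("tool.read"), // child of the step span, NOT of llm-call-0
    node("llm-call-1"),
  ),
);

// Every tool span's parent is the iteration span:
const step = trace.children[0];
console.log(step.children.map((c) => c.name).join(", "));
// llm-call-0, tool.read, llm-call-1
```

Keeping tools at the step level means their wall-clock timing sits between the two generations that bracket them, which is exactly when they ran.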


The data flow as a trace

trace: opencode.loop  (input = user message)
  loop.step-1
    llm-call-0   input: "fix a simple TODO"
                 output: { thinking: "Let me find TODOs...", toolCalls: [read, read] }
                 usage: 1056 in → 170 out
    tool.read    input: { path: "config.ts" }  output: "file contents..."
    tool.read    input: { path: "bun/index.ts" } output: "..."
    llm-call-1   input: [tool results]
                 output: { thinking: "These reference a Bun issue, let me check...", toolCalls: [webfetch] }
                 usage: 581 in → 253 out
    tool.webfetch ...
    llm-call-2   input: [tool results]
                 output: { thinking: "Issue is resolved, safe to remove", text: "Here's the fix..." }
                 usage: 680 in → 76 out

The rising input token counts across llm-call-N within one step show context accumulation — the model is carrying more and more tool results forward. The reasoning tokens tell you how much of the output budget went to thinking vs. actual response.
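Rolling the per-call usage from the example trace above into a per-turn total is a one-line aggregation (the numbers are copied from the trace; the roll-up itself is a sketch of what you would expect to see on the step span in Langfuse):

```typescript
// Per-call usage from the example trace above.
const usage = [
  { input: 1056, output: 170 }, // llm-call-0
  { input: 581, output: 253 },  // llm-call-1
  { input: 680, output: 76 },   // llm-call-2
];

// Total tokens consumed by the whole loop.step-1.
const total = usage.reduce(
  (acc, u) => ({ input: acc.input + u.input, output: acc.output + u.output }),
  { input: 0, output: 0 },
);
console.log(total); // { input: 2317, output: 499 }
```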

harrisonchu and others added 3 commits March 31, 2026 12:03
…/output

- Capture reasoning/thinking content in generation output, matching what
  the TUI displays between tool calls
- Fix generation input/output: step 0 gets the user message, subsequent
  steps get tool results from the previous step as input
- Structure generation output as { thinking, text, toolCalls } so each
  LLM call is fully inspectable in Langfuse
- Also fix kimi-k2p5 TODO in transform.ts (resolved upstream)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the needs:compliance This means the issue will auto-close after 2 hours. label Mar 31, 2026

github-actions bot commented Mar 31, 2026

This PR doesn't fully meet our contributing guidelines and PR template.

What needs to be fixed:

  • PR description is missing required template sections. Please use the PR template.

Please edit this PR description to address the above within 2 hours, or it will be automatically closed.

If you believe this was flagged incorrectly, please let a maintainer know.

@github-actions

Hey! Your PR title "instrument" doesn't follow conventional commit format.

Please update it to start with one of:

  • feat: or feat(scope): new feature
  • fix: or fix(scope): bug fix
  • docs: or docs(scope): documentation changes
  • chore: or chore(scope): maintenance tasks
  • refactor: or refactor(scope): code refactoring
  • test: or test(scope): adding or updating tests

Where scope is the package name (e.g., app, desktop, opencode).

See CONTRIBUTING.md for details.

@github-actions

The following comment was made by an LLM; it may be inaccurate:

Based on the search results, I found a potentially related PR:

PR #6629: "feat(telemetry): add OpenTelemetry instrumentation with Aspire Dashboard support"
#6629

This PR appears to be related to instrumentation, specifically for telemetry and OpenTelemetry. Since PR #20311 has the title "instrument" but lacks a detailed description, this existing PR on instrumentation could be a duplicate or related work.

However, the PR description for #20311 is incomplete (just the template with no actual details filled in), so I cannot confirm if they are truly duplicates without more context about what #20311 is trying to accomplish.

