Support OpenAI Responses API and enrich OpenAI v2 GenAI telemetry #209

@sipercai

Summary

LoongSuite currently provides OpenAI v2 instrumentation for Chat Completions and Embeddings, but the latest OpenAI Python SDK exposes additional API surfaces that are important for modern GenAI workloads. The largest gap is the OpenAI Responses API, which is now the primary model interaction API in the OpenAI SDK and is also used by agentic workflows.

This issue tracks adding OpenAI Responses API instrumentation and enriching OpenAI v2 telemetry while keeping the implementation aligned with LoongSuite's existing opentelemetry.util.genai helpers and GenAI semantic-convention behavior.

Current coverage

The current opentelemetry-instrumentation-openai-v2 instrumentation wraps:

  • openai.resources.chat.completions.Completions.create
  • openai.resources.chat.completions.AsyncCompletions.create
  • openai.resources.embeddings.Embeddings.create
  • openai.resources.embeddings.AsyncEmbeddings.create

Chat Completions already has sync, async, streaming, raw-response, tool-call, content-capture, metrics, and error-path test coverage. Embeddings is covered for sync/async calls, token metrics, dimensions, encoding format, and error paths, but it still uses a direct tracer/metrics path instead of the newer TelemetryHandler flow used by Chat Completions in experimental semconv mode.

Gaps

P0: Responses API is not instrumented

The OpenAI Python SDK exposes client.responses.create, client.responses.stream, async variants, and related streaming events. LoongSuite currently does not wrap openai.resources.responses, so calls through the Responses API do not produce OpenAI GenAI spans, events, or token metrics.

The new instrumentation should cover:

  • sync Responses.create
  • async AsyncResponses.create
  • stream=True
  • Responses.stream / async streaming helpers
  • raw-response parsing and context-manager usage
  • success, incomplete, failed, cancelled, and exception paths
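The terminal paths listed above could be normalized into one small outcome record before the span is ended. The following is a minimal sketch, not the planned implementation: `Outcome`, `classify_response`, and `classify_exception` are hypothetical names, and the choices to report a completed response as finish reason "stop" and to treat "cancelled" as an error are assumptions that would need to be checked against LoongSuite's existing Chat Completions behavior.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: these terminal statuses are reported as errors on the span.
_ERROR_STATUSES = {"failed", "cancelled"}


@dataclass
class Outcome:
    """Hypothetical summary of how a Responses call ended."""

    finish_reason: Optional[str]
    error_type: Optional[str]


def classify_response(status: Optional[str],
                      incomplete_reason: Optional[str] = None) -> Outcome:
    """Map a response status (and optional incomplete reason) to an outcome."""
    if status == "completed":
        # Assumption: reuse the "stop" finish reason for a normal completion.
        return Outcome(finish_reason="stop", error_type=None)
    if status == "incomplete":
        # The reason is typically something like "max_output_tokens".
        return Outcome(finish_reason=incomplete_reason or "incomplete",
                       error_type=None)
    if status in _ERROR_STATUSES:
        return Outcome(finish_reason=status, error_type=status)
    # Non-terminal or unknown status: record nothing rather than guess.
    return Outcome(finish_reason=None, error_type=None)


def classify_exception(exc: BaseException) -> Outcome:
    """Exception path: the call raised before any response was returned."""
    return Outcome(finish_reason="error", error_type=type(exc).__qualname__)
```

Keeping this mapping in one place would let the sync, async, and streaming wrappers share the same decision about which statuses mark a span as failed.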

P0: Responses streaming needs a dedicated accumulator

Responses streaming emits a different set of events from Chat Completions streaming. The instrumentation should aggregate the final response state from the terminal completion/done events, preserve context across sync and async iteration, and end spans reliably when a stream is exhausted, closed, or fails.
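One possible shape for the sync side of such an accumulator is sketched below. `ResponseStreamWrapper` and its `on_end` callback are hypothetical names; the sketch assumes events expose a `type` string such as "response.output_text.delta" or "response.completed", and that terminal events carry the final object on `.response`. An async variant would mirror it with `__aiter__` / `__anext__` and is omitted here.

```python
from types import SimpleNamespace
from typing import Any, Callable, Iterable, List, Optional

# Assumption: these event types carry the final response on `.response`.
TERMINAL_EVENTS = {"response.completed", "response.incomplete", "response.failed"}


class ResponseStreamWrapper:
    """Hypothetical wrapper: forwards events, tracks final state, ends once."""

    def __init__(self, stream: Iterable[Any],
                 on_end: Callable[[Any, str, Optional[BaseException]], None]):
        self._iter = iter(stream)
        self._on_end = on_end
        self._text_parts: List[str] = []
        self._final_response: Any = None
        self._ended = False

    def __iter__(self):
        return self

    def __next__(self):
        try:
            event = next(self._iter)
        except StopIteration:
            self._finish(None)  # stream exhausted
            raise
        except Exception as exc:
            self._finish(exc)  # stream failed mid-iteration
            raise
        self._observe(event)
        return event

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc is not None:
            self._finish(exc)
        self.close()
        return False

    def close(self):
        close = getattr(self._iter, "close", None)
        if callable(close):
            close()
        self._finish(None)  # closed early by the caller

    def _observe(self, event: Any) -> None:
        event_type = getattr(event, "type", None)
        if event_type == "response.output_text.delta":
            self._text_parts.append(event.delta)
        elif event_type in TERMINAL_EVENTS:
            self._final_response = event.response

    def _finish(self, error: Optional[BaseException]) -> None:
        if self._ended:  # idempotent: the span must end exactly once
            return
        self._ended = True
        self._on_end(self._final_response, "".join(self._text_parts), error)


# Usage with stand-in events (not real SDK objects).
events = [
    SimpleNamespace(type="response.output_text.delta", delta="Hel"),
    SimpleNamespace(type="response.output_text.delta", delta="lo"),
    SimpleNamespace(type="response.completed",
                    response=SimpleNamespace(status="completed")),
]
ended = []
for _ in ResponseStreamWrapper(
        events, lambda resp, text, err: ended.append((resp.status, text, err))):
    pass
```

The idempotent `_finish` is the key design choice: exhaustion, an explicit `close()`, and a context-manager exit can all race to end the span, and only the first should win.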

P1: Reuse opentelemetry.util.genai consistently

The Responses API implementation should reuse the shared GenAI utilities for:

  • span lifecycle
  • semantic-convention attribute mapping
  • content-capture mode handling
  • message/tool content serialization
  • metrics emission
  • error handling

Embeddings should also be evaluated for migration to the shared opentelemetry.util.genai path. If the existing LLMInvocation is not a good fit, opentelemetry.util.genai should be extended with a reusable embedding invocation shape.

P1: Enrich token and response metadata

Responses and newer OpenAI models expose useful metadata that is not fully represented today, including token detail fields such as cached tokens, reasoning tokens, and audio token details. The implementation should capture stable semantic-convention fields where available and use clearly documented LoongSuite extension attributes only when no stable semconv field exists.
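As an illustration of the split between semconv fields and extension attributes, the sketch below flattens a Responses-style usage object into span attributes. It assumes the usage object exposes `input_tokens`, `output_tokens`, `input_tokens_details.cached_tokens`, and `output_tokens_details.reasoning_tokens`. The two `loongsuite.openai.*` keys are hypothetical placeholders, not agreed attribute names, and audio token details are left out.

```python
from types import SimpleNamespace
from typing import Any, Dict

# Hypothetical extension attribute names, used only for illustration.
CACHED_TOKENS_ATTR = "loongsuite.openai.usage.cached_tokens"
REASONING_TOKENS_ATTR = "loongsuite.openai.usage.reasoning_tokens"


def _get(obj: Any, name: str) -> Any:
    """Read a field from either an SDK-style object or a plain dict."""
    if obj is None:
        return None
    if isinstance(obj, dict):
        return obj.get(name)
    return getattr(obj, name, None)


def usage_attributes(usage: Any) -> Dict[str, int]:
    """Return only the token attributes that are actually present."""
    candidates = {
        "gen_ai.usage.input_tokens": _get(usage, "input_tokens"),
        "gen_ai.usage.output_tokens": _get(usage, "output_tokens"),
        CACHED_TOKENS_ATTR: _get(
            _get(usage, "input_tokens_details"), "cached_tokens"),
        REASONING_TOKENS_ATTR: _get(
            _get(usage, "output_tokens_details"), "reasoning_tokens"),
    }
    # Missing or non-integer values are skipped instead of being set to 0,
    # so "not reported" stays distinguishable from "zero tokens".
    return {key: value for key, value in candidates.items()
            if isinstance(value, int)}


# Usage with a stand-in object (not a real SDK response).
example_usage = SimpleNamespace(
    input_tokens=120,
    output_tokens=30,
    input_tokens_details=SimpleNamespace(cached_tokens=100),
    output_tokens_details=SimpleNamespace(reasoning_tokens=12),
)
example_attrs = usage_attributes(example_usage)
```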

P1: Capture tool calls and structured output safely

The instrumentation should support function tools and built-in Responses API tools such as web search, file search, code interpreter, and computer-use style outputs where the SDK exposes them. Structured output schemas and tool arguments should only be captured when content capture is enabled.
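The capture rule above could look roughly like the following sketch, which summarizes tool-call items from a response's output list and attaches arguments only when content capture is on. `tool_call_record` and `tool_call_records` are hypothetical helpers, items are modeled as plain dicts instead of SDK objects, and detecting tool calls by a `_call` type suffix (for example "function_call" or "web_search_call") is a heuristic assumption.

```python
from typing import Any, Dict, List, Optional


def tool_call_record(item: Dict[str, Any],
                     capture_content: bool) -> Optional[Dict[str, Any]]:
    """Summarize one output item, or return None if it is not a tool call."""
    item_type = item.get("type", "")
    # Heuristic assumption: tool-call output items have a type ending "_call".
    if not item_type.endswith("_call"):
        return None
    record: Dict[str, Any] = {
        "type": item_type,
        "id": item.get("call_id") or item.get("id"),
    }
    if item_type == "function_call":
        record["name"] = item.get("name")
    # Arguments may contain user data, so they are gated on content capture.
    if capture_content and "arguments" in item:
        record["arguments"] = item["arguments"]
    return record


def tool_call_records(output: List[Dict[str, Any]],
                      capture_content: bool) -> List[Dict[str, Any]]:
    """Collect records for every tool-call item in a response output list."""
    records = (tool_call_record(item, capture_content) for item in output)
    return [record for record in records if record is not None]


# Usage with stand-in output items.
example_output = [
    {"type": "message", "id": "msg_1"},
    {"type": "function_call", "call_id": "call_1", "name": "get_weather",
     "arguments": '{"city": "Paris"}'},
    {"type": "web_search_call", "id": "ws_1", "status": "completed"},
]
without_content = tool_call_records(example_output, capture_content=False)
with_content = tool_call_records(example_output, capture_content=True)
```

With capture disabled, the records still carry the tool type, id, and function name, which keeps tool usage visible in traces without exposing argument payloads.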

Proposed telemetry shape

For Responses API model calls:

  • span kind: CLIENT
  • provider/system: OpenAI
  • operation: chat unless the semantic convention adds a more specific Responses operation
  • span name: chat <model> or equivalent existing GenAI naming pattern
  • request attributes: model, instructions, input shape, tools/tool_choice, parallel_tool_calls, max_output_tokens, temperature, top_p, reasoning config, service tier, previous_response_id, conversation/background/store indicators when present
  • response attributes: response id, response model, status, finish reasons, usage input/output tokens, service tier, and relevant tool-call metadata
  • metrics: operation duration and token usage with the same common dimensions as existing OpenAI v2 instrumentation
  • events/content: input/output messages, tool call requests, tool call responses, reasoning/text parts, and multimodal references only according to the configured content-capture mode
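To make the proposed shape concrete, the sketch below derives a span name and a few request attributes from the keyword arguments of a Responses call. The `gen_ai.*` keys reflect the GenAI semantic conventions as best understood here and should be verified against the semconv version LoongSuite targets; `PREVIOUS_RESPONSE_ID_ATTR` is a hypothetical extension name. Content-bearing parameters such as `instructions` and `input` are deliberately left out because they fall under content capture.

```python
from typing import Any, Dict

# Request parameter -> attribute key. Assumed semconv names; verify before use.
_REQUEST_ATTRS = {
    "model": "gen_ai.request.model",
    "temperature": "gen_ai.request.temperature",
    "top_p": "gen_ai.request.top_p",
    "max_output_tokens": "gen_ai.request.max_tokens",
}

# Hypothetical extension attribute name, used only for illustration.
PREVIOUS_RESPONSE_ID_ATTR = "loongsuite.openai.request.previous_response_id"


def span_name(kwargs: Dict[str, Any]) -> str:
    """Follow the existing 'chat <model>' naming pattern."""
    model = kwargs.get("model")
    return f"chat {model}" if model else "chat"


def request_attributes(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Build non-content request attributes from the call's kwargs."""
    attrs: Dict[str, Any] = {
        "gen_ai.operation.name": "chat",
        "gen_ai.provider.name": "openai",
    }
    for param, key in _REQUEST_ATTRS.items():
        value = kwargs.get(param)
        if value is not None:  # only record parameters the caller set
            attrs[key] = value
    previous_id = kwargs.get("previous_response_id")
    if previous_id is not None:
        attrs[PREVIOUS_RESPONSE_ID_ATTR] = previous_id
    return attrs


# Usage with stand-in kwargs for a responses.create call.
example_kwargs = {
    "model": "gpt-4.1",
    "input": "Hello",
    "temperature": 0.2,
    "max_output_tokens": 256,
    "previous_response_id": "resp_123",
}
example_name = span_name(example_kwargs)
example_request_attrs = request_attributes(example_kwargs)
```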

Test plan

Add focused tests for:

  • sync responses.create
  • async responses.create
  • stream=True
  • responses.stream and async stream helpers
  • raw response parse and context-manager behavior
  • tool calls and built-in tool outputs
  • multimodal input and output mapping
  • reasoning/token detail extraction
  • incomplete/failed/cancelled/error paths
  • content capture on/off
  • unsampled spans
  • metrics for duration and token usage

Existing Chat Completions and Embeddings tests should continue to pass.

Documentation

Update the OpenAI GenAI instrumentation docs to describe:

  • supported OpenAI API surfaces
  • Responses API support and streaming behavior
  • content-capture privacy behavior
  • token detail / extension attribute behavior
  • API surfaces intentionally not mapped to GenAI spans, such as management or CRUD APIs where plain HTTP/client telemetry is more appropriate
