Skip to content

feat(observability): add messages genai span attributes#68

Merged
bzp2010 merged 5 commits into
mainfrom
bzp/feat-o11y-tracelog-messages-api
Apr 29, 2026
Merged

feat(observability): add messages genai span attributes#68
bzp2010 merged 5 commits into
mainfrom
bzp/feat-o11y-tracelog-messages-api

Conversation

@bzp2010
Copy link
Copy Markdown
Collaborator

@bzp2010 bzp2010 commented Apr 29, 2026

Summary by CodeRabbit

  • New Features

    • Enhanced tracing for /v1/messages with parent spans, explicit error typing, and refined streaming behavior (first-token events, streamed usage handling).
    • Streaming collector now assembles message content, images, tool calls, and finish reasons; serializable GenAI input/output and tool-definition payloads are produced.
  • Refactor

    • Centralized span and message attribute helpers into shared trace utilities for consistent telemetry across handlers.
  • Tests

    • Added unit tests validating request/response, chunk, and full-stream span-attribute generation.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 29, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

Centralizes message-view and span-property helpers under proxy::utils::trace, updates chat-completions code to re-export and use those shared helpers, and adds Anthropic messages span-attribute support including a streaming output collector and tests.

Changes

Cohort / File(s) Summary
Shared Trace Utilities
src/proxy/utils/mod.rs, src/proxy/utils/trace/mod.rs, src/proxy/utils/trace/span_attributes.rs, src/proxy/utils/trace/span_message_attributes.rs
Add centralized trace helpers: finish-reason and usage property builders (span_attributes.rs) and message/tool view models plus JSON/span-property serialization (span_message_attributes.rs). Expose under proxy::utils::trace.
Chat Completions Span Attributes
src/proxy/handlers/chat_completions/span_attributes/message_attributes.rs, src/proxy/handlers/chat_completions/span_attributes/mod.rs, src/proxy/handlers/chat_completions/span_attributes/stream_output.rs, src/proxy/handlers/chat_completions/span_attributes/telemetry.rs
Remove local message-view/type definitions and JSON/property builders; re-export and delegate to shared span_message_attributes/span_attributes. Switch JSON serialization to shared helper and update re-exports to new utils.
Anthropic Messages Span Attributes
src/proxy/handlers/messages/span_attributes/message_attributes.rs, src/proxy/handlers/messages/span_attributes/mod.rs, src/proxy/handlers/messages/span_attributes/stream_output.rs, src/proxy/handlers/messages/span_attributes/telemetry.rs, src/proxy/handlers/messages/span_attributes/tests.rs
Add Anthropic-specific span-attribute modules: convert Anthropic request/response/stream events to shared Message/Output views, implement StreamOutputCollector, tool-definition JSON and OpenInference-style span properties, telemetry builders, and unit tests.
Messages Handler Core Logic
src/proxy/handlers/messages/mod.rs
Refactor handler to create a local tracing span (aisix.llm.messages), apply structured request/response properties, propagate span via WithSpan, integrate streaming with StreamOutputCollector, defer usage observer spawn, and set explicit error-type span annotations for gateway/timeouts/stream errors.
Module & Exports Wiring
src/proxy/mod.rs, src/proxy/utils/mod.rs, src/proxy/utils/trace/mod.rs, src/proxy/handlers/messages/span_attributes/mod.rs, src/proxy/handlers/chat_completions/span_attributes/mod.rs
Add utils subtree, export trace modules, and adjust pub(super) re-exports to reference the new centralized helpers and message-view types.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Handler as Messages Handler
    participant Span as Span (fastrace)
    participant Gateway
    participant Collector as StreamOutputCollector

    Client->>Handler: POST /v1/messages
    Handler->>Span: create & enter span "aisix.llm.messages"
    Handler->>Span: apply request span properties (span_attributes)

    alt Non-streaming response
        Handler->>Gateway: call gateway
        Gateway-->>Handler: AnthropicMessagesResponse
        Handler->>Span: build output views -> apply response & usage properties
        Handler-->>Client: HTTP response
    else Streaming response
        Handler->>Gateway: stream gateway
        Gateway-->>Handler: stream events
        loop For each event
            Handler->>Collector: record event (MessageStart/ContentBlockDelta/MessageDelta)
            Handler->>Span: apply chunk span properties
        end
        alt Stream completed
            Handler->>Collector: build output_message_views
            Handler->>Span: apply message-level properties
            Handler->>Span: apply usage properties (sync or spawn observer)
        else Stream error
            Handler->>Span: set error.type=stream_error
            Handler->>Span: flush collector properties
        end
        Handler-->>Client: SSE stream
    end

    Span->>Span: exit & record properties
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
Security Check ❌ Error Pull request captures sensitive user data (prompts, IDs, tool definitions) in span attributes that will be logged in observability systems, exposing PII and confidential information. Implement message content redaction by excluding text from span properties, using redaction layer, or adding configuration flag to disable detailed message logging in production.
E2e Test Quality Review ⚠️ Warning Test suite contains only unit tests for span property generation, not E2E tests of the handler function and streaming flow. Add E2E/integration tests verifying the messages() handler invocation, span creation/attachment, streaming event processing, error scenarios (gateway_error, timeout, stream_error), and unfold state transitions.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding observability/tracing span attributes (genai) for the messages API handler, which is reflected across multiple files establishing span attribute generation, telemetry, and streaming output collection for Anthropic messages.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch bzp/feat-o11y-tracelog-messages-api

Review rate limit: 3/5 reviews remaining, refill in 17 minutes and 45 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/proxy/handlers/messages/mod.rs`:
- Around line 215-221: The stream-error branch currently hands off usage_rx to
spawn_stream_usage_observer and drops the span, losing usage attributes if usage
is already ready; modify the Some(Err(err)) arm to mirror the EOF fast-path by
calling try_recv() on the usage_rx (the same receiver variable), and if it
yields Some(usage) attach span.add_property(|| usage_span_properties(&usage))
(or add_properties with usage_span_properties()) and also call
span.add_properties(|| output_collector.output_message_span_properties()) before
deciding whether to spawn_stream_usage_observer(request_ctx.clone(), usage_rx) —
only move the receiver into spawn_stream_usage_observer when try_recv() returns
None. Ensure you reference the same usage_rx, span, spawn_stream_usage_observer,
usage_span_properties(), and output_collector.output_message_span_properties()
symbols.
- Around line 190-195: The first-token logic currently fires on idx == 0 which
triggers on the initial SSE "message_start" event; change the condition so
record_first_token_latency(&mut request_ctx).await and the TraceEvent "first
token arrived" are only executed for the first content-bearing SSE event (i.e.,
when idx == 0 AND the event contains actual output text/content rather than a
"message_start" control event). Locate the block around idx == 0 and update the
guard to check the event payload/type (skip events labeled "message_start" or
missing content) before calling hooks::observability::record_first_token_latency
and adding the TraceEvent.

In `@src/proxy/handlers/messages/span_attributes/message_attributes.rs`:
- Around line 81-86: The function image_source_to_data_url currently returns
full data URIs for ImageSource entries with type "base64", which can leak large
payloads into spans; change image_source_to_data_url (and any callers expecting
the full string) to avoid embedding raw base64 bodies and instead return
metadata only — e.g., include source.media_type plus either the byte length or a
short checksum/hex hash of source.data (or a sentinel like
"inline-base64-omitted") so callers (gen_ai.* message builders / OpenInference
fields) log media_type and size/hash but never the full payload. Ensure the
function still handles non-base64 sources by returning source.data unchanged and
update any downstream uses that assumed the full data URI to consume the
metadata token instead.

In `@src/proxy/handlers/messages/span_attributes/tests.rs`:
- Around line 1-2: Add the pretty_assertions assert_eq import to the test module
so failures show diffs: in src/proxy/handlers/messages/span_attributes/tests.rs
add `use pretty_assertions::assert_eq;` alongside the existing `use
serde_json::{Value, json};` import (or replace any local shadowed assert_eq
imports), then run the tests to ensure all existing `assert_eq!` calls in this
module now use pretty_assertions' assert_eq for clearer output.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 760da4f6-f870-4eab-a272-ffbe58cbf633

📥 Commits

Reviewing files that changed from the base of the PR and between ba4259e and 370d53d.

📒 Files selected for processing (15)
  • src/proxy/handlers/chat_completions/span_attributes/message_attributes.rs
  • src/proxy/handlers/chat_completions/span_attributes/mod.rs
  • src/proxy/handlers/chat_completions/span_attributes/stream_output.rs
  • src/proxy/handlers/chat_completions/span_attributes/telemetry.rs
  • src/proxy/handlers/messages/mod.rs
  • src/proxy/handlers/messages/span_attributes/message_attributes.rs
  • src/proxy/handlers/messages/span_attributes/mod.rs
  • src/proxy/handlers/messages/span_attributes/stream_output.rs
  • src/proxy/handlers/messages/span_attributes/telemetry.rs
  • src/proxy/handlers/messages/span_attributes/tests.rs
  • src/proxy/mod.rs
  • src/proxy/utils/mod.rs
  • src/proxy/utils/trace/mod.rs
  • src/proxy/utils/trace/span_attributes.rs
  • src/proxy/utils/trace/span_message_attributes.rs

Comment thread src/proxy/handlers/messages/mod.rs Outdated
Comment on lines 215 to +221
Some(Err(err)) => {
error!("Gateway stream error: {}", err);
span.add_property(|| ("error.type", "stream_error"));
span.add_properties(|| output_collector.output_message_span_properties());
if let Some(usage_rx) = usage_rx.take() {
spawn_stream_usage_observer(request_ctx.clone(), usage_rx);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Mirror the EOF usage fast-path on stream errors.

The EOF branch does a try_recv() and attaches usage_span_properties(), but this branch immediately hands the receiver to spawn_stream_usage_observer() and then drops the span. If usage is already available, failed streams still lose all usage attributes in tracing. Reuse the same try_recv() logic here before giving up the span.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/proxy/handlers/messages/mod.rs` around lines 215 - 221, The stream-error
branch currently hands off usage_rx to spawn_stream_usage_observer and drops the
span, losing usage attributes if usage is already ready; modify the
Some(Err(err)) arm to mirror the EOF fast-path by calling try_recv() on the
usage_rx (the same receiver variable), and if it yields Some(usage) attach
span.add_property(|| usage_span_properties(&usage)) (or add_properties with
usage_span_properties()) and also call span.add_properties(||
output_collector.output_message_span_properties()) before deciding whether to
spawn_stream_usage_observer(request_ctx.clone(), usage_rx) — only move the
receiver into spawn_stream_usage_observer when try_recv() returns None. Ensure
you reference the same usage_rx, span, spawn_stream_usage_observer,
usage_span_properties(), and output_collector.output_message_span_properties()
symbols.

Comment on lines +81 to +86
pub(super) fn image_source_to_data_url(source: &ImageSource) -> String {
if source.r#type == "base64" {
format!("data:{};base64,{}", source.media_type, source.data)
} else {
source.data.clone()
}
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid tracing raw base64 image payloads.

For base64 sources this builds a full data: URI, and that value is later copied into gen_ai.*.messages and OpenInference message content fields. A single inline image can blow span-size limits and leak the full image payload into telemetry. Record metadata only (for example media type, size, or a hash) or skip inline-image bodies entirely.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/proxy/handlers/messages/span_attributes/message_attributes.rs` around
lines 81 - 86, The function image_source_to_data_url currently returns full data
URIs for ImageSource entries with type "base64", which can leak large payloads
into spans; change image_source_to_data_url (and any callers expecting the full
string) to avoid embedding raw base64 bodies and instead return metadata only —
e.g., include source.media_type plus either the byte length or a short
checksum/hex hash of source.data (or a sentinel like "inline-base64-omitted") so
callers (gen_ai.* message builders / OpenInference fields) log media_type and
size/hash but never the full payload. Ensure the function still handles
non-base64 sources by returning source.data unchanged and update any downstream
uses that assumed the full data URI to consume the metadata token instead.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create an issue for this.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

gh issue create \
  --repo api7/aisix \
  --title "fix(observability): avoid embedding raw base64 image payloads in span attributes" \
  --body "## Problem

In \`src/proxy/handlers/messages/span_attributes/message_attributes.rs\`, the function \`image_source_to_data_url\` currently constructs a full \`data:<media_type>;base64,<data>\` URI for \`ImageSource\` entries with type \`\"base64\"\`. This value is propagated into \`gen_ai.*\` message JSON and OpenInference message content span properties.

This has two risks:
1. **Span size limits** — A single inline image can easily exceed span size limits in observability backends (e.g., Jaeger, OTLP collectors), causing spans to be dropped or truncated.
2. **Data leakage** — Full image payloads are recorded verbatim into telemetry, which may not be appropriate.

## Expected Behaviour

\`image_source_to_data_url\` (and any callers) should record metadata only — e.g., \`media_type\` plus either the byte length or a short checksum/hash of \`source.data\`, or a sentinel like \`\"inline-base64-omitted\"\`. Non-base64 sources (\`url\` type) should continue to return \`source.data\` unchanged. Downstream consumers (\`content_parts_from_blocks\`, stream output collector, etc.) should be updated to handle the metadata token.

## References

- PR: https://github.com/api7/aisix/pull/68
- Review comment: https://github.com/api7/aisix/pull/68#discussion_r3159393407

/cc `@bzp2010`" \
  --assignee bzp2010

Length of output: 40


@bzp2010, I've created the follow-up issue: #69

It tracks the concern about image_source_to_data_url embedding full base64 data URIs into span attributes, covering both the span-size and data-leakage risks, and is assigned to you.


🧠 Learnings used
Learnt from: bzp2010
Repo: api7/aisix PR: 34
File: src/proxy/hooks2/authorization/mod.rs:1-1
Timestamp: 2026-04-10T09:35:05.745Z
Learning: In Rust code using the `anyhow` crate, treat `use anyhow::Result;` as potentially used even if the code uses non-default error types like `Result<T, SomeOtherError>` (i.e., `Result<T, E>` where `E` is not `anyhow::Error`). Because `anyhow::Result` is an alias for `std::result::Result<T, E=anyhow::Error>`, the import is referenced by the `Result<..., ...>` syntax. Do not flag `use anyhow::Result;` as unused solely due to the presence of `Result<T, SomeOtherError>` usages.

Learnt from: AlinsRan
Repo: api7/aisix PR: 46
File: src/lib.rs:122-126
Timestamp: 2026-04-17T09:32:54.079Z
Learning: Avoid adding `#[fastrace::trace]` to top-level process-lifetime entrypoints in `src/lib.rs`—specifically `run()`, `run_with_config()`, and `run_with_provider()`. These functions cover the entire server lifetime (startup → shutdown), while Fastrace spans should be used for request-scoped boundaries (e.g., HTTP/RPC handler functions) rather than global process entrypoints.

Comment thread src/proxy/handlers/messages/span_attributes/tests.rs
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/proxy/handlers/messages/mod.rs (1)

215-233: ⚠️ Potential issue | 🟡 Minor

Mirror the EOF try_recv() pattern for stream errors.

The error branch spawns the usage observer unconditionally, but the EOF branch (lines 238-264) first attempts try_recv() to capture usage synchronously before the span is dropped. If usage is already available when an error occurs, it should be attached to the span immediately rather than lost.

🔧 Suggested fix
 Some(Err(err)) => {
     error!("Gateway stream error: {}", err);
     span.add_property(|| ("error.type", "stream_error"));
     span.add_properties(|| output_collector.output_message_span_properties());
-    if let Some(usage_rx) = usage_rx.take() {
-        spawn_stream_usage_observer(request_ctx.clone(), usage_rx);
+    if let Some(mut usage_rx) = usage_rx.take() {
+        match usage_rx.try_recv() {
+            Ok(usage) => {
+                span.add_properties(|| usage_span_properties(&usage));
+            }
+            Err(TryRecvError::Empty) => {
+                spawn_stream_usage_observer(request_ctx.clone(), usage_rx);
+            }
+            Err(TryRecvError::Closed) => {
+                error!("Failed to receive streaming usage from gateway: channel closed");
+            }
+        }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/proxy/handlers/messages/mod.rs` around lines 215 - 233, The stream-error
branch currently unconditionally spawns the usage observer and drops the span
without first attempting to synchronously capture outstanding usage like the EOF
branch does; update the Some(Err(err)) arm to mirror the EOF try_recv() pattern:
if let Some(mut usage_rx) = usage_rx.take() then first attempt to capture usage
synchronously via try_recv() (attach any returned usage to the span using
span.add_property/add_properties and
output_collector.output_message_span_properties()), and only call
spawn_stream_usage_observer(request_ctx.clone(), usage_rx) if try_recv() did not
yield usage; keep the existing error event creation
(anthropic_error_sse_event(err.to_string())) and other returned tuple elements
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/proxy/handlers/messages/mod.rs`:
- Around line 215-233: The stream-error branch currently unconditionally spawns
the usage observer and drops the span without first attempting to synchronously
capture outstanding usage like the EOF branch does; update the Some(Err(err))
arm to mirror the EOF try_recv() pattern: if let Some(mut usage_rx) =
usage_rx.take() then first attempt to capture usage synchronously via try_recv()
(attach any returned usage to the span using span.add_property/add_properties
and output_collector.output_message_span_properties()), and only call
spawn_stream_usage_observer(request_ctx.clone(), usage_rx) if try_recv() did not
yield usage; keep the existing error event creation
(anthropic_error_sse_event(err.to_string())) and other returned tuple elements
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f89fd655-3cb4-41f0-9f3b-d6d4e297acce

📥 Commits

Reviewing files that changed from the base of the PR and between 370d53d and 8b5ee93.

📒 Files selected for processing (2)
  • src/proxy/handlers/messages/mod.rs
  • src/proxy/handlers/messages/span_attributes/tests.rs

@bzp2010 bzp2010 merged commit 1f17b33 into main Apr 29, 2026
2 checks passed
@bzp2010 bzp2010 deleted the bzp/feat-o11y-tracelog-messages-api branch April 29, 2026 11:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant