
Decompose response_speed into response_speed_with_tool_calls and response_speed_no_tool_calls#57

Draft
fanny-riols wants to merge 6 commits into main from pr/fr/response_speed_decomposition

Conversation


@fanny-riols fanny-riols commented Apr 14, 2026

Summary

Splits response_speed into two filtered variants so latency can be compared between turns that required a tool call and turns that didn't:

  • response_speed_with_tool_calls — mean latency (user utterance end → assistant response start) for turns where the assistant made at least one tool call
  • response_speed_no_tool_calls — same, but restricted to turns with no tool calls

Both metrics are registered as diagnostic / exclude_from_pass_at_k. They use per_turn_latency from the turn_taking metric (read from metrics/turn_taking/details/per_turn_latency in the record's metrics.json), which gives a direct turn_id → latency mapping. Each turn is then checked against conversation_trace for tool_call entries on that turn_id. The base response_speed metric is unchanged (still uses Pipecat's UserBotLatencyObserver).
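The turn-splitting step described above can be sketched as follows. This is a hypothetical standalone sketch, not the actual metric classes: the file name `conversation_trace.json`, the exact JSON nesting inside `metrics.json`, and the trace-entry fields (`type`, `turn_id`) are assumptions based on the description.

```python
import json
from pathlib import Path


def split_latencies_by_tool_call(record_dir: str) -> tuple[list[float], list[float]]:
    """Split per-turn latencies by whether the turn contained a tool call.

    Reads the turn_id -> latency mapping produced by the turn_taking metric
    and partitions it using tool_call entries from the conversation trace.
    """
    record = Path(record_dir)
    metrics = json.loads((record / "metrics.json").read_text())
    # Path metrics/turn_taking/details/per_turn_latency; exact nesting is assumed.
    per_turn = metrics["metrics"]["turn_taking"]["details"]["per_turn_latency"]

    trace = json.loads((record / "conversation_trace.json").read_text())
    # turn_ids with at least one tool_call entry
    tool_turns = {e["turn_id"] for e in trace if e.get("type") == "tool_call"}

    with_tool = [lat for turn_id, lat in per_turn.items() if turn_id in tool_turns]
    no_tool = [lat for turn_id, lat in per_turn.items() if turn_id not in tool_turns]
    return with_tool, no_tool
```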

Example results across 150 records:

| Metric | Mean | Min | Max |
| --- | --- | --- | --- |
| response_speed | 10.67 s | 4.74 s | 14.65 s |
| response_speed_with_tool_calls | 11.67 s | 5.32 s | 16.37 s |
| response_speed_no_tool_calls | 8.51 s | 3.40 s | 12.38 s |

Tool-call turns were ~3.2 s slower on average in this example.

Also included

  • apply_env_overrides: deployments with redacted secrets that aren't in the current EVA_MODEL_LIST now warn-and-skip instead of raising, as long as they aren't the active LLM. This allows metrics-only reruns in environments that don't have every deployment from the original run configured.
  • _build_history: added _resolve_path() so pipecat_logs.jsonl / elevenlabs_events.jsonl fall back to output_dir/<filename> when the stored path no longer exists — fixes metric reruns after a run directory is moved.
  • Analysis app: both new metrics added to _NON_NORMALIZED_METRICS so they render as standalone seconds bar charts.

…etrics

Splits the existing response_speed diagnostic metric into two filtered
variants based on whether the assistant made a tool call in the turn.
Parses conversation_trace to map each latency to its turn and checks
for tool_call entries on that turn_id.

Shared logic (sanity filtering, mean/max, MetricScore construction) is
extracted into a _ResponseSpeedBase class; each variant only implements
_get_latencies(). Bumps metrics_version to 0.1.2.
…DEL_LIST

When restoring redacted secrets in apply_env_overrides, skip deployments
that are not present in the current environment's EVA_MODEL_LIST rather
than raising a ValueError. Only raise if the missing deployment is the
active LLM for this run. This allows metrics-only reruns in environments
that don't have every deployment from the original run configured.
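The skip rule described above reduces to a small decision, sketched here with hypothetical names (`restore_redacted_secret` and its parameters are illustrations, not the real `apply_env_overrides` signature):

```python
import logging

logger = logging.getLogger(__name__)


def restore_redacted_secret(deployment: str, eva_model_list: set[str], active_llm: str) -> bool:
    """Return True when the deployment's secret should be restored.

    Raises only if the missing deployment is the active LLM for this run;
    otherwise warns and skips so metrics-only reruns can proceed.
    """
    if deployment in eva_model_list:
        return True
    if deployment == active_llm:
        # The run cannot proceed without its active LLM configured.
        raise ValueError(f"Active LLM deployment {deployment!r} not in EVA_MODEL_LIST")
    logger.warning("Skipping deployment %r: not in current EVA_MODEL_LIST", deployment)
    return False
```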
Adds _resolve_path() helper that returns the stored path if it exists on
disk, otherwise falls back to output_dir/<filename>. Used in _build_history
for pipecat_logs.jsonl and elevenlabs_events.jsonl so that metric reruns
work correctly when a run directory has been moved from its original location.
…in analysis app

Adds both new metrics to _NON_NORMALIZED_METRICS so they are rendered as
standalone seconds bar charts alongside response_speed. Category grouping,
color, and table sorting are handled dynamically via the metric registry.
…onse speed metrics

The filtered variants now read metrics/turn_taking/details/per_turn_latency
from the record's metrics.json instead of using context.response_speed_latencies.
This gives a direct turn_id → latency mapping, avoiding the index-based
alignment that was previously needed to correlate latencies with tool calls.

The base response_speed metric is unchanged (still uses UserBotLatencyObserver).
…NoToolCallsMetric

Tests cover: missing output_dir, missing metrics.json, missing turn_taking
data, no tool-call turns, all tool-call turns, mixed turns (correct split),
invalid latency filtering, and an exhaustiveness check that with_tool +
no_tool latencies together equal the full per_turn_latency set.
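The exhaustiveness property in that last test can be stated as a multiset equality; a sketch of the check (function name is illustrative, not from the test suite):

```python
from collections import Counter


def latencies_are_exhaustive(per_turn_latency: dict, with_tool: list, no_tool: list) -> bool:
    """True when the with-tool and no-tool latency multisets together
    exactly reconstruct the full per_turn_latency set."""
    return Counter(with_tool) + Counter(no_tool) == Counter(per_turn_latency.values())
```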


@register_metric
class ResponseSpeedWithToolCallsMetric(_ResponseSpeedBase):
Do we need these as new metrics? In my mind, they should be sub-fields computed by the response_speed metric, as we will do for turn-taking. Our number of metrics will quickly explode otherwise?
