
Decompose response_speed into response_speed_with_tool_calls and response_speed_no_tool_calls#57

Draft
fanny-riols wants to merge 6 commits into main from pr/fr/response_speed_decomposition

Conversation


@fanny-riols fanny-riols commented Apr 14, 2026

Summary

Splits response_speed into two filtered variants so latency can be compared between turns that required a tool call and turns that didn't:

  • response_speed_with_tool_calls — mean latency (user utterance end → assistant response start) for turns where the assistant made at least one tool call
  • response_speed_no_tool_calls — same, but restricted to turns with no tool calls

Both metrics are registered as diagnostic / exclude_from_pass_at_k. They use per_turn_latency from the turn_taking metric (read from metrics/turn_taking/details/per_turn_latency in the record's metrics.json), which gives a direct turn_id → latency mapping. Each turn is then checked against conversation_trace for tool_call entries on that turn_id. The base response_speed metric is unchanged (still uses Pipecat's UserBotLatencyObserver).
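The turn-splitting step described above can be sketched as follows. This is a hypothetical standalone sketch, not the actual metric classes: the file name `conversation_trace.json`, the exact JSON nesting inside `metrics.json`, and the trace-entry fields (`type`, `turn_id`) are assumptions based on the description.

```python
import json
from pathlib import Path


def split_latencies_by_tool_call(record_dir: str) -> tuple[list[float], list[float]]:
    """Split per-turn latencies by whether the turn contained a tool call.

    Reads the turn_id -> latency mapping produced by the turn_taking metric
    and partitions it using tool_call entries from the conversation trace.
    """
    record = Path(record_dir)
    metrics = json.loads((record / "metrics.json").read_text())
    # Path metrics/turn_taking/details/per_turn_latency; exact nesting is assumed.
    per_turn = metrics["metrics"]["turn_taking"]["details"]["per_turn_latency"]

    trace = json.loads((record / "conversation_trace.json").read_text())
    # turn_ids with at least one tool_call entry
    tool_turns = {e["turn_id"] for e in trace if e.get("type") == "tool_call"}

    with_tool = [lat for turn_id, lat in per_turn.items() if turn_id in tool_turns]
    no_tool = [lat for turn_id, lat in per_turn.items() if turn_id not in tool_turns]
    return with_tool, no_tool
```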

Example results across 150 records:

| Metric | Mean | Min | Max |
| --- | --- | --- | --- |
| response_speed | 10.67 s | 4.74 s | 14.65 s |
| response_speed_with_tool_calls | 11.67 s | 5.32 s | 16.37 s |
| response_speed_no_tool_calls | 8.51 s | 3.40 s | 12.38 s |

Tool-call turns were ~3.2 s slower on average in this example.

Also included

  • apply_env_overrides: deployments with redacted secrets that aren't in the current EVA_MODEL_LIST now warn-and-skip instead of raising, as long as they aren't the active LLM. This allows metrics-only reruns in environments that don't have every deployment from the original run configured.
  • _build_history: added _resolve_path() so pipecat_logs.jsonl / elevenlabs_events.jsonl fall back to output_dir/<filename> when the stored path no longer exists — fixes metric reruns after a run directory is moved.
  • Analysis app: both new metrics added to _NON_NORMALIZED_METRICS so they render as standalone seconds bar charts.

…etrics

Splits the existing response_speed diagnostic metric into two filtered
variants based on whether the assistant made a tool call in the turn.
Parses conversation_trace to map each latency to its turn and checks
for tool_call entries on that turn_id.

Shared logic (sanity filtering, mean/max, MetricScore construction) is
extracted into a _ResponseSpeedBase class; each variant only implements
_get_latencies(). Bumps metrics_version to 0.1.2.
…DEL_LIST

When restoring redacted secrets in apply_env_overrides, skip deployments
that are not present in the current environment's EVA_MODEL_LIST rather
than raising a ValueError. Only raise if the missing deployment is the
active LLM for this run. This allows metrics-only reruns in environments
that don't have every deployment from the original run configured.
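The skip rule described above reduces to a small decision, sketched here with hypothetical names (`restore_redacted_secret` and its parameters are illustrations, not the real `apply_env_overrides` signature):

```python
import logging

logger = logging.getLogger(__name__)


def restore_redacted_secret(deployment: str, eva_model_list: set[str], active_llm: str) -> bool:
    """Return True when the deployment's secret should be restored.

    Raises only if the missing deployment is the active LLM for this run;
    otherwise warns and skips so metrics-only reruns can proceed.
    """
    if deployment in eva_model_list:
        return True
    if deployment == active_llm:
        # The run cannot proceed without its active LLM configured.
        raise ValueError(f"Active LLM deployment {deployment!r} not in EVA_MODEL_LIST")
    logger.warning("Skipping deployment %r: not in current EVA_MODEL_LIST", deployment)
    return False
```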
Adds _resolve_path() helper that returns the stored path if it exists on
disk, otherwise falls back to output_dir/<filename>. Used in _build_history
for pipecat_logs.jsonl and elevenlabs_events.jsonl so that metric reruns
work correctly when a run directory has been moved from its original location.
…in analysis app

Adds both new metrics to _NON_NORMALIZED_METRICS so they are rendered as
standalone seconds bar charts alongside response_speed. Category grouping,
color, and table sorting are handled dynamically via the metric registry.
…onse speed metrics

The filtered variants now read metrics/turn_taking/details/per_turn_latency
from the record's metrics.json instead of using context.response_speed_latencies.
This gives a direct turn_id → latency mapping, avoiding the index-based
alignment that was previously needed to correlate latencies with tool calls.

The base response_speed metric is unchanged (still uses UserBotLatencyObserver).
…NoToolCallsMetric

Tests cover: missing output_dir, missing metrics.json, missing turn_taking
data, no tool-call turns, all tool-call turns, mixed turns (correct split),
invalid latency filtering, and an exhaustiveness check that with_tool +
no_tool latencies together equal the full per_turn_latency set.
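The exhaustiveness property in that last test can be stated as a multiset equality; a sketch of the check (function name is illustrative, not from the test suite):

```python
from collections import Counter


def latencies_are_exhaustive(per_turn_latency: dict, with_tool: list, no_tool: list) -> bool:
    """True when the with-tool and no-tool latency multisets together
    exactly reconstruct the full per_turn_latency set."""
    return Counter(with_tool) + Counter(no_tool) == Counter(per_turn_latency.values())
```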


@register_metric
class ResponseSpeedWithToolCallsMetric(_ResponseSpeedBase):
Do we need these as new metrics? In my mind, they should be sub-fields computed by the response_speed metric, as we will do for turn-taking. Our number of metrics will quickly explode otherwise?
