Decompose response_speed into response_speed_with_tool_calls and response_speed_no_tool_calls #57

Draft

fanny-riols wants to merge 6 commits into main
Conversation
…etrics Splits the existing response_speed diagnostic metric into two filtered variants based on whether the assistant made a tool call in the turn. Parses conversation_trace to map each latency to its turn and checks for tool_call entries on that turn_id. Shared logic (sanity filtering, mean/max, MetricScore construction) is extracted into a _ResponseSpeedBase class; each variant only implements _get_latencies(). Bumps metrics_version to 0.1.2.
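A minimal sketch of the shared-base pattern this commit describes. Only `_ResponseSpeedBase` and `_get_latencies()` are named in the PR; the context fields, the sanity threshold, and the dict return value (standing in for `MetricScore`) are hypothetical:

```python
from statistics import mean


class _ResponseSpeedBase:
    """Shared sanity filtering and mean/max aggregation; subclasses pick the latencies."""

    MAX_SANE_LATENCY_S = 60.0  # hypothetical sanity threshold

    def _get_latencies(self, context):
        raise NotImplementedError

    def compute(self, context):
        # Keep only plausible positive latencies.
        latencies = [
            lat for lat in self._get_latencies(context)
            if 0 < lat < self.MAX_SANE_LATENCY_S
        ]
        if not latencies:
            return None  # stand-in for an "unavailable" MetricScore
        return {"mean": mean(latencies), "max": max(latencies)}


class ResponseSpeedWithToolCallsMetric(_ResponseSpeedBase):
    def _get_latencies(self, context):
        # context.latencies_by_turn: {turn_id: latency}; tool_call_turns: set of turn_ids
        return [
            lat for turn_id, lat in context.latencies_by_turn.items()
            if turn_id in context.tool_call_turns
        ]


class ResponseSpeedNoToolCallsMetric(_ResponseSpeedBase):
    def _get_latencies(self, context):
        return [
            lat for turn_id, lat in context.latencies_by_turn.items()
            if turn_id not in context.tool_call_turns
        ]
```

The point of the base class is that each variant stays a one-method subclass, so the sanity filtering and aggregation cannot drift between the two.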
…DEL_LIST When restoring redacted secrets in apply_env_overrides, skip deployments that are not present in the current environment's EVA_MODEL_LIST rather than raising a ValueError. Only raise if the missing deployment is the active LLM for this run. This allows metrics-only reruns in environments that don't have every deployment from the original run configured.
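A sketch of the warn-and-skip behavior described above, under assumed shapes for the inputs (the real `apply_env_overrides` signature and config layout are not shown in this PR):

```python
import logging

logger = logging.getLogger(__name__)


def apply_env_overrides(deployments, eva_model_list, active_llm):
    """Restore redacted secrets from the environment's EVA_MODEL_LIST.

    Deployments missing from the current EVA_MODEL_LIST are skipped with a
    warning instead of raising -- unless the missing deployment is the active
    LLM for this run, in which case the rerun cannot proceed.
    """
    restored = {}
    for name, config in deployments.items():
        if name not in eva_model_list:
            if name == active_llm:
                raise ValueError(
                    f"Active LLM deployment {name!r} not in EVA_MODEL_LIST"
                )
            logger.warning("Skipping deployment %r: not in EVA_MODEL_LIST", name)
            continue
        restored[name] = {**config, "api_key": eva_model_list[name]["api_key"]}
    return restored
```

This is the property that enables metrics-only reruns: only the deployment actually used for inference needs to be configured.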
Adds _resolve_path() helper that returns the stored path if it exists on disk, otherwise falls back to output_dir/<filename>. Used in _build_history for pipecat_logs.jsonl and elevenlabs_events.jsonl so that metric reruns work correctly when a run directory has been moved from its original location.
…in analysis app Adds both new metrics to _NON_NORMALIZED_METRICS so they are rendered as standalone seconds bar charts alongside response_speed. Category grouping, color, and table sorting are handled dynamically via the metric registry.
…onse speed metrics The filtered variants now read metrics/turn_taking/details/per_turn_latency from the record's metrics.json instead of using context.response_speed_latencies. This gives a direct turn_id → latency mapping, avoiding the index-based alignment that was previously needed to correlate latencies with tool calls. The base response_speed metric is unchanged (still uses UserBotLatencyObserver).
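The read-and-split described above might look like the following sketch; the `metrics.json` nesting matches the path quoted in the commit, but the helper name and the `conversation_trace` entry fields (`turn_id`, `type`) are assumptions:

```python
import json
from pathlib import Path


def split_latencies_by_tool_use(output_dir: Path, conversation_trace: list[dict]):
    """Split per-turn latencies by whether the turn involved a tool call.

    Reads metrics/turn_taking/details/per_turn_latency from the record's
    metrics.json -- a direct {turn_id: latency} mapping -- then checks
    conversation_trace for tool_call entries on each turn_id. No index-based
    alignment is needed.
    """
    metrics = json.loads((output_dir / "metrics.json").read_text())
    per_turn = metrics["metrics"]["turn_taking"]["details"]["per_turn_latency"]

    tool_call_turns = {
        entry["turn_id"]
        for entry in conversation_trace
        if entry.get("type") == "tool_call"
    }
    with_tool = {t: lat for t, lat in per_turn.items() if t in tool_call_turns}
    no_tool = {t: lat for t, lat in per_turn.items() if t not in tool_call_turns}
    return with_tool, no_tool
```

Because the split is keyed on `turn_id`, the two result dicts partition `per_turn_latency` exactly, which is what the exhaustiveness test below relies on.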
…NoToolCallsMetric Tests cover: missing output_dir, missing metrics.json, missing turn_taking data, no tool-call turns, all tool-call turns, mixed turns (correct split), invalid latency filtering, and an exhaustiveness check that with_tool + no_tool latencies together equal the full per_turn_latency set.
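The exhaustiveness check mentioned above could be expressed roughly like this (a self-contained sketch with inline splitting, not the actual test file):

```python
def test_with_and_no_tool_latencies_are_exhaustive():
    """with_tool + no_tool latencies together must equal the full per_turn set."""
    per_turn = {"1": 1.2, "2": 4.8, "3": 0.9}  # hypothetical fixture data
    tool_call_turns = {"2"}

    with_tool = {t: lat for t, lat in per_turn.items() if t in tool_call_turns}
    no_tool = {t: lat for t, lat in per_turn.items() if t not in tool_call_turns}

    # Every turn appears in exactly one of the two splits...
    assert with_tool.keys() | no_tool.keys() == per_turn.keys()
    assert not (with_tool.keys() & no_tool.keys())
    # ...and no latency value is altered by the split.
    assert {**with_tool, **no_tool} == per_turn
```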
gabegma reviewed Apr 14, 2026
```python
@register_metric
class ResponseSpeedWithToolCallsMetric(_ResponseSpeedBase):
```
Collaborator
Do we need these as new metrics? In my mind, they should be sub-fields computed by the response_speed metric, as we will do for turn-taking. Our number of metrics will quickly explode otherwise?
Summary
Splits `response_speed` into two filtered variants so latency can be compared between turns that required a tool call and turns that didn't:

- `response_speed_with_tool_calls`: mean latency (user utterance end → assistant response start) for turns where the assistant made at least one tool call
- `response_speed_no_tool_calls`: same, but restricted to turns with no tool calls

Both metrics are registered as diagnostic/`exclude_from_pass_at_k`. They use `per_turn_latency` from the `turn_taking` metric (read from `metrics/turn_taking/details/per_turn_latency` in the record's `metrics.json`), which gives a direct `turn_id → latency` mapping. Each turn is then checked against `conversation_trace` for `tool_call` entries on that `turn_id`. The base `response_speed` metric is unchanged (still uses Pipecat's `UserBotLatencyObserver`).

Example results across 150 records compared `response_speed`, `response_speed_with_tool_calls`, and `response_speed_no_tool_calls`: tool-call turns were ~3.2 s slower on average in this example.
Also included
- `apply_env_overrides`: deployments with redacted secrets that aren't in the current `EVA_MODEL_LIST` now warn-and-skip instead of raising, as long as they aren't the active LLM. This allows metrics-only reruns in environments that don't have every deployment from the original run configured.
- `_build_history`: added `_resolve_path()` so `pipecat_logs.jsonl` / `elevenlabs_events.jsonl` fall back to `output_dir/<filename>` when the stored path no longer exists, fixing metric reruns after a run directory is moved.
- Analysis app: both new metrics added to `_NON_NORMALIZED_METRICS` so they render as standalone seconds bar charts.