
Commit b08e48f

feat(streaming): add ttat (time-to-first-answering-token)
ttft fires on the first content delta of any kind, which for reasoning models means the first reasoning chunk; that chunk arrives quickly even when the user-perceived latency is much longer. ttat fires only on the first user-visible answer token (a text delta or tool-call arguments delta), excluding reasoning chunks. For non-reasoning models the two are equal; for gpt-5-class / o-series models they differ by the reasoning duration. ttat thus pairs with ttft to answer "did the model start thinking quickly?" versus "how long did the user wait for an answer?", two signals that mean different things on reasoning workloads. Implementation: a third bookmark variable (``first_answer_at``) is set inside the same up-front event-type check, restricted to ResponseTextDeltaEvent / ResponseFunctionCallArgumentsDeltaEvent. Adds one new histogram (``agentex.llm.ttat``) with the same labels and units as ttft.
1 parent da85d7b commit b08e48f

2 files changed

Lines changed: 25 additions & 2 deletions


src/agentex/lib/core/observability/llm_metrics.py

Lines changed: 9 additions & 0 deletions
@@ -47,6 +47,15 @@ def __init__(self) -> None:
             unit="ms",
             description="Time from request submission to first content token (ms)",
         )
+        # ttat (time-to-first-answering-token) is distinct from ttft for reasoning
+        # models: ttft fires on the first reasoning chunk (which arrives quickly),
+        # while ttat fires on the first user-visible answer token (text or tool
+        # call). For non-reasoning models the two are equal.
+        self.ttat_ms = meter.create_histogram(
+            name="agentex.llm.ttat",
+            unit="ms",
+            description="Time from request submission to first answering token (text or tool-call delta) — excludes reasoning chunks",
+        )
         # Note: TPS denominator is the model-generation window
         # (last_token_time - first_token_time), not total stream wall time.
         # This isolates raw model throughput from event-loop / tool-call latency.
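The ``meter.create_histogram`` calls in this file come from an OpenTelemetry-style metrics API. As a self-contained sketch of the ttft/ttat pairing (the ``Histogram`` class and the ``{"model": "gpt-5"}`` attributes below are hypothetical stand-ins, not the real instrument or label set):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Histogram:
    # Hypothetical stand-in for an OpenTelemetry histogram instrument.
    name: str
    unit: str
    description: str
    points: List[Tuple[float, Dict[str, str]]] = field(default_factory=list)

    def record(self, value: float, attributes: Dict[str, str]) -> None:
        self.points.append((value, attributes))

class LLMMetrics:
    def __init__(self) -> None:
        self.ttft_ms = Histogram(
            name="agentex.llm.ttft", unit="ms",
            description="Time from request submission to first content token (ms)",
        )
        # ttat uses the same unit and labels as ttft so the two series can be
        # compared directly; on reasoning models ttat - ttft is roughly the
        # reasoning duration.
        self.ttat_ms = Histogram(
            name="agentex.llm.ttat", unit="ms",
            description="Time from request submission to first answering token (ms)",
        )

m = LLMMetrics()
attrs = {"model": "gpt-5"}          # illustrative label set
m.ttft_ms.record(120.0, attrs)      # first reasoning chunk arrived at 120 ms
m.ttat_ms.record(4800.0, attrs)     # first text delta arrived at 4.8 s
```

Because both histograms share units and attributes, a dashboard can plot the ttat/ttft gap per model to see reasoning overhead directly.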

src/agentex/lib/core/temporal/plugins/openai_agents/models/temporal_streaming_model.py

Lines changed: 16 additions & 2 deletions
@@ -653,12 +653,16 @@ async def get_response(
         reasoning_summaries = []
         reasoning_contents = []
         event_count = 0
-        # ttft / tps instrumentation. ``stream_start_perf`` is set above,
-        # before the responses.create() await, so it captures the full
+        # ttft / ttat / tps instrumentation. ``stream_start_perf`` is set
+        # above, before the responses.create() await, so it captures the full
         # request-to-first-token latency. ``first_token_at`` and
         # ``last_token_at`` bracket the model-generation window for tps.
+        # ``first_answer_at`` is set on the first user-visible answer token
+        # (text or tool-call delta) and excludes reasoning chunks, so ttat
+        # measures the latency users actually perceive on reasoning models.
         first_token_at: Optional[float] = None
         last_token_at: Optional[float] = None
+        first_answer_at: Optional[float] = None

         # We expect task_id to always be provided for streaming
         if not task_id:
@@ -686,6 +690,14 @@ async def get_response(
             if first_token_at is None:
                 first_token_at = now_perf
             last_token_at = now_perf
+            # ttat: first user-visible answer token (text or tool call),
+            # excluding reasoning chunks. Equal to ttft for non-reasoning
+            # models; differs by reasoning duration for reasoning models.
+            if first_answer_at is None and isinstance(event, (
+                ResponseTextDeltaEvent,
+                ResponseFunctionCallArgumentsDeltaEvent,
+            )):
+                first_answer_at = now_perf

             # Handle different event types using isinstance for type safety
             if isinstance(event, ResponseOutputItemAddedEvent):
@@ -1027,6 +1039,8 @@ async def get_response(
             m.reasoning_tokens.add(usage.output_tokens_details.reasoning_tokens or 0, metric_attrs)
         if first_token_at is not None:
             m.ttft_ms.record((first_token_at - stream_start_perf) * 1000, metric_attrs)
+        if first_answer_at is not None:
+            m.ttat_ms.record((first_answer_at - stream_start_perf) * 1000, metric_attrs)
         # tps denominator is the generation window (first→last delta), not
         # total stream wall time — see LLMMetrics for rationale. Single-token
         # responses (where first_token_at == last_token_at, e.g. a one-token

0 commit comments
