Skip to content

feat(cachet): structured telemetry with spans, events, and handler API#460

Open
schgoo wants to merge 20 commits into
mainfrom
u/schgoo/coalescemetrics
Open

feat(cachet): structured telemetry with spans, events, and handler API#460
schgoo wants to merge 20 commits into
mainfrom
u/schgoo/coalescemetrics

Conversation

@schgoo
Copy link
Copy Markdown
Collaborator

@schgoo schgoo commented May 28, 2026

Summary

Replaces cachet's event-only telemetry with a span + event model that provides per-tier timing, request correlation, and a callback API for consumers to build custom telemetry pipelines.

Motivation

The previous telemetry emitted standalone tracing events with no correlation between tiers. Consumers couldn't distinguish which tier events belonged to which cache operation, and there was no way to subscribe to structured telemetry without parsing tracing fields via the visitor pattern.

Changes

Tracing spans + events

  • Each public Cache method (get, insert, invalidate, clear, get_or_insert, etc.) creates a parent span via CacheTelemetry
  • Each CacheWrapper tier operation creates a child span, producing a nested trace: cache.get → cache.tier
  • Events are emitted inside spans at the appropriate severity level (debug for hits/misses, info for inserts/expirations, error for failures)
  • Durations are recorded on both span fields (for trace subscribers) and event fields (for log subscribers)

CacheEventHandler callback API

  • New CacheEventHandler trait with on_tier_event and on_operation_complete callbacks
  • Registered via CacheBuilder::event_handler(handler)
  • Receives typed CacheTierEvent and CacheOperationEvent structs — no tracing visitor boilerplate
  • Works independently of the logs feature flag
  • Designed as the semi-stable consumer API that hopefully survives a future migration to emit

Request correlation

  • Each cache operation gets a unique request_id: u64 from a process-wide atomic counter
  • WithRequestId<F> future wrapper restores the request ID into a thread-local on every poll(), surviving task migration across threads/cores (same pattern as tracing::Instrument)
  • Both CacheTierEvent and CacheOperationEvent carry the request_id for grouping

Fallback as a flag

  • cache.fallback is now a boolean flag on tier events, not a separate event type
  • Indicates whether a tier was consulted as a fallback

Removed

  • telemetry/ext.rs (ClockExt, Timed, TimedResult) — replaced by clock.stopwatch() directly
  • EVENT_FALLBACK and EVENT_REQUEST_MERGED attribute constants
  • CacheTelemetryInner — span creation moved into CacheTelemetry directly

New attributes

  • FIELD_COALESCED — boolean flag for stampede protection
  • FIELD_FALLBACK — boolean flag for fallback tier consultation

Examples

  • telemetry_subscriber — shows span + event output with tracing_subscriber::fmt
  • telemetry_accumulator — demonstrates accumulating tier events into a single summary per operation using CacheEventHandler + DashMap, mirroring a TVS-style consumer pattern

Performance

Benchmarked in release mode (MockCache get, single tier):

Configuration Time
No telemetry 471ns
Telemetry enabled, no subscriber 481ns
With stampede protection 1,122ns

Telemetry with no active subscriber adds ~10ns overhead. The WithRequestId wrapper adds ~300ns on the stampede protection path.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 28, 2026

⚠️ Breaking Changes Detected


--- failure pub_module_level_const_missing: pub module-level const is missing ---

Description:
A public const is missing or renamed
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/pub_module_level_const_missing.ron

Failed in:
  EVENT_FALLBACK in file /home/runner/work/oxidizer/oxidizer/target/semver-checks/git-origin_main/b533b2fe1eaa40943e12e254ea1ee2946e5ff617/crates/cachet/src/telemetry/attributes.rs:59

If the breaking changes are intentional then everything is fine - this message is merely informative.

Remember to apply a version number bump with the correct severity when publishing a version with breaking changes (1.x.x -> 2.x.x or 0.1.x -> 0.2.x).

@codecov
Copy link
Copy Markdown

codecov Bot commented May 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.0%. Comparing base (f63e084) to head (2cb02ae).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff            @@
##             main     #460    +/-   ##
========================================
  Coverage   100.0%   100.0%            
========================================
  Files         313      312     -1     
  Lines       24277    24915   +638     
========================================
+ Hits        24277    24915   +638     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@schgoo schgoo marked this pull request as ready for review June 2, 2026 18:44
Copilot AI review requested due to automatic review settings June 2, 2026 18:45
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR upgrades cachet telemetry from standalone events to a correlated span + event model, and introduces a callback-based handler API so consumers can build structured telemetry pipelines without tracing visitor boilerplate.

Changes:

  • Adds per-operation and per-tier spans/events, including timing fields and flags (coalesced/fallback).
  • Introduces CacheEventHandler with typed CacheTierEvent / CacheOperationEvent and request correlation via request_id.
  • Removes the old timing extension (telemetry/ext.rs) and updates wrappers/builders/examples accordingly.

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
crates/cachet/src/wrapper.rs Wraps tier operations in tier spans; records hit/miss/expired/etc with new telemetry API; propagates fallback flag.
crates/cachet/src/telemetry/mod.rs Exposes new handler module and updates module docs.
crates/cachet/src/telemetry/handler.rs Adds public callback API types (CacheEventHandler, event structs, RequestId).
crates/cachet/src/telemetry/ext.rs Removes old ClockExt/Timed timing utilities.
crates/cachet/src/telemetry/cache.rs Implements request correlation, span factories, event recording, and handler emission; adds extensive tests.
crates/cachet/src/telemetry/attributes.rs Updates public telemetry field constants and removes EVENT_FALLBACK in favor of FIELD_FALLBACK.
crates/cachet/src/refresh.rs Updates background refresh flow to use new spans/recording helpers.
crates/cachet/src/lib.rs Updates crate docs for span-based telemetry and re-exports handler types.
crates/cachet/src/fallback.rs Updates fallback get/promotion path to set fallback flag and rely on wrapper tier telemetry.
crates/cachet/src/eviction.rs Routes eviction/expiration causes to new telemetry helpers and adds a no-logging behavior test.
crates/cachet/src/cache.rs Adds operation-level spans/events with request_id correlation and stampede-protection coalesced flag recording.
crates/cachet/src/builder/transform.rs Threads fallback flag through transform tier construction and passes telemetry into Cache::new.
crates/cachet/src/builder/cache.rs Updates enable_logs behavior and adds event_handler(...) registration API.
crates/cachet/src/builder/buildable.rs Extends tier-building to propagate fallback metadata and passes telemetry into Cache::new.
crates/cachet/README.md Updates generated README telemetry docs to match the new span model.
crates/cachet/examples/telemetry_subscriber.rs Simplifies example to show default span+event output using fmt subscriber.
crates/cachet/examples/telemetry_accumulator.rs Adds example demonstrating request-correlated accumulation using CacheEventHandler + DashMap.
crates/cachet/Cargo.toml Adjusts logs feature and dependencies; adds dashmap dev-dependency; registers new example.
crates/cachet/benches/operations.rs Adds a benchmark configuration to measure overhead with an active tracing subscriber.
crates/cachet_memory/README.md Updates generated README dependency info.
Cargo.lock Adds dashmap and removes thread_aware from cachet crate dependency set.
.spelling Adds “moka’s” to dictionary.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/cachet/src/telemetry/cache.rs
Comment thread crates/cachet/src/telemetry/cache.rs
Comment thread crates/cachet/src/eviction.rs Outdated
@ralfbiedert
Copy link
Copy Markdown
Collaborator

Have you had a look at emit? Not sure about ETA, but once we release that I'd imagine we want to switch everything to that.

@schgoo
Copy link
Copy Markdown
Collaborator Author

schgoo commented Jun 3, 2026

Have you had a look at emit? Not sure about ETA, but once we release that I'd imagine we want to switch everything to that.

@ralfbiedert that's the idea (it's actually in the PR description)! I'm not sure when it will be available, though, but I'm hoping this provides a good middle ground.

Copilot AI review requested due to automatic review settings June 3, 2026 16:40
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 22 out of 23 changed files in this pull request and generated 7 comments.

Comment thread crates/cachet/src/telemetry/cache.rs
Comment thread crates/cachet/tests/cache.rs Outdated
Comment thread crates/cachet/tests/cache.rs Outdated
Comment thread crates/cachet/tests/cache.rs Outdated
Comment thread crates/cachet/src/telemetry/cache.rs
Comment thread crates/cachet/src/telemetry/cache.rs
Comment thread crates/cachet/Cargo.toml
Copilot AI review requested due to automatic review settings June 3, 2026 18:45
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 5 comments.

Comment thread crates/cachet/src/telemetry/cache.rs
Comment thread crates/cachet/src/telemetry/cache.rs
Comment thread crates/cachet/src/telemetry/cache.rs
Comment thread crates/cachet/src/telemetry/cache.rs
Comment thread crates/cachet/src/telemetry/cache.rs
Copilot AI review requested due to automatic review settings June 3, 2026 21:18
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 4 comments.

Comment thread crates/cachet/src/telemetry/cache.rs Outdated
Comment on lines +45 to +63
/// RAII guard that resets the thread-local request ID to 0 on drop,
/// ensuring cleanup even if the inner future panics during poll.
struct ResetRequestId;

impl Drop for ResetRequestId {
fn drop(&mut self) {
CURRENT_REQUEST_ID.with(|cell| cell.set(0));
}
}

/// Internal state for cache telemetry when no features are enabled (no-op).
#[cfg(not(any(feature = "logs", test)))]
#[derive(Clone, Debug, Default)]
pub(crate) struct CacheTelemetryInner;
impl<F: Future> Future for WithRequestId<F> {
type Output = F::Output;

#[cfg(not(any(feature = "logs", test)))]
#[expect(clippy::unused_self, reason = "Methods must match the logs-enabled impl signature")]
impl CacheTelemetryInner {
#[inline]
fn debug(&self, _: CacheName, _: &'static str, _: Duration) {}
fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
let this = self.project();
CURRENT_REQUEST_ID.with(|cell| cell.set(*this.request_id));
let _guard = ResetRequestId;
this.inner.poll(cx)
}
Comment thread crates/cachet/src/telemetry/cache.rs Outdated
Comment on lines +154 to +158
if self.logging_enabled {
let span = Span::current();
if !span.is_disabled() {
let duration_ns = saturating_nanos(duration);
span.record(attributes::FIELD_EVENT, event);
Comment on lines +173 to 188
fn record_info_with_duration(&self, event: &'static str, duration: Duration) {
#[cfg(any(feature = "logs", test))]
if self.logging_enabled {
let span = Span::current();
if !span.is_disabled() {
let duration_ns = saturating_nanos(duration);
span.record(attributes::FIELD_EVENT, event);
span.record(attributes::FIELD_DURATION_NS, duration_ns);
tracing::info!(cache.event = event, cache.duration_ns = duration_ns);
}
}
#[cfg(not(any(feature = "logs", test)))]
{
let _ = (event, duration);
}
}
Comment on lines +194 to +209
fn record_error_with_duration(&self, event: &'static str, duration: Duration) {
#[cfg(any(feature = "logs", test))]
if self.logging_enabled {
let span = Span::current();
if !span.is_disabled() {
let duration_ns = saturating_nanos(duration);
span.record(attributes::FIELD_EVENT, event);
span.record(attributes::FIELD_DURATION_NS, duration_ns);
tracing::error!(cache.event = event, cache.duration_ns = duration_ns);
}
}
#[cfg(not(any(feature = "logs", test)))]
{
let _ = (event, duration);
}
}
Copilot AI review requested due to automatic review settings June 4, 2026 17:15
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 22 out of 23 changed files in this pull request and generated 7 comments.

Comment on lines +736 to +739
.with_request_id(inner_id)
);
let mut inner_cx = Context::from_waker(waker);
assert!(matches!(inner.as_mut().poll(&mut inner_cx), Poll::Ready(())));
Comment on lines +748 to +751
.with_request_id(outer_id)
);
let mut outer_cx = Context::from_waker(waker);
assert!(matches!(outer.as_mut().poll(&mut outer_cx), Poll::Ready(())));
Comment on lines +308 to +312
pub(crate) fn record_fallback(&self) {
#[cfg(any(feature = "logs", test))]
if self.logging_enabled {
tracing::info!(cache.event = "cache.fallback", cache.fallback = true);
}
Comment on lines +152 to +157
fn record_debug_with_duration(&self, event: &'static str, duration: Duration) {
#[cfg(any(feature = "logs", test))]
if self.logging_enabled {
let duration_ns = saturating_nanos(duration);
tracing::debug!(cache.event = event, cache.duration_ns = duration_ns);
}
Comment on lines +6 to 9
//! Use these constants to filter or match cachet telemetry events in a custom
//! `tracing_subscriber::Layer`. Cache operations emit structured events with
//! `FIELD_NAME`, `FIELD_EVENT`, `FIELD_DURATION_NS`, and additional fields when applicable.
//!
Comment on lines 103 to 105
fn field_constants_match_tracing_field_names() {
// These constants must match the field names used in tracing macros in cache.rs.
assert_eq!(FIELD_NAME, "cache.name");
Comment on lines +233 to +254
let request_id = next_request_id();
let watch = self.clock.stopwatch();
async {
let (result, coalesced) = if let Some(mergers) = &self.mergers {
let owned = key.to_owned();
let storage = &self.storage;
let result = mergers
.get
.execute(key, move || async move { storage.get(&owned).await })
.await
.unwrap_or_else(|panicked| Err(Error::from_source(panicked)));
(result, true)
} else {
let owned = key.to_owned();
(self.storage.get(&owned).await, false)
};
self.telemetry
.complete_operation(request_id, self.name, "cache.get", watch.elapsed(), coalesced);
result
}
.with_request_id(request_id)
.await
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants