Skip to content

tracing-subscriber: remove clone_span on enter #3289

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

conradludgate
Copy link

Motivation

I briefly discussed this on Discord. At work, we enabled continuous profiling which gives us a flamegraph on our production workloads. Inspecting one of our hot-paths, I was surprised to see Span::enter being a significant percent of the samples.

This is currently using tracing-subscriber = "0.3.19".

Concrete numbers:
7.7% of samples were found within the stacktrace of a Span::enter.
Of those samples, 22.8% were inside clone_span.
Collectively, that's 1.8% of total CPU time spent in clone_span as a result of Span::enter.

Screenshot 2025-05-27 at 21 28 44

These numbers don't surprise me. Our hot-loop of our service is a modified copy_bidirectional. Each poll will often be very short, so the Span::enter will take a not insignificant percentage of the time. And since it's only IO that can wake the task, tokio will likely move it around threads depending on which thread happened to be the driver, so any atomic operations (eg a fetch_add) will likely be a cache miss.

The rest of the time of clone_span seems to be within our various layers, we can address those somewhere else.

When looking through the code, I couldn't find a good justification for enter to need clone_span. I understand it might be defensive programming - someone can use the Collect api without using tracing::Span, and they can use it incorrectly. It would thus prevent some panics on exit (a layer using LookupSpan on exit and unwrapping) which would better prevent double-panics on drop handling during a panic. However, I see no unsafe code relying on spans definitely being alive during exit, and when using tracing::Span::enter which I think is the blessed case, the invariant that the span is alive during exit is upheld.

Solution

Remove clone_span from enter, remove try_close from exit.

@conradludgate conradludgate requested review from hawkw, davidbarsky and a team as code owners May 27, 2025 20:33
Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. If memory serves, you're correct that this was intended as defensive programming against incorrect "manual" use of the subscriber API.

As you point out, the Span type prevents the span's reference count from going to zero while it's entered, because Span::enter borrows the Span, which owns a clone of the reference count, and it cannot be dropped until it is exited. However, manual use of the lower-level tracing-core APIs could get this wrong and drop all "references" to a span while it is entered. This wouldn't cause memory corruption, since this isn't a "reference count" in the sense of keeping an allocation alive, but it would potentially result in messed up traces, if the span's ID is reused --- events might end up being associated with the wrong span incorrectly.

In practice, though, I'm not sure if it's worthwhile to guard against mistakes in manual use of the lower-level tracing-core enter/exit APIs if it comes at a significant performance cost. There are very few such manual uses of those APIs in the wild --- fewer than I had kind of anticipated. So, maybe it's worth removing this, and expecting that anyone implementing their own Span-like code using the tracing-core API should just be careful to not mess this up. It also eliminates the dispatch::get_default call in the registry's exit method, which is one of my least favorite things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants