We can't accurately calculate the critically blocking path for async services. (Should we add `previous_sibling_span_id` as an optional span attribute?) #4079
FWIW, assuming this functionality doesn't yet exist, I could imagine some fancy build-time, sourcemap-like solution that looks at the AST and stores the ordering relationships between sibling function calls. We would then upload this data for a web UI viewer to use, which could draw the red-line relationships for us. Prior art: uploading JS sourcemaps to make production stack traces readable. Is anyone aware of any projects that do this, or anything else that solves this problem?

EDIT: maybe https://opentelemetry.io/docs/concepts/signals/traces/#span-links is what I want? Can we encode this in span links? Specifically, the ask would then be rephrased as "can we get the SDK to auto-populate span link data?" Although jaegertracing/jaeger#4516 suggests that even if we did that, it's unclear whether it would be useful with existing open source viewers yet.
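For reference, span links can already be attached by hand at span creation time with the OTel Python SDK. Here's a minimal sketch of the kind of sibling-ordering data an SDK could auto-populate (the `run_sequentially` helper and span names are illustrative assumptions, not an existing SDK feature):

```python
from opentelemetry import trace
from opentelemetry.trace import Link

tracer = trace.get_tracer(__name__)

def run_sequentially(operations):
    """Run (name, callable) pairs one after another, linking each span
    to the sibling that preceded it so a viewer could recover the order."""
    prev_ctx = None
    for name, op in operations:
        links = [Link(prev_ctx)] if prev_ctx is not None else []
        with tracer.start_as_current_span(name, links=links) as span:
            op()
        # Remember this span's context to link from the next sibling.
        prev_ctx = span.get_span_context()
```

Usage would be something like `run_sequentially([("bar", bar), ("baz", baz), ("qux", qux)])` — but whether viewers would actually render those links as a causal chain is the open question above.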
FWIW I think this is probably more the domain of telemetry backends than OTel itself. I'm aware of at least one (ServiceNow Observability, formerly Lightstep) that has robust critical path calculations in its trace waterfall. As far as how a backend would do it, I suppose it depends on how pedantic you want to get. In the case of many sibling spans, there is still a critical path, even if fixing it doesn't make the wall-clock time materially faster, because siblings of a similar duration remain. What you'd want out of that backend is also some way to identify siblings with a similar duration and highlight them in addition to the critical path.
Thanks @cartermp! I would be really curious how existing UIs calculate the critical path - the data currently provided in spans doesn't seem sufficient to do so accurately, since we don't encode causal relationships between sibling spans - only `t_start` and `t_end`, which isn't enough (hence why I'm wondering if this needs to be part of the spec itself).
For non-async traces, critical paths are pretty easy to calculate (just walk backwards!), but yes: because there's nothing between peer spans saying that they're peers, you need to "draw" the whole trace and make that observation yourself. I think the issue here is that you cannot calculate a "true and accurate" critical path. There's really a critical path with a block in the middle of it, made up of sibling spans that may or may not be part of that path in another trace. And practically speaking, even if they were consistently placed within a trace, reducing the latency of one of those operations often wouldn't materially impact the total latency, because a sibling of a similar duration would take its place. My view, at least, is that this calls for two things: a best-effort critical path calculation, and a way to visualize the places within that path that are impacted by siblings and so won't necessarily be part of the true path in other traces.
Jaeger has critical path support; you might be interested to take a look at the implementation. Interesting side note: systemd-analyze also has critical path analysis, and that implementation may be worth a look as well: https://www.freedesktop.org/software/systemd/man/latest/systemd-analyze.html
@magicmark can you describe a situation where sibling spans are blocking each other? |
Consider the following (synchronous) code:

```python
@trace
def foo():
    bar()
    baz()
    qux()
```

Assume each call does blocking work, i.e. you'd get a trace like this:

[trace screenshot omitted]
In this synchronous example, it's easy to work out the critical path and see what blocks what. But now what about this?

```python
@trace
async def foo():
    await asyncio.gather(bar(), baz())
    qux()
```

Now it starts to get a bit tricky; you'd see something like this:

[trace screenshot omitted]
Now imagine that `bar` and `baz` are themselves async methods with their own nested parallelism, and you get a trace like the one I posted in the OP. Looking at a large app, it becomes very difficult to understand from traces alone which functions are the best candidates for optimization.
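To make "nested parallelism" concrete, here's a hypothetical shape of code that produces such traces (function bodies and sleep durations are invented purely for illustration):

```python
import asyncio

async def bar():
    # bar fans out into two concurrent children
    await asyncio.gather(asyncio.sleep(0.2), asyncio.sleep(0.4))

async def baz():
    # baz mixes sequential and concurrent work
    await asyncio.sleep(0.1)
    await asyncio.gather(asyncio.sleep(0.3), asyncio.sleep(0.3))

async def foo():
    # siblings bar and baz overlap; their children overlap too,
    # so "who blocked whom" can't be read off start/end times alone
    await asyncio.gather(bar(), baz())

asyncio.run(foo())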
Sorry... I think I take this all back, actually! As has been pointed out, Jaeger shows the critical path (marked by the black line) out of the box when I download the latest version and run it locally. I wrote some sample code to model the trace in the OP, and it totally works - repro code below, fwiw, if anyone is interested.

...I think this is exactly what I'm looking for? Still not sure exactly how this works and how it accounts for the different possible interpretations of execution order (see OP), but I'll dig in more to understand. (Thanks @GLVSKiriti!)
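(The original repro was collapsed and isn't preserved here; the following is a plausible sketch of that kind of program, assuming the OTel Python SDK with an OTLP exporter pointed at a local Jaeger all-in-one — span names and durations are invented.)

```python
import asyncio

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export to a local Jaeger instance listening for OTLP gRPC on :4317
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

async def work(name: str, seconds: float) -> None:
    with tracer.start_as_current_span(name):
        await asyncio.sleep(seconds)

async def foo() -> None:
    # Mirrors the OP's shape: concurrent children, then a sequential one
    with tracer.start_as_current_span("foo"):
        await asyncio.gather(work("bar", 0.5), work("baz", 1.0))
        await work("qux", 0.3)

asyncio.run(foo())
```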
Yeah, the way most critical path systems work is that they walk backwards from the end of the trace and (sometimes via recursion) figure out what's most blocking. Uber has a good post on it here: https://www.uber.com/blog/crisp-critical-path-analysis-for-microservice-architectures/ However, as you're finding, where async spans or "interesting groups of siblings" are concerned, it's not always the same path from trace to trace, and there may be some inconsistencies.
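As a rough illustration of that backward walk (a simplified heuristic sketch, not the actual CRISP or Jaeger implementation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    start: float
    end: float

def critical_path(spans):
    """Walk backwards from each span's end, repeatedly descending into
    the child that finished latest. With concurrent siblings this is a
    guess: 'ended last' stands in for 'was blocking'."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)
    root = next(s for s in spans if s.parent_id is None)

    path = []
    def walk(span, cursor):
        path.append(span.span_id)
        kids = sorted(children.get(span.span_id, []),
                      key=lambda c: c.end, reverse=True)
        for kid in kids:
            if kid.end <= cursor:   # kid plausibly blocked what came after
                walk(kid, kid.end)
                cursor = kid.start  # skip siblings overlapped by this kid
    walk(root, root.end)
    return path
```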
@cartermp thanks! I haven't fully dug into it yet, but I'm wondering if I'm just running into what I originally suspected: the trace data fundamentally doesn't provide enough information to accurately work out the critical path. Working backwards makes sense, but in an async world you effectively have to "guess" which previous span was the blocking one? And sometimes we get it wrong, as seen above? (Minor discrepancies between different traces for the same page aside, this seems like a reproducible issue.)
Are you mostly interested in intra-process analysis (inside a single service, no RPCs)? Would profiling be a better solution here?
@austinlparker yes and no - in our current real-life situation, yes, we're mostly focused on the network calls emitted by one particular service. But in future we might also be interested in the critical path of spans throughout a whole set of asynchronous services. I do think it's reasonable to try to solve this with OTel by adding `previous_sibling_span_id`. (Profiles are kinda noisy and also don't clearly show you the critical path; e.g. async callbacks in Node don't carry the stack information for when/who the parent was.)
Hey @magicmark, following up on this: can you provide some more details on how the solution you envision (adding a `previous_sibling_span_id` attribute) would work?
When analyzing the distributed trace for a page, we often want to understand: "What is responsible for making this page slow?"

Specifically: what are the critically blocking spans? This can be quite tricky…
Example
Consider the following simplified trace for a webpage:

[trace screenshot omitted]

We can see here that improving the performance of `foo` won't immediately fix the overall TTFB, since we'd still be blocked on `bar`.

If we want to draw a "red line" through the trace representing the critically blocking spans, it's easy to do so here:

[annotated trace screenshot omitted]
All fine and dandy.
Now how about this one? Good luck :)

[trace screenshot omitted]

All of these `foo` spans are sibling spans. They all share the same parent (a server span of the request into `services/abc`). Consider:

- Is `foo5` blocked on `foo2`? Or does it just look that way? Maybe it's actually blocked on `foo3`, but we don't have an annotation for the work happening in between.
- Does `foo7` block on `foo4` or `foo6`? Or both?!

In a Node.js environment, or Python + asyncio, the above scenario is very real.
Can we solve this today?
No.
(See the edit history for this comment for my initial thoughts - I've since learned a lot about this problem!)

Tracing UIs may use a heuristic algorithm to calculate and visualize the critical path - but with the information currently encoded, this is only safe to do for synchronous services.
For async services, we don't have enough information to accurately work out the critical path where there are concurrent spans. See jaegertracing/jaeger-ui#2371 for a minimal repro of this issue demonstrated with Jaeger.
Why is this the case?
Sibling spans don't encode causal relationships between one another: we only have `time_start` and `time_end` for each span, and timestamps alone aren't enough to draw a direct blocking relationship.

The ask
I propose that we add something new to the OTel spec (`previous_sibling_span_id`?) to solve this problem. (The implementation of this is trickier - I have a few thoughts here, but wanted to get alignment on the problem itself first.)

Thanks!