resolve tracing span sometimes lingers even after response received #2435
I haven't followed the code to verify this, but a quick guess is this: when a request is started, the pool that reqwest uses will create two futures and race them: 1) ask the pool for a newly idle connection, 2) create a new connection. If the second option is started, but then the first option wins the race (especially because resolving looks like it's being slow), the second future will be spawned as a background task to finish up and insert the new connection into the pool (this reduces socket thrashing). Perhaps that means the span around resolving outlives the original request span?
Yeah, the guy who answered the issue in the tracing repo thought the same thing about the span outliving the request span.
Is it something we could convince you to take a look at?
It's not something I can dig into myself, but others are welcome. The relevant span is here: https://github.com/hyperium/hyper-util/blob/fcb8565bb8090d90349b04b2caa9c8128a8ae59f/src/client/legacy/connect/dns.rs#L121 I don't know if there's a mechanism to disconnect a span from its parent after creation, or some other solution.
Not after creation. Parent-child relationships are forever, and all parents live as long as their children.
I thought that maybe a simple tracing subscriber filter could filter out that span? At least that was my next idea.
Nvm, using a subscriber filter won't work since the span is created with …
Perhaps the …
Looking at the docs for …
This seems like a possible solution: hyperium/hyper-util#153
Looks good to me!
The new hyper-util v0.1.10 is out, but we're still seeing panics from tracing that get resolved when downgrading it to v0.1.7. Previously I was under the impression that this behavior was a combination of tokio-rs/tracing#2870 and this bug (since the version it started happening with matches exactly), now I'm not so certain anymore. @SorenHolstHansen Does the original bug still reproduce with hyper-util v0.1.10?
Hmmm, well the behavior is different now, but there still seems to be something off. Now, when I run the repro case, I don't get cases where the … I tried to then filter out the resolve by setting an env-filter of …, i.e. … I can't filter … I don't know if that is related to your panics @svix-jplatte, or if it's just a simple oversight from me?
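For context, a target-based env-filter directive of the kind attempted above would normally look like this. This is a hedged sketch: the target is assumed from the module path of the span linked earlier in the thread, `cargo run` stands in for however the server is started, and, as described above, this approach did not actually suppress the span:

```shell
# Hypothetical attempt: turn off events/spans from the DNS resolver module
# via a tracing_subscriber EnvFilter directive (target=level syntax).
RUST_LOG="info,hyper_util::client::legacy::connect::dns=off" cargo run
```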
If this is still causing panics, we can yank it out. It was meant to be helpful, not crash 🙈
Well I'm pretty sure it's not (entirely) …
Yea, I'm certain it's a problem in tracing; hyper is doing something reasonable. But the benefit that change was supposed to provide is not worth causing crashes. Let's revert.
We have noticed a weird situation in our production traces for our axum server: sometimes, even if an endpoint produces a response very quickly, the traces say that the endpoint took a very long time. This doesn't happen all the time, but when it does, it is always related to reqwest calls (i.e. the reqwest crate).
A trace could look something like this (this is how it looks in our prod metrics platform in Azure; you will see below that another platform shows the endpoint as shorter):
That is, the response is done quite quickly, but because the reqwest call seems to hang, the whole endpoint shows as taking a long time. The reqwest calls actually all finish very quickly: you can see that "stuff_after" continues shortly after, and we have used other approaches to confirm that they finish fast. So something is hanging in the reqwest client that makes this show as taking a long time.
We even added a field to the root request span (using tower tracing) that shows that the endpoint finishes quickly. So from the user's point of view, nothing is taking a long time, but for us, we can't trust our tracing.
I made a reproducible case here: https://github.com/SorenHolstHansen/reqwest_tracing_bug
To reproduce, I started a local OpenTelemetry platform (SigNoz or Jaeger), started the server in the repo I linked, and then ran:
seq 1 5000 | xargs -P0 -I{} curl http://127.0.0.1:8008
In signoz, you should then be able to see cases like these:
You can see in the bottom right, that the endpoint took 0 seconds, but resolve took 35 seconds.
I created a similar issue earlier in the tracing crate, which led me to believe that reqwest has long-lived spans.