Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resolve tracing span sometimes lingers even after response received #2435

Open
SorenHolstHansen opened this issue Sep 26, 2024 · 16 comments
Open

Comments

@SorenHolstHansen
Copy link

We have noticed a weird situation in our production traces for our axum server where sometimes an endpoint, even if the endpoint makes a response very quickly, that the traces say that the endpoint took a very long time. This doesn't happen all the time, but when it happens, it is always related to reqwest calls (i.e. the reqwest crate).

A trace could look something like this (this is how it looks in our prod metrics platform in azure, you will see below that another platform shows the endpoint as shorter)

endpoint:     -------------
something:    --
reqwest_call:   -----------
stuff_after:       --
response:           --

that is, the response is done quite quickly, but because the reqwest call seems to hang, the whole endpoint shows as taking a long time. Now, the reqwest calls actually all finish very quickly. You can see that the "stuff_after" continues shortly after, and we have done other approaches to show that they actually finish fast. So something is hanging in the reqwest client that makes this show as taking a long time.

We even added a field to the root request span (using tower tracing) that shows that the endpoint is done quickly. So from the users point of view, nothing is taking a long time, but for us, we can't trust our tracing.

I made a reproducible case here: https://github.com/SorenHolstHansen/reqwest_tracing_bug

To reproduce, I started a local open telemetry platform (signoz or jaeger), started the server in the repo i linked, and then ran seq 1 5000 | xargs -P0 -I{} curl http://127.0.0.1:8008

In signoz, you should then be able to see cases like these:
Screenshot 2024-09-26 at 12 18 23

You can see in the bottom right, that the endpoint took 0 seconds, but resolve took 35 seconds.

I created a similar issue earlier in the tracing crate, which led me to believe that reqwest has long lived spans.

@seanmonstar
Copy link
Owner

I haven't followed the code to verify this, but a quick guess is this: when a request is started, the pool that reqwest uses will create two futures, and race them: 1) ask the pool for a newly idle connection, 2) create a new connection. If the second option is started, but then the first option wins the race (especially because resolving looks like its being slow), the second future will be spawned as a background task to finish up and insert the new connection into the pool (this reduces socket thrashing).

Perhaps that means the span around resolving outlives the original request span?

@SorenHolstHansen
Copy link
Author

Yeah, the guy that answered the issue in the tracing repo thought the same thing about the span outliving the request span

@SorenHolstHansen
Copy link
Author

Is it something we could convince you to take a look at?

@seanmonstar
Copy link
Owner

It's not something I can dig into myself, but others are welcome. The relevant span is here: https://github.com/hyperium/hyper-util/blob/fcb8565bb8090d90349b04b2caa9c8128a8ae59f/src/client/legacy/connect/dns.rs#L121

I don't know if there's a mechanism to disconnect a span from it's parent after creation, or some other solution.

@mladedav
Copy link

Not after creation. Parent child relationships are forever and all parents live as long as their children.

@SorenHolstHansen
Copy link
Author

I thought that maybe a simple tracing subscriber filter could filter out that span? At least that was my next idea

@SorenHolstHansen
Copy link
Author

SorenHolstHansen commented Sep 30, 2024

Nvm, using a subscriber filter won't work since the span is created with .or_current() (see here), so can't disable it

@seanmonstar
Copy link
Owner

Perhaps the .or_current() could be removed, I'm not sure why it's there, I don't usually see that when declaring spans.

@SorenHolstHansen
Copy link
Author

Looking at the docs for .or_current() it looks like it is the recommended way for that kind of situation, perhaps it would be better if the span had no parent, but .follows_from() its parent instead. Not an expert though, so perhaps @mladedav has some insights

@seanmonstar
Copy link
Owner

This seems like a possible solution: hyperium/hyper-util#153

@seanmonstar seanmonstar changed the title reqwest seems to hang around after instrumented function is done resolve tracing span sometimes lingers even after response received Oct 1, 2024
@SorenHolstHansen
Copy link
Author

Looks good to me!

@svix-jplatte
Copy link

svix-jplatte commented Nov 1, 2024

The new hyper-util v0.1.10 is out, but we're still seeing panics from tracing that get resolved when downgrading it to v0.1.7. Previously I was under the impression that this behavior was a combination of tokio-rs/tracing#2870 and this bug (since the version it started happening with matches exactly), now I'm not so certain anymore.

@SorenHolstHansen Does the original bug still reproduce with hyper-util v0.1.10?

@SorenHolstHansen
Copy link
Author

SorenHolstHansen commented Nov 4, 2024

Hmmm, well the behavior is different now, but there still seems to be something off.

Now, when I run the repro case, I don't get cases where the http_request takes a short time, but the resolve takes a long time. Now, the http_request says it takes a long time as well when resolve does, i.e. see for instance this image

Screenshot 2024-11-04 at 09 47 59

I tried to then filter out the resolve by setting an env-filter of trace,hyper-util=off,hyper_util=off, but then I just got this
Screenshot 2024-11-04 at 09 46 22

i.e. I can't filter resolve away properly, which seems like a bug too, though I don't know where.

I don't know if that is related to your panics @svix-jplatte, or it's just a simple oversight from me?

@seanmonstar
Copy link
Owner

If this is still causing panics, we can yank it out. It was meant to be helpful, not crash 🙈

@svix-jplatte
Copy link

svix-jplatte commented Nov 4, 2024

Well I'm pretty sure it's not (entirely) hyper-util's fault. It seems to only happen with some async tasks using a non-default tracing subscriber / dispatcher, so I think the tracing bug I linked is also involved.

@seanmonstar
Copy link
Owner

Yea, I'm certain it's a problem in tracing, hyper is doing something reasonable.

But, the benefit that change was supposed to provide is not worth causing crashes. Let's revert.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants