
enable tokio_unstable broadly #21964

Merged
merged 3 commits into MaterializeInc:main from tokio-console-default
Sep 28, 2023

Conversation

guswynn
Contributor

@guswynn guswynn commented Sep 25, 2023

This PR is a follow-up to incident 70 that broadly enables `--cfg=tokio_unstable` and the tokio `tracing` feature across the codebase. This enables two main things:

  • Allows us to use the tokio-console in production without special builds
  • Allows us to expose runtime-level metrics from the runtime, which we can eventually expose to prometheus

In addition, it gives us various other nice little APIs, like `JoinSet` and `Handle::id`.

I audited the tokio codebase to see what this cfg and the tracing feature actually change about the crate, to ensure that none of it is unsafe. I found:

  • Various additional APIs that could break in the future.
    • We can deal with these easily when we upgrade tokio versions
  • A new alternative runtime that we won't use.
  • Some additional runtime options that we MAY want to try in the future, like disable_lifo and deterministic rng seeding (see the sketch after this list)
  • Runtime-level metrics
    • As far as I can tell, at least some of these are already collected but not exposed, which means the cpu/memory cost should not be a problem. Some more expensive ones are off by default.
  • The ability to dump stack traces of tasks. This is behind an additional --cfg flag as it may be expensive and very unstable. Might be neat in the future.
  • tracing Spans attached to tasks and MOST sync resources
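
For illustration, here is a minimal sketch of what opting into those unstable builder knobs looks like. This is not code from this PR; the method names come from tokio's unstable API surface and can change between versions, which is exactly the breakage risk called out above.

```rust
// Sketch only: these builder methods are gated behind --cfg tokio_unstable,
// and their exact signatures may differ across tokio versions.
use tokio::runtime::{Builder, RngSeed, Runtime};

fn build_runtime() -> std::io::Result<Runtime> {
    Builder::new_multi_thread()
        .enable_all()
        // Opt out of the per-worker LIFO slot (the disable_lifo knob above).
        .disable_lifo_slot()
        // Deterministically seed tokio's internal RNG (illustrative seed bytes).
        .rng_seed(RngSeed::from_bytes(b"example-seed"))
        .build()
}
```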

The main concerns here are the CPU/memory cost of metric collection and all the new tracing. The metrics don't appear expensive, and from a brief look it seemed like most of them are already collected. The CPU/memory cost of the tracing spans is hard to quantify, but a feature-benchmark run on an earlier version passed.
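
To make the metrics point concrete, a minimal sketch of reading a few runtime metrics follows. The accessors come from tokio's unstable `RuntimeMetrics` API and vary by version; the eventual prometheus wiring is out of scope here.

```rust
// Sketch: sampling runtime-level metrics from inside the runtime.
// Accessor names follow tokio's unstable RuntimeMetrics API and may
// differ across tokio versions.
use tokio::runtime::Handle;

fn report_runtime_metrics() {
    let metrics = Handle::current().metrics();
    println!("workers: {}", metrics.num_workers());
    println!("injection queue depth: {}", metrics.injection_queue_depth());
    for worker in 0..metrics.num_workers() {
        println!("worker {worker} park count: {}", metrics.worker_park_count(worker));
    }
}
```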

The other concern is related to the second commit. The commit message has more information, but basically, tokio attaching additional hidden spans to effectively all sync resources (and tasks) can cause bugs when multiple `Dispatch`es exist in the same process. This only happens in two locations, which now have scary comments and have been audited.

These new tokio spans may also increase the chance that we hit the bug described in the original version of #20706, but the tracing fast-path test does pass on this PR, so I am inclined to move forward with it. I don't have proof or a complete understanding of that bug, so this is just speculation (cc @pH14 @danhhz).

Motivation

  • This PR adds a known-desirable feature.

Tips for reviewer

see above

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@guswynn guswynn force-pushed the tokio-console-default branch 3 times, most recently from 474b320 to 3fefb95 on September 27, 2023 17:34
@guswynn
Contributor Author

guswynn commented Sep 27, 2023

Nightlies seem to pass on this, except that Product Limits failed at least once. Looks transient; retrying. I also started a Release Qualification build: https://buildkite.com/materialize/release-qualification/builds/338

@guswynn guswynn marked this pull request as ready for review September 27, 2023 19:59
@guswynn guswynn requested review from benesch and a team as code owners September 27, 2023 19:59
@guswynn guswynn requested review from a team and ParkMyCar September 27, 2023 19:59
@guswynn guswynn changed the title enable tokio_console broadly enable tokio_unstable broadly Sep 27, 2023
@guswynn guswynn requested review from danhhz and bkirwi September 27, 2023 20:06
@@ -3181,6 +3191,30 @@ impl Coordinator {
Ok(Self::send_immediate_rows(rows))
}

/// WARNING, ENTERING SPOOKY ZONE 3.0
Contributor Author

@aalexandrov note the new subtlety here; if we switch to the fixed "broken" impl, I believe this goes away.

do you have a draft branch for fixing that? I might be able to take a look

Contributor

@aalexandrov aalexandrov Sep 27, 2023

I don't have a draft branch. I went as far as trying to save the `STACK: OnceLock<Arc<dyn Subscriber + Send + Sync>>` before calling `.init()`, but then got blocked. See my comment towards the end of #21809.

Contributor Author

sounds good, will look into this!

@guswynn
Contributor Author

guswynn commented Sep 27, 2023

Oh, the Product Limits error may be a stack overflow; I'll have to look into that.

@guswynn
Contributor Author

guswynn commented Sep 27, 2023

Increased the coordinator stack size again; I didn't dig into exactly what put us over the edge, but additional tokio span creation could cause it
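
For reference, bumping a thread's stack size uses the standard `std::thread::Builder` knob; the thread name and size below are illustrative rather than the coordinator's actual values.

```rust
use std::thread;

// Illustrative values only; the coordinator's real thread name and stack
// size live in the materialize codebase, not here.
const STACK_SIZE: usize = 16 << 20; // 16 MiB

fn spawn_coordinator_thread() -> thread::JoinHandle<()> {
    thread::Builder::new()
        .name("coordinator".into())
        .stack_size(STACK_SIZE)
        .spawn(|| {
            // run the coordinator's main loop here
        })
        .expect("failed to spawn coordinator thread")
}
```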

@aalexandrov
Contributor

Increased the coordinator stack size again.

I had to do a similar thing yesterday in #21973, although I thought the reason was that I moved some optimization stages from the plan_~ to the sequence_~ functions.

@@ -3181,6 +3191,30 @@ impl Coordinator {
Ok(Self::send_immediate_rows(rows))
}

/// WARNING, ENTERING SPOOKY ZONE 3.0
Contributor

BTW, this is not the only such method. Look for ~_optimizer_pipeline methods in the same file.

/// will call a method on the _thread-local_ `Subscriber`, which may be backed with a different
/// `Registry`.
///
/// At first glance, there is no obvious way this method leaks `Span`s, but ANY tokio
Contributor

@aalexandrov aalexandrov Sep 27, 2023

OK, so IIUC the invariant is that any async methods nested inside ~_optimizer_pipeline methods need to be called with a .with_subscriber(root_dispatch) using a specifically passed root_dispatch (none of the other ~_optimizer_pipeline methods have nested .await calls).

Contributor Author

yes!
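
For readers following the thread, a minimal sketch of that invariant follows. The function names are made up for illustration; only the `.with_subscriber(root_dispatch)` pattern reflects the convention being discussed.

```rust
// Sketch: pin a nested future (and any spans tokio creates while it runs)
// to an explicitly passed dispatcher instead of the thread-local one.
use tracing::dispatcher::Dispatch;
use tracing::instrument::WithSubscriber;

async fn optimizer_stage(root_dispatch: Dispatch) {
    nested_async_work()
        // All spans entered while polling this future go to root_dispatch,
        // not to whatever Subscriber happens to be installed thread-locally.
        .with_subscriber(root_dispatch)
        .await;
}

async fn nested_async_work() {
    // placeholder for the real work
}
```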

@aalexandrov
Contributor

aalexandrov commented Sep 27, 2023

Increased the coordinator stack size again; I didn't dig into exactly what put us over the edge, but additional tokio span creation could cause it

@guswynn take a look at this diff. I reduced the value from the default (1000) to 950 to make the nightly build happy yesterday.

After the upgrade to the new tokio console, it could be that we need to restrict this a bit more.

@guswynn
Contributor Author

guswynn commented Sep 27, 2023

Product Limits failed again, though not from a stack overflow; retrying fixed it.

@@ -305,9 +305,6 @@ def _build(args: argparse.Namespace, extra_programs: list[str] = []) -> int:
     env = dict(os.environ)
     command = _cargo_command(args, "build")
     features = []
-    if args.tokio_console:
-        features += ["tokio-console"]
-        env["RUSTFLAGS"] = env.get("RUSTFLAGS", "") + " --cfg=tokio_unstable"
Contributor

Excited to see this go. You missed a spot though!

parser.add_argument(
    "--tokio-console",
    help="Activate the Tokio console",
    action="store_true",
)

Contributor Author

I am leaving this so people don't need to know to pass --tokio-console-listen-addr=127.0.0.1:6669, but I can remove it in a follow-up if need be.

Contributor

Oh, I think we should enable that listen address by default. No reason not to, right?

Contributor Author

It turns off the tracing fast-path optimization, since tokio-console traces are trace!-level, and it may use an arbitrary amount of memory. Is that considered reasonable as the default?

Contributor

Ah, got it, probably good to leave it disabled by default then.
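
For anyone who does want the console locally, wiring a layer at that listen address looks roughly like this. It follows console-subscriber's documented builder API, but it is a sketch rather than the code in this repo, and builder details can differ across console-subscriber releases.

```rust
// Sketch: install a tokio-console layer listening on 127.0.0.1:6669.
// Requires the binary to be built with --cfg tokio_unstable.
use std::net::SocketAddr;
use tracing_subscriber::prelude::*;

fn init_tracing_with_console() {
    let addr: SocketAddr = "127.0.0.1:6669".parse().expect("valid socket addr");
    let console_layer = console_subscriber::ConsoleLayer::builder()
        .server_addr(addr)
        .spawn();
    tracing_subscriber::registry()
        .with(console_layer)
        .init();
}
```

Because the console layer needs trace!-level task instrumentation, turning it on does defeat the fast-path filtering mentioned above, which is why it stays opt-in.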

@guswynn
Contributor Author

guswynn commented Sep 28, 2023

This needs to be rebased, as the cargo-test failure should be fixed on main; the failing feature-benchmark (InsertMultiRow) is a known regression.

@guswynn guswynn force-pushed the tokio-console-default branch from eea35ba to 2857f8d on September 28, 2023 16:36
@guswynn guswynn enabled auto-merge (rebase) September 28, 2023 16:38
@guswynn guswynn merged commit 4a9fcbb into MaterializeInc:main Sep 28, 2023
@danhhz
Contributor

danhhz commented Sep 28, 2023

Might be causing this on main?

error[E0432]: unresolved import `crate::netio`
  --> src/ore/src/tracing.rs:56:12
   |
56 | use crate::netio::SocketAddr;
   |            ^^^^^ could not find `netio` in the crate root

Makes me wonder: now that we have workspace-hack, are the cfg features in mz-ore getting us much?

@guswynn
Contributor Author

guswynn commented Sep 28, 2023

Makes me wonder: now that we have workspace-hack, are the cfg features in mz-ore getting us much?

yeah probably
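
For readers following along, the E0432 above is the classic feature-gating failure mode: a module compiled only under a cargo feature, imported unconditionally from somewhere else. A hypothetical single-file illustration (the feature name is an assumption, not mz-ore's real feature graph):

```rust
// Hypothetical crate root: the module only exists when the feature is enabled.
#[cfg(feature = "network")]
pub mod netio {
    pub use std::net::SocketAddr;
}

// An unconditional `use crate::netio::SocketAddr;` would fail with E0432
// ("could not find `netio` in the crate root") whenever the crate is built
// without `--features network`. The fix is to gate the import (or the whole
// consuming module) on the same feature:
#[cfg(feature = "network")]
pub use crate::netio::SocketAddr;
```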

@guswynn guswynn deleted the tokio-console-default branch October 2, 2023 21:23