
enable tokio_unstable broadly #21964

Merged
merged 3 commits into MaterializeInc:main from tokio-console-default
Sep 28, 2023

Conversation

guswynn
Contributor

@guswynn guswynn commented Sep 25, 2023

This PR is a follow-up to incident 70 that broadly enables `--cfg=tokio_unstable` and the tokio `tracing` feature across the codebase. This enables two main things:

  • Allows us to use the tokio-console in production without special builds
  • Allows us to expose runtime-level metrics from the runtime, which we can eventually expose to prometheus

In addition, it gives us various other nice little APIs, like `JoinSet` and `Handle::id`.

I audited the tokio codebase to see what this cfg and the tracing feature actually change about the crate, to ensure that none of it is unsafe. I found:

  • Various additional APIs that could break in the future.
    • We can deal with these easily when we upgrade tokio versions
  • A new alternative runtime that we won't use.
  • Some additional runtime options that we MAY want to try in the future, like disable_lifo and deterministic rng seeding (see the sketch after this list)
  • Runtime-level metrics
    • As far as I can tell, at least some of these are already collected but not exposed, which means the cpu/memory cost should not be a problem. Some more expensive ones are off by default.
  • The ability to dump stack traces of tasks. This is behind an additional --cfg flag as it may be expensive and very unstable. Might be neat in the future.
  • tracing Spans attached to tasks and MOST sync resources
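
For illustration, here is a minimal sketch of what opting into those unstable builder knobs looks like. This is not code from this PR; the method names come from tokio's unstable API surface and can change between versions, which is exactly the breakage risk called out above.

```rust
// Sketch only: these builder methods are gated behind --cfg tokio_unstable,
// and their exact signatures may differ across tokio versions.
use tokio::runtime::{Builder, RngSeed, Runtime};

fn build_runtime() -> std::io::Result<Runtime> {
    Builder::new_multi_thread()
        .enable_all()
        // Opt out of the per-worker LIFO slot (the disable_lifo knob above).
        .disable_lifo_slot()
        // Deterministically seed tokio's internal RNG (illustrative seed bytes).
        .rng_seed(RngSeed::from_bytes(b"example-seed"))
        .build()
}
```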

The main concerns here are the CPU/memory cost of metric collection and all the new tracing. The metrics don't appear expensive, and from a brief look it seemed like most of them are already collected. The CPU/memory cost of the tracing spans is hard to quantify, but a feature-benchmark run on an earlier version passed.
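
To make the metrics point concrete, a minimal sketch of reading a few runtime metrics follows. The accessors come from tokio's unstable `RuntimeMetrics` API and vary by version; the eventual prometheus wiring is out of scope here.

```rust
// Sketch: sampling runtime-level metrics from inside the runtime.
// Accessor names follow tokio's unstable RuntimeMetrics API and may
// differ across tokio versions.
use tokio::runtime::Handle;

fn report_runtime_metrics() {
    let metrics = Handle::current().metrics();
    println!("workers: {}", metrics.num_workers());
    println!("injection queue depth: {}", metrics.injection_queue_depth());
    for worker in 0..metrics.num_workers() {
        println!("worker {worker} park count: {}", metrics.worker_park_count(worker));
    }
}
```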

The other concern is related to the second commit. The commit message has more information, but basically, tokio attaching additional hidden spans to effectively all sync resources (and tasks) can cause bugs when multiple `Dispatch`es exist in the same process. This only happens in two locations, which now have scary comments and have been audited.

These new tokio spans may also increase the chance that we hit the bug described in the original version of #20706, but the tracing fast-path test does pass on this PR, so I am inclined to move forward with it. I don't have proof or a complete understanding of that bug, so this is just speculation (cc @pH14 @danhhz).

Motivation

  • This PR adds a known-desirable feature.

Tips for reviewer

see above

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@guswynn guswynn force-pushed the tokio-console-default branch 3 times, most recently from 474b320 to 3fefb95 on September 27, 2023 17:34
@guswynn
Contributor Author

guswynn commented Sep 27, 2023

Nightlies seem to pass on this, except that Product Limits failed at least once. Looks transient; retrying. I also started a Release Qualification build: https://buildkite.com/materialize/release-qualification/builds/338

@guswynn guswynn marked this pull request as ready for review September 27, 2023 19:59
@guswynn guswynn requested review from benesch and a team as code owners September 27, 2023 19:59
@guswynn guswynn requested review from a team and ParkMyCar September 27, 2023 19:59
@guswynn guswynn changed the title enable tokio_console broadly enable tokio_unstable broadly Sep 27, 2023
@guswynn guswynn requested review from danhhz and bkirwi September 27, 2023 20:06
@@ -3181,6 +3191,30 @@ impl Coordinator {
Ok(Self::send_immediate_rows(rows))
}

/// WARNING, ENTERING SPOOKY ZONE 3.0
Contributor Author

@aalexandrov note the new subtlety here; if we switch to the fixed "broken" impl, I believe this goes away.

do you have a draft branch for fixing that? I might be able to take a look

Contributor

@aalexandrov aalexandrov Sep 27, 2023

I don't have a draft branch. I went as far as trying to save the `STACK: OnceLock<Arc<dyn Subscriber + Send + Sync>>` before calling `.init()`, but then got blocked. See my comment towards the end of #21809.

Contributor Author

sounds good, will look into this!

@guswynn
Contributor Author

guswynn commented Sep 27, 2023

Oh, the Product Limits error may be a stack overflow; I'll have to look into that.

@guswynn
Contributor Author

guswynn commented Sep 27, 2023

Increased the coordinator stack size again; I didn't dig into exactly what put us over the edge, but additional tokio span creation could cause it
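
For reference, bumping a thread's stack size uses the standard `std::thread::Builder` knob; the thread name and size below are illustrative rather than the coordinator's actual values.

```rust
use std::thread;

// Illustrative values only; the coordinator's real thread name and stack
// size live in the materialize codebase, not here.
const STACK_SIZE: usize = 16 << 20; // 16 MiB

fn spawn_coordinator_thread() -> thread::JoinHandle<()> {
    thread::Builder::new()
        .name("coordinator".into())
        .stack_size(STACK_SIZE)
        .spawn(|| {
            // run the coordinator's main loop here
        })
        .expect("failed to spawn coordinator thread")
}
```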

@aalexandrov
Contributor

Increased the coordinator stack size again.

I had to do a similar thing yesterday in #21973, although I thought the reason was that I moved some optimization stages from the plan_~ to the sequence_~ functions.

@@ -3181,6 +3191,30 @@ impl Coordinator {
Ok(Self::send_immediate_rows(rows))
}

/// WARNING, ENTERING SPOOKY ZONE 3.0
Contributor

BTW, this is not the only such method. Look for ~_optimizer_pipeline methods in the same file.

/// will call a method on the _thread-local_ `Subscriber`, which may be backed with a different
/// `Registry`.
///
/// At first glance, there is no obvious way this method leaks `Span`s, but ANY tokio
Contributor

@aalexandrov aalexandrov Sep 27, 2023

OK, so IIUC the invariant is that any async methods nested inside ~_optimizer_pipeline methods need to be called with a .with_subscriber(root_dispatch) using a specifically passed root_dispatch (none of the other ~_optimizer_pipeline methods have nested .await calls).

Contributor Author

yes!
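
For readers following the thread, a minimal sketch of that invariant follows. The function names are made up for illustration; only the `.with_subscriber(root_dispatch)` pattern reflects the convention being discussed.

```rust
// Sketch: pin a nested future (and any spans tokio creates while it runs)
// to an explicitly passed dispatcher instead of the thread-local one.
use tracing::dispatcher::Dispatch;
use tracing::instrument::WithSubscriber;

async fn optimizer_stage(root_dispatch: Dispatch) {
    nested_async_work()
        // All spans entered while polling this future go to root_dispatch,
        // not to whatever Subscriber happens to be installed thread-locally.
        .with_subscriber(root_dispatch)
        .await;
}

async fn nested_async_work() {
    // placeholder for the real work
}
```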

@aalexandrov
Contributor

aalexandrov commented Sep 27, 2023

Increased the coordinator stack size again; I didn't dig into exactly what put us over the edge, but additional tokio span creation could cause it

@guswynn take a look at this diff. I reduced the value from the default (1000) to 950 to make the nightly build happy yesterday.

After the upgrade to the new tokio console, it could be that we need to restrict this a bit more.

@guswynn
Contributor Author

guswynn commented Sep 27, 2023

Product Limits failed again, though not from a stack overflow; retrying fixed it.

@@ -305,9 +305,6 @@ def _build(args: argparse.Namespace, extra_programs: list[str] = []) -> int:
     env = dict(os.environ)
     command = _cargo_command(args, "build")
     features = []
-    if args.tokio_console:
-        features += ["tokio-console"]
-        env["RUSTFLAGS"] = env.get("RUSTFLAGS", "") + " --cfg=tokio_unstable"
Contributor

Excited to see this go. You missed a spot though!

parser.add_argument(
    "--tokio-console",
    help="Activate the Tokio console",
    action="store_true",
)

Contributor Author

I am leaving this so people don't need to know to pass --tokio-console-listen-addr=127.0.0.1:6669, but I can remove it in a follow-up if need be.

Contributor

Oh, I think we should enable that listen address by default. No reason not to, right?

Contributor Author

It turns off the tracing fast-path optimization, since tokio-console traces are trace!-level, and it may use an arbitrary amount of memory. Is that considered reasonable as the default?

Contributor

Ah, got it, probably good to leave it disabled by default then.
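
For anyone who does want the console locally, wiring a layer at that listen address looks roughly like this. It follows console-subscriber's documented builder API, but it is a sketch rather than the code in this repo, and builder details can differ across console-subscriber releases.

```rust
// Sketch: install a tokio-console layer listening on 127.0.0.1:6669.
// Requires the binary to be built with --cfg tokio_unstable.
use std::net::SocketAddr;
use tracing_subscriber::prelude::*;

fn init_tracing_with_console() {
    let addr: SocketAddr = "127.0.0.1:6669".parse().expect("valid socket addr");
    let console_layer = console_subscriber::ConsoleLayer::builder()
        .server_addr(addr)
        .spawn();
    tracing_subscriber::registry()
        .with(console_layer)
        .init();
}
```

Because the console layer needs trace!-level task instrumentation, turning it on does defeat the fast-path filtering mentioned above, which is why it stays opt-in.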

@guswynn
Contributor Author

guswynn commented Sep 28, 2023

This needs to be rebased, as the cargo-test failure should be fixed on main; the failing feature-benchmark (InsertMultiRow) is a known regression.

@guswynn guswynn force-pushed the tokio-console-default branch from eea35ba to 2857f8d on September 28, 2023 16:36
@guswynn guswynn enabled auto-merge (rebase) September 28, 2023 16:38
@guswynn guswynn merged commit 4a9fcbb into MaterializeInc:main Sep 28, 2023
@danhhz
Contributor

danhhz commented Sep 28, 2023

Might be causing this on main?

error[E0432]: unresolved import `crate::netio`
  --> src/ore/src/tracing.rs:56:12
   |
56 | use crate::netio::SocketAddr;
   |            ^^^^^ could not find `netio` in the crate root

Makes me wonder: now that we have workspace-hack, are the cfg features in mz-ore getting us much?

@guswynn
Contributor Author

guswynn commented Sep 28, 2023

Makes me wonder: now that we have workspace-hack, are the cfg features in mz-ore getting us much?

yeah probably
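
For readers following along, the E0432 above is the classic feature-gating failure mode: a module compiled only under a cargo feature, imported unconditionally from somewhere else. A hypothetical single-file illustration (the feature name is an assumption, not mz-ore's real feature graph):

```rust
// Hypothetical crate root: the module only exists when the feature is enabled.
#[cfg(feature = "network")]
pub mod netio {
    pub use std::net::SocketAddr;
}

// An unconditional `use crate::netio::SocketAddr;` would fail with E0432
// ("could not find `netio` in the crate root") whenever the crate is built
// without `--features network`. The fix is to gate the import (or the whole
// consuming module) on the same feature:
#[cfg(feature = "network")]
pub use crate::netio::SocketAddr;
```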

@guswynn guswynn deleted the tokio-console-default branch October 2, 2023 21:23