enable tokio_unstable broadly #21964
Conversation
Force-pushed from 474b320 to 3fefb95.
Nightlies seem to pass on this, except …
@@ -3181,6 +3191,30 @@ impl Coordinator {
         Ok(Self::send_immediate_rows(rows))
     }

+    /// WARNING, ENTERING SPOOKY ZONE 3.0
@aalexandrov note the new subtlety here; if we switch to the fixed "broken" impl, I believe this goes away.
do you have a draft branch for fixing that? I might be able to take a look
I don't have a draft branch. I went as far as trying to save the STACK: OnceLock<Arc<dyn Subscriber + Send + Sync>> before calling .init(), but then got blocked. See my comment towards the end of #21809.
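For readers following along, a rough sketch of the attempted approach described above, assuming the goal is to keep a handle on the root subscriber even after it becomes the global default. The names are hypothetical, and the real code presumably calls tracing-subscriber's .init() rather than set_global_default:

use std::sync::{Arc, OnceLock};
use tracing::Subscriber;

// Hypothetical global slot for the root subscriber, captured before it is
// installed as the global default and becomes otherwise unreachable.
static STACK: OnceLock<Arc<dyn Subscriber + Send + Sync>> = OnceLock::new();

fn install(subscriber: impl Subscriber + Send + Sync + 'static) {
    let subscriber: Arc<dyn Subscriber + Send + Sync> = Arc::new(subscriber);
    let _ = STACK.set(Arc::clone(&subscriber));
    // `Arc<dyn Subscriber + Send + Sync>` itself implements `Subscriber`
    // (in recent tracing-core versions), so the same instance can back the
    // global dispatcher while STACK keeps a second handle to it.
    tracing::subscriber::set_global_default(subscriber)
        .expect("a global subscriber was already installed");
}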
sounds good, will look into this!
Oh, the Product Limits error may be a stack overflow; I'll have to look into that.
Increased the coordinator stack size again; I didn't dig into exactly what put us over the edge, but the additional tokio span creation could have caused it.
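For context, the kind of change being described is setting an explicit stack size when spawning a dedicated thread. The sketch below is illustrative only (the thread name and size are made up, and this is not the actual coordinator setup); deeper call chains and extra span machinery on that thread eat into whatever budget is chosen:

fn main() {
    // Hypothetical stack size; the real value lives in the coordinator's
    // thread-spawning code and was simply bumped in this PR.
    const STACK_SIZE: usize = 4 * 1024 * 1024;

    let handle = std::thread::Builder::new()
        .name("coordinator".into())
        .stack_size(STACK_SIZE)
        .spawn(|| {
            // The coordinator's main loop would run here; extra tokio/tracing
            // span creation increases how much of STACK_SIZE each call uses.
        })
        .expect("failed to spawn coordinator thread");

    handle.join().expect("coordinator thread panicked");
}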
I had to do a similar thing yesterday in #21973, although I thought the reason was that I moved around some optimization stages from the …
@@ -3181,6 +3191,30 @@ impl Coordinator {
         Ok(Self::send_immediate_rows(rows))
     }

+    /// WARNING, ENTERING SPOOKY ZONE 3.0
BTW, this is not the only such method. Look for ~_optimizer_pipeline methods in the same file.
+    /// will call a method on the _thread-local_ `Subscriber`, which may be backed with a different
+    /// `Registry`.
+    ///
+    /// At first glance, there is no obvious way this method leaks `Span`s, but ANY tokio
OK, so IIUC the invariant is that any async methods nested inside ~_optimizer_pipeline methods need to be called with a .with_subscriber(root_dispatch) using a specifically passed root_dispatch (none of the other ~_optimizer_pipeline methods have nested .await calls).
yes!
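To make the invariant above concrete, here is a minimal sketch using tracing's WithSubscriber future extension. The function names are made up and this is not the actual coordinator code; it only illustrates pinning a nested .await to an explicitly passed root Dispatch instead of the thread-local subscriber:

use tracing::dispatcher::Dispatch;
use tracing::instrument::WithSubscriber; // provides `.with_subscriber(...)` on futures

// `load_catalog_snapshot` stands in for any nested async call; the name is made up.
async fn load_catalog_snapshot() -> u64 {
    42
}

// Sketch of an optimizer-pipeline-style method: the nested `.await` is pinned
// to the explicitly passed root dispatcher, so any spans it creates (including
// tokio's hidden ones) land in the root `Registry` rather than whatever
// thread-local `Subscriber` happens to be current.
async fn spooky_optimizer_pipeline(root_dispatch: Dispatch) -> u64 {
    load_catalog_snapshot()
        .with_subscriber(root_dispatch.clone())
        .await
}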
@guswynn take a look at this diff. I reduced the value from the default (1000) to 950 to make the nightly build happy yesterday. After the upgrade to the new tokio console, it could be that we need to restrict this a bit more.
Product Limits failed again, though not from a stack overflow, and retrying fixed it.
@@ -305,9 +305,6 @@ def _build(args: argparse.Namespace, extra_programs: list[str] = []) -> int:
     env = dict(os.environ)
     command = _cargo_command(args, "build")
     features = []
-    if args.tokio_console:
-        features += ["tokio-console"]
-        env["RUSTFLAGS"] = env.get("RUSTFLAGS", "") + " --cfg=tokio_unstable"
Excited to see this go. You missed a spot though!
materialize/misc/python/materialize/cli/run.py, lines 103 to 107 in eea35ba:

parser.add_argument(
    "--tokio-console",
    help="Activate the Tokio console",
    action="store_true",
)
I am leaving this so people don't need to know to pass --tokio-console-listen-addr=127.0.0.1:6669, but I can remove it in a followup if need be.
Oh, I think we should enable that listen address by default. No reason not to, right?
It turns off the tracing fast-path optimization, since tokio-console traces are trace!-level, and it may use an arbitrary amount of memory. Is that considered reasonable as the default?
Ah, got it, probably good to leave it disabled by default then.
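For reference, a rough sketch of how an explicit console listen address is typically wired up with the console-subscriber crate. Treat the exact builder calls as an assumption about that crate's API; the Materialize wiring sits behind the --tokio-console flags discussed above rather than looking like this:

use std::net::SocketAddr;

use tracing_subscriber::prelude::*;

fn main() {
    // 127.0.0.1:6669 is tokio-console's conventional default port, matching the
    // --tokio-console-listen-addr value mentioned above.
    let addr: SocketAddr = "127.0.0.1:6669".parse().expect("valid socket address");

    // Building the layer explicitly (instead of `console_subscriber::init()`)
    // keeps the opt-in nature visible: the layer forces trace!-level events
    // through and buffers per-task state in memory, which is why it stays off
    // by default.
    let console_layer = console_subscriber::ConsoleLayer::builder()
        .server_addr(addr)
        .spawn();

    tracing_subscriber::registry().with(console_layer).init();

    // ... run the tokio runtime / application here ...
}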
This needs to be rebased, as the …
Force-pushed from eea35ba to 2857f8d.
Might be causing this on main?
Makes me wonder: now that we have workspace-hack, are the cfg features in mz-ore getting us much?
yeah, probably
This PR is a followup to incident 70 that broadly enables --cfg=tokio_unstable and the tokio tracing feature across the codebase. This enables two main things:

- tokio-console in production without special builds
- various other nice lil APIs, like JoinSet and Handle::id

I did an audit of the tokio codebase to see what this cfg and the tracing feature actually change about the crate, to ensure that nothing was unsafe. I found:

- disable_lifo and deterministic rng seeding
- a --cfg flag, as it may be expensive and very unstable. Might be neat in the future.
- tracing Spans attached to tasks and MOST sync resources

The main concerns here are the CPU/memory cost of metric collection and all the new tracing. The metrics don't appear expensive, and with a brief look, it seemed like most of it was already collected. The CPU/memory cost of the tracing spans is hard to quantify, but a feature-benchmark run on an earlier version passed.

The other concern is related to the second commit. The commit has more information, but basically, tokio adding additional hidden spans to effectively all sync resources (and tasks) can cause bugs when multiple Dispatches exist in the same process. This only happens in two locations, which now have scary comments and have been audited.

These new tokio spans also may increase the chance that we hit the bug described in the original version of #20706, but the tracing fast-path test does pass on this PR, so I am inclined to move forward with it. I don't have proof or a complete understanding of that bug, so this is just speculation (cc @pH14 @danhhz).
Motivation
Tips for reviewer
see above
Checklist
If this PR … the $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.