adapter/persist: Make columns of Materialized Views nullable #29779
Conversation
// MIGRATION: We previously registered schemas that have since changed
// nullability. We've made changes to stabilize the schemas and are just
// forgetting all of the old values.
reserved 18;
Hmm, I think we have to do this also for StateDiff. And then once that's done, I worry about getting "state diff does not apply cleanly" errors if we just forget the old fields. It's counter-intuitive, I know, but I do think this is less risky if we rename all the old fields but keep all the diff-apply logic for them.
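(A toy sketch of that shape, with made-up types rather than persist's real State/StateDiff: the old field is renamed but keeps its slot and its diff-apply arm, so diffs written by older versions still apply cleanly.)

use std::collections::BTreeMap;

// Illustrative only; names and types are stand-ins for persist's State/StateDiff.
#[derive(Default)]
struct State {
    // New registrations land here.
    schemas: BTreeMap<u64, String>,
    // Renamed rather than forgotten, so old diffs still have a field to apply to.
    deprecated_schemas: BTreeMap<u64, String>,
}

enum StateFieldDiff {
    InsertSchema(u64, String),
    // Emitted by older versions; we keep handling it instead of dropping it.
    InsertDeprecatedSchema(u64, String),
}

fn apply(state: &mut State, diff: StateFieldDiff) {
    match diff {
        StateFieldDiff::InsertSchema(id, schema) => {
            state.schemas.insert(id, schema);
        }
        StateFieldDiff::InsertDeprecatedSchema(id, schema) => {
            state.deprecated_schemas.insert(id, schema);
        }
    }
}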
That makes sense! I figured just bumping this field probably wasn't enough :)
I think then we'll also want to re-introduce the persist_schema_register and persist_schema_require dyncfgs? We'll at least need persist_schema_require because shards will no longer have a schema registered to them, and I think we need persist_schema_register for the typical "adding a new field to State" workflow?
Yeah, probably safest to add them both back. Looks like they still exist in LD. I'd probably just pull the old definitions out of git history and then not even bother rolling them out slowly, just default to on
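(For concreteness, a rough sketch of what re-adding the two dyncfgs with an on-by-default value might look like, assuming mz_dyncfg's Config::new(name, default, description) pattern; the descriptions below are made up, and the real definitions would be pulled from git history as suggested above.)

use mz_dyncfg::Config;

// Hypothetical re-introduction of the two flags, defaulted to on rather than
// rolled out slowly. Names match the old dyncfgs; descriptions are illustrative.
pub const PERSIST_SCHEMA_REGISTER: Config<bool> = Config::new(
    "persist_schema_register",
    true,
    "Whether to register a schema when creating a write handle.",
);

pub const PERSIST_SCHEMA_REQUIRE: Config<bool> = Config::new(
    "persist_schema_require",
    true,
    "Whether a registered schema is required to write structured data.",
);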
Heads up that this is different from what I'd expected based on our conversations last week - in my memory we'd decided that tables etc. (and nested records) would also have columns marked nullable, since marking columns non-nullable provides no value but prevents us from ever making that column nullable in the future. (Totally possible I misunderstood or you just have an intermediate state here, but figured I'd better mention it just in case.)
This reverts commit 9b25df1.
It's a great callout! Just updated the description with my thinking here, but basically
Ah right 🤦! Yeah, we have to default to Ideally, we'd have instructions for the release eng to flip the flag and restart all the envds during the maintenance window, so we don't have to wait a week to verify that everything works okay.
You're totally right! The bit I was scared about is here where we used to assert a schema existed, but since we return a default we should be fine to keep
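(Roughly the pattern being described, with hypothetical names rather than persist's actual API: fall back to a default schema instead of asserting that one was registered for the shard.)

// Illustrative only.
#[derive(Clone, Debug)]
struct Schema(String);

fn schema_for_shard(registered: Option<Schema>) -> Schema {
    // Old behavior was roughly `registered.expect("schema must be registered")`,
    // which would panic for shards whose old registrations were forgotten.
    registered.unwrap_or_else(|| Schema("default".to_string()))
}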
Do you think it needs to be a full environment restart, or is the restart of
Note that this will regress the performance of some customer dataflows. Probably not by much, but it's hard to be certain. I'll go through at least the slt plan changes once they are in the diff to see whether there are any big regressions there. If there are no big regressions there, that should give us some confidence that there also won't be big regressions at customers.
I've run the slt rewrite locally (commit, feel free to cherry-pick to this PR), and unfortunately there are some blocker plan regressions:
- (There are many minor regressions where a new Filter ... IS NOT NULL appears. This is just a minor CPU regression, so it's kind of ok.)
- (There are some minor regressions for variadic outer joins, where a union with a collection that just has a constant null row appears. This is ok.)
- There is a somewhat significant regression in ldbc query 04 (ldbc_bi.slt), where somehow a new join input appears, with a new ArrangeBy and a new Distinct.
- There is a catastrophic regression in ldbc query 07, where a new cross join appears. (I classify a plan regression as catastrophic when it would cause a hard-to-resolve incident, e.g. a user would need to scale up multiple replica sizes, or maybe even no amount of scaling up would help. When a new cross join appears, often no amount of scaling up helps, because cross joins are not scalable: they are 100% skewed to one CPU core.)
I'll investigate why exactly these regressions are happening, and try to think about whether there is some optimizer tweak to prevent them, but this might be hard. But before that, could you please explain a bit why this change is necessary? Why is it a problem for Persist if the nullability of an MV column changes across a version upgrade?
If we really need to make this change, then one thing we could do is run an EXPLAIN on all customer plans and check for significant or catastrophic plan regressions. If we are lucky, maybe there would be only acceptable regressions at customers. But this would be a tedious procedure, so it would be best to just not do this change if we can avoid it somehow.
Edit: The regression in ldbc 07 could be prevented by this optimization: https://github.com/MaterializeInc/database-issues/issues/1312#issuecomment-2368940327
Adapter parts LGTM, I'll wait for others to officially sign off
anyhow::bail!(
    "found schema mismatch for {}\ncatalog: {:?}\npersist: {:?}",
    gid,
    catalog_desc,
    persist_desc
);
Just checking, does this need to be redacted?
Converted to draft because we're going to take a different approach here.
This PR should allow us to use Persist's compare_and_evolve_schema. It does a few things:

1. Made the columns of Materialized Views nullable, unless they're marked with ASSERT NOT NULL. This fixes a bug we saw previously where, across an upgrade of Materialize, columns could change their nullability.
2. Renamed the schemas field in Persist State to deprecated_schemas and added a new schemas field. This should cause us to ignore all of the existing schemas which have incorrect nullability and register new ones.
3. Added a proptest to make sure these new nullable schemas can still read any structured non-nullable data that has already been written.

Notably, what this PR doesn't do (but we talked about previously) is also make all of the columns in a Table nullable, because AFAICT that is already the existing behavior. In planning we mark all of the columns nullable, unless they're annotated with NOT NULL or PRIMARY KEY, which seems like exactly our desired end state.

Rollout: Because we're essentially adding a new field, I believe we'll need to turn both persist_schema_register and persist_schema_require off. When 0.119.0 finishes rolling out we can turn persist_schema_register on, and then in 0.120.0 we're guaranteed to have all schemas registered for all shards again, at which point we can turn persist_schema_require back on. I might be misunderstanding the flow of how schemas get registered though, so @danhhz I would appreciate your thoughts here.

Update: Chatted with Dan and we shouldn't need to touch persist_schema_require.
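(For illustration, a minimal Rust sketch of the nullability rule in item 1. The types are simplified stand-ins, not Materialize's actual RelationDesc/ColumnType API; it only shows "every MV column becomes nullable unless it is covered by ASSERT NOT NULL".)

// Illustrative only; ColumnType here is a stand-in for mz_repr's type.
#[derive(Clone, Debug)]
struct ColumnType {
    nullable: bool,
}

fn relax_mv_nullability(columns: &mut [(String, ColumnType)], assert_not_null: &[&str]) {
    for (name, ty) in columns.iter_mut() {
        if !assert_not_null.contains(&name.as_str()) {
            // Register the relaxed schema with Persist up front, so a later plan
            // change can't flip the column's nullability out from under it.
            ty.nullable = true;
        }
    }
}

// E.g. relax_mv_nullability(&mut cols, &["col_with_assert_not_null"]) leaves that
// one column non-nullable and marks everything else nullable.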
Motivation
Fixes https://github.com/MaterializeInc/incidents-and-escalations/issues/117
Tips for reviewers
Everything is separated by commit, so you should be able to review the changes independently.
For commit 2, where I added the 'deprecated_schemas' field, I would appreciate the most thorough review. I essentially just added the new field where rustc told me to, and I think it's all correct, but I'm not totally positive.

Checklist
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.