Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adapter/persist: Make columns of Materialized Views nullable #29779

Closed

Conversation

ParkMyCar
Copy link
Member

@ParkMyCar ParkMyCar commented Sep 30, 2024

This PR should allow us to use Persist's compare_and_evolve_schema, it does a few things:

  1. Makes all columns of a Materialized View nullable, except those annotated with an ASSERT NOT NULL. This fixes a bug we saw previously where across an upgrade of Materialize columns could change their nullability.
  2. Renamed the existing schemas field in Persist State to deprecated_schemas and added a new schemas field. This should cause us to ignore all of the existing schemas which have incorrect nullability and register new ones.
  3. Updated the Catalog upgrade-check to now also check that the newly parsed items have the same schemas that we have in Persist. This should allow us to catch issues like https://github.com/MaterializeInc/incidents-and-escalations/issues/117 before the new version rolls out.
  4. Added a proptest to make sure these new nullable schemas can still read any structured non-nullable data that has already been written.
  5. Re-added the 'persist_schema_register' and 'persist_schema_require' dyncfgs since we're essentially re-starting the process.

Notably what this PR doesn't do but we talked about previously is also making all of the columns in a Table nullable, because AFAICT that is already the existing behavior. In planning we mark all of the columns nullable, unless they're annotated with NOT NULL or PRIMARY KEY, which seems like exactly our desired end state.

Rollout: Because we're essentially adding a new field I believe we'll need to turn both persist_schema_register and persist_schema_require off. When 0.119.0 finishes rolling out we can turn persist_schema_register on, and then in 0.120.0 we're guaranteed to have all schemas registered for all shards again, at which point we can turn persist_schema_require back on. I might be misunderstanding the flow of how schemas get registered though, so @danhhz I would appreciate your thoughts here.
Update: Chatted with Dan an we shouldn't need to touch persist_schema_require.

Motivation

Fixes https://github.com/MaterializeInc/incidents-and-escalations/issues/117

Tips for reviewers

Everything is separated by commit, so you should be able to review the changes independently.

For commit 2 where I added the 'deprecated_schemas' field, I would appreciate the most thorough review here. I essentially just added the new field where rustc told me to, and I think it's all correct but I'm not totally positive.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@ParkMyCar ParkMyCar requested a review from def- September 30, 2024 02:03
@ParkMyCar ParkMyCar marked this pull request as ready for review September 30, 2024 02:04
@ParkMyCar ParkMyCar requested review from a team as code owners September 30, 2024 02:04
@ParkMyCar ParkMyCar requested review from jkosh44 and removed request for def- September 30, 2024 02:04
@ParkMyCar ParkMyCar marked this pull request as draft September 30, 2024 02:05
@ParkMyCar ParkMyCar changed the title adapter/persist: Make columns of Materialized Views nullable [WIP] adapter/persist: Make columns of Materialized Views nullable Sep 30, 2024
// MIGRATION: We previously registered schemas that have since changed
// nullability. We've made changes to stabilize the schemas and are just
// forgetting all of the old values.
reserved 18;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think we have to do this also for StateDiff. And then once that's done, I worry about getting "state diff does not apply cleanly errors" if we just forget the old fields. It's counter-intuitive I know, but I do think this is less risky if we rename all the old fields, but keep all the diff apply logic for them

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense! I figured just bumping this field probably wasn't enough :)

I think then we'll also want to re-introduce the persist_schema_register and persist_schema_require dyncfgs? We'll at least need persist_schema_require because shards will no longer have a schema registered to them, and I think we need persist_schema_register for the typical "adding a new field to State" workflow?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, probably safest to add them both back. Looks like they still exist in LD. I'd probably just pull the old definitions out of git history and then not even bother rolling them out slowly, just default to on

@bkirwi
Copy link
Contributor

bkirwi commented Sep 30, 2024

Heads up that this is different than what I'd expected based on our conversations last week - in my memory we'd decided that tables etc. (and nested records) would also have columns marked nullable, since marking columns non-nullable provides no value but prevents us from ever making that column nullable in the future.

(Totally possible I misunderstood or you just have an intermediate state here, but figured I'd better mention just in case.)

@ParkMyCar ParkMyCar changed the title [WIP] adapter/persist: Make columns of Materialized Views nullable adapter/persist: Make columns of Materialized Views nullable Sep 30, 2024
@ParkMyCar ParkMyCar marked this pull request as ready for review September 30, 2024 15:32
@ParkMyCar ParkMyCar requested review from aljoscha and a team as code owners September 30, 2024 15:32
@ParkMyCar
Copy link
Member Author

Heads up that this is different than what I'd expected based on our conversations last week - in my memory we'd decided that tables etc. (and nested records) would also have columns marked nullable, since marking columns non-nullable provides no value but prevents us from ever making that column nullable in the future.

(Totally possible I misunderstood or you just have an intermediate state here, but figured I'd better mention just in case.)

It's a great callout! Just updated the description with my thinking here, but basically plan_create_table already seems to have our ideal behavior, everything is nullable except columns marked NOT NULL or PRIMARY KEY.

@danhhz
Copy link
Contributor

danhhz commented Sep 30, 2024

Rollout: Because we're essentially adding a new field I believe we'll need to turn both persist_schema_register and persist_schema_require off. When 0.119.0 finishes rolling out we can turn persist_schema_register on, and then in 0.120.0 we're guaranteed to have all schemas registered for all shards again, at which point we can turn persist_schema_require back on. I might be misunderstanding the flow of how schemas get registered though, so @danhhz I would appreciate your thoughts here.

Ah right 🤦! Yeah, we have to default to persist_schema_register off for the initial rollout because of persist forward compatibility. persist_schema_require is inside the check for register, so I think that one can default to on?

Ideally, we'd have instructions for the release eng to flip the flag and restart all the envds during the maintenance window, so we don't have to wait a week to verify that everything works okay

@ParkMyCar
Copy link
Member Author

Ah right 🤦! Yeah, we have to default to persist_schema_register off for the initial rollout because of persist forward compatibility. persist_schema_require is inside the check for register, so I think that one can default to on?

You're totally right! The bit I was scared about is here where we used to assert a schema existed, but since we return a default we should be fine to keep persist_schema_require turned to on.

Ideally, we'd have instructions for the release eng to flip the flag and restart all the envds during the maintenance window, so we don't have to wait a week to verify that everything works okay

Do you think it needs to be a full environment restart, or is the restart of environmentd that happens are part of 0dt enough? Wondering if we can thread the needle here a bit. If not I'm okay with just turning on persist_schema_register after the rollout has completed and just going through the normal flow

@ggevay ggevay self-requested a review October 1, 2024 10:47
@ggevay
Copy link
Contributor

ggevay commented Oct 1, 2024

Note that this will regress the performance of some customer dataflows. Probably not by much, but it's hard to be certain. I'll go through at least the slt plan changes after they are in the diff to see whether there are any big regressions there. If no big regressions there, then this should give us some confidence that there also won't be big regressions at customers.

Copy link
Contributor

@ggevay ggevay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've run the slt rewrite locally (commit, feel free to cherry-pick to this PR), and unfortunately there are some blocker plan regressions:

  • (There are many minor regressions where a new Filter ... IS NOT NULL appears. This is just a minor CPU regression, so it's kind of ok.)
  • (There are some minor regressions for variadic outer joins, where a union with a collection that just has a constant null row appears. This is ok.)
  • There is a somewhat significant regression in ldbc query 04 (ldbc_bi.slt), where somehow a new join input appears, with a new ArrangeBy and a new Distinct.
  • There is a catastrophic regression in ldbc query 07, where a new cross join appears. (I classify plan regressions as catastrophic where it would cause a hard to resolve incident, e.g. a user would need to scale up multiple replica sizes, or maybe even no amount of scaling up would help. When a new cross join appears, then often no amount of scaling up helps, because cross joins are not scalable, because they are 100% skewed to one CPU core.)

I'll investigate why exactly these regressions are happening, and try to think whether there is some optimizer tweak to prevent them, but this might be hard. But before that, could you please explain it a bit why this change is necessary? Why is it a problem for Persist if the nullability of an MV column changes across a version upgrade?

If we really need to do this change, then one thing we could do is run an EXPLAIN on all customer plans, and check for significant or catastrophic plan regressions. If we are lucky, maybe there would be only acceptable regressions at customers. But this would be a tedious procedure, so it would be good to just not do this change if we could avoid it somehow.

Edit: The regression in ldbc 07 could be prevented by this optimization: https://github.com/MaterializeInc/database-issues/issues/1312#issuecomment-2368940327

Copy link
Contributor

@jkosh44 jkosh44 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adapter parts LGTM, I'll wait for others to officially sign off

Comment on lines +605 to +610
anyhow::bail!(
"found schema mismatch for {}\ncatalog: {:?}\npersist: {:?}",
gid,
catalog_desc,
persist_desc
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking, does this need to be redacted?

@ParkMyCar ParkMyCar marked this pull request as draft October 3, 2024 20:09
@ParkMyCar
Copy link
Member Author

Converted to draft because we're going to take a different approach here

@ParkMyCar ParkMyCar closed this Oct 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants