adapter/persist: Make columns of Materialized Views nullable #29779
Conversation
// MIGRATION: We previously registered schemas that have since changed
// nullability. We've made changes to stabilize the schemas and are just
// forgetting all of the old values.
reserved 18;
Hmm, I think we have to do this also for StateDiff. And then once that's done, I worry about getting "state diff does not apply cleanly" errors if we just forget the old fields. It's counter-intuitive, I know, but I do think this is less risky if we rename all the old fields but keep all the diff-apply logic for them.
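(A toy sketch of that shape, with made-up types rather than persist's real State/StateDiff: the old field is renamed but keeps its slot and its diff-apply arm, so diffs written by older versions still apply cleanly.)

use std::collections::BTreeMap;

// Illustrative only; names and types are stand-ins for persist's State/StateDiff.
#[derive(Default)]
struct State {
    // New registrations land here.
    schemas: BTreeMap<u64, String>,
    // Renamed rather than forgotten, so old diffs still have a field to apply to.
    deprecated_schemas: BTreeMap<u64, String>,
}

enum StateFieldDiff {
    InsertSchema(u64, String),
    // Emitted by older versions; we keep handling it instead of dropping it.
    InsertDeprecatedSchema(u64, String),
}

fn apply(state: &mut State, diff: StateFieldDiff) {
    match diff {
        StateFieldDiff::InsertSchema(id, schema) => {
            state.schemas.insert(id, schema);
        }
        StateFieldDiff::InsertDeprecatedSchema(id, schema) => {
            state.deprecated_schemas.insert(id, schema);
        }
    }
}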
That makes sense! I figured just bumping this field probably wasn't enough :)
I think then we'll also want to re-introduce the persist_schema_register and persist_schema_require dyncfgs? We'll at least need persist_schema_require because shards will no longer have a schema registered to them, and I think we need persist_schema_register for the typical "adding a new field to State" workflow?
Yeah, probably safest to add them both back. Looks like they still exist in LD. I'd probably just pull the old definitions out of git history and then not even bother rolling them out slowly, just default to on
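(For concreteness, a rough sketch of what re-adding the two dyncfgs with an on-by-default value might look like, assuming mz_dyncfg's Config::new(name, default, description) pattern; the descriptions below are made up, and the real definitions would be pulled from git history as suggested above.)

use mz_dyncfg::Config;

// Hypothetical re-introduction of the two flags, defaulted to on rather than
// rolled out slowly. Names match the old dyncfgs; descriptions are illustrative.
pub const PERSIST_SCHEMA_REGISTER: Config<bool> = Config::new(
    "persist_schema_register",
    true,
    "Whether to register a schema when creating a write handle.",
);

pub const PERSIST_SCHEMA_REQUIRE: Config<bool> = Config::new(
    "persist_schema_require",
    true,
    "Whether a registered schema is required to write structured data.",
);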
Heads up that this is different from what I'd expected based on our conversations last week - in my memory we'd decided that tables etc. (and nested records) would also have columns marked nullable, since marking columns non-nullable provides no value but prevents us from ever making that column nullable in the future. (Totally possible I misunderstood or you just have an intermediate state here, but figured I'd better mention it just in case.)
This reverts commit 9b25df1.
It's a great callout! Just updated the description with my thinking here, but basically
Ah right 🤦! Yeah, we have to default to Ideally, we'd have instructions for the release eng to flip the flag and restart all the envds during the maintenance window, so we don't have to wait a week to verify that everything works okay.
You're totally right! The bit I was scared about is here where we used to assert a schema existed, but since we return a default we should be fine to keep
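(Roughly the pattern being described, with hypothetical names rather than persist's actual API: fall back to a default schema instead of asserting that one was registered for the shard.)

// Illustrative only.
#[derive(Clone, Debug)]
struct Schema(String);

fn schema_for_shard(registered: Option<Schema>) -> Schema {
    // Old behavior was roughly `registered.expect("schema must be registered")`,
    // which would panic for shards whose old registrations were forgotten.
    registered.unwrap_or_else(|| Schema("default".to_string()))
}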
Do you think it needs to be a full environment restart, or is the restart of
Note that this will regress the performance of some customer dataflows. Probably not by much, but it's hard to be certain. I'll go through at least the slt plan changes once they are in the diff to see whether there are any big regressions there. If there are no big regressions there, that should give us some confidence that there also won't be big regressions at customers.
I've run the slt rewrite locally (commit, feel free to cherry-pick to this PR), and unfortunately there are some blocker plan regressions:
- (There are many minor regressions where a new Filter ... IS NOT NULL appears. This is just a minor CPU regression, so it's kind of ok.)
- (There are some minor regressions for variadic outer joins, where a union with a collection that just has a constant null row appears. This is ok.)
- There is a somewhat significant regression in ldbc query 04 (ldbc_bi.slt), where somehow a new join input appears, with a new ArrangeBy and a new Distinct.
- There is a catastrophic regression in ldbc query 07, where a new cross join appears. (I classify a plan regression as catastrophic when it would cause a hard-to-resolve incident, e.g. a user would need to scale up multiple replica sizes, or maybe even no amount of scaling up would help. When a new cross join appears, often no amount of scaling up helps, because cross joins are not scalable: they are 100% skewed to one CPU core.)
I'll investigate why exactly these regressions are happening, and try to think about whether there is some optimizer tweak to prevent them, but this might be hard. But before that, could you please explain a bit why this change is necessary? Why is it a problem for Persist if the nullability of an MV column changes across a version upgrade?
If we really need to make this change, then one thing we could do is run an EXPLAIN on all customer plans and check for significant or catastrophic plan regressions. If we are lucky, maybe there would be only acceptable regressions at customers. But this would be a tedious procedure, so it would be best to just not do this change if we can avoid it somehow.
Edit: The regression in ldbc 07 could be prevented by this optimization: https://github.com/MaterializeInc/database-issues/issues/1312#issuecomment-2368940327
Adapter parts LGTM, I'll wait for others to officially sign off
anyhow::bail!(
    "found schema mismatch for {}\ncatalog: {:?}\npersist: {:?}",
    gid,
    catalog_desc,
    persist_desc
);
Just checking, does this need to be redacted?
Converted to draft because we're going to take a different approach here.
This PR should allow us to use Persist's compare_and_evolve_schema. It does a few things:

1. Made the columns of Materialized Views nullable, unless they're marked with ASSERT NOT NULL. This fixes a bug we saw previously where, across an upgrade of Materialize, columns could change their nullability.
2. Renamed the schemas field in Persist State to deprecated_schemas and added a new schemas field. This should cause us to ignore all of the existing schemas which have incorrect nullability and register new ones.
3. Added a proptest to make sure these new nullable schemas can still read any structured non-nullable data that has already been written.

Notably, what this PR doesn't do (but we talked about previously) is also make all of the columns in a Table nullable, because AFAICT that is already the existing behavior. In planning we mark all of the columns nullable, unless they're annotated with NOT NULL or PRIMARY KEY, which seems like exactly our desired end state.

Rollout: Because we're essentially adding a new field, I believe we'll need to turn both persist_schema_register and persist_schema_require off. When 0.119.0 finishes rolling out we can turn persist_schema_register on, and then in 0.120.0 we're guaranteed to have all schemas registered for all shards again, at which point we can turn persist_schema_require back on. I might be misunderstanding the flow of how schemas get registered though, so @danhhz I would appreciate your thoughts here.

Update: Chatted with Dan and we shouldn't need to touch persist_schema_require.
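(For illustration, a minimal Rust sketch of the nullability rule in item 1. The types are simplified stand-ins, not Materialize's actual RelationDesc/ColumnType API; it only shows "every MV column becomes nullable unless it is covered by ASSERT NOT NULL".)

// Illustrative only; ColumnType here is a stand-in for mz_repr's type.
#[derive(Clone, Debug)]
struct ColumnType {
    nullable: bool,
}

fn relax_mv_nullability(columns: &mut [(String, ColumnType)], assert_not_null: &[&str]) {
    for (name, ty) in columns.iter_mut() {
        if !assert_not_null.contains(&name.as_str()) {
            // Register the relaxed schema with Persist up front, so a later plan
            // change can't flip the column's nullability out from under it.
            ty.nullable = true;
        }
    }
}

// E.g. relax_mv_nullability(&mut cols, &["col_with_assert_not_null"]) leaves that
// one column non-nullable and marks everything else nullable.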
Motivation
Fixes https://github.com/MaterializeInc/incidents-and-escalations/issues/117
Tips for reviewers
Everything is separated by commit, so you should be able to review the changes independently.
For commit 2, where I added the 'deprecated_schemas' field, I would appreciate the most thorough review. I essentially just added the new field where rustc told me to, and I think it's all correct, but I'm not totally positive.

Checklist
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.