From eb71f7ce5474cee6d7124e5d2faf34857fb23609 Mon Sep 17 00:00:00 2001 From: Cara Haas Date: Mon, 2 Oct 2023 17:45:49 -0400 Subject: [PATCH 1/5] first draft of cluster ux long term vision document --- .../design/20231002_cluster_vision.md | 145 ++++++++++++++++++ 1 file changed, 145 insertions(+) create mode 100644 doc/developer/design/20231002_cluster_vision.md diff --git a/doc/developer/design/20231002_cluster_vision.md b/doc/developer/design/20231002_cluster_vision.md new file mode 100644 index 000000000000..4987b1cdd5cb --- /dev/null +++ b/doc/developer/design/20231002_cluster_vision.md @@ -0,0 +1,145 @@ +# Cluster UX Long Term Vision + +- Associated: [Epic](https://github.com/MaterializeInc/materialize/issues/22120) + + + +## The Problem +We need a documented vision for the cluster UX in the long term which covers both +the "end state" goal as well as the short and medium states in order to: +* Make product prioritization decisions around cluster work +* Communicate to customers what to expect around cluster management +* Set expectations for other projects on how they should be interacting with clusters + +Epic: https://github.com/MaterializeInc/materialize/issues/22120 + +## Success Criteria +Primarily, a merged design doc that is reviewed and approved by EPD leadership, +and is socialized to GTM. + +Secondarily, a roadmap for cluster work for the next quarter. + +Qualitatively, positive feedback from EPD leadership and GTM folks that they +have clarity [TODO(chaas) define this more explicitly]. + +## Out of Scope +Designing the actual cluster API changes themselves, or proposing implementation details. + +## Solution Proposal +The objectives we are striving for with the cluster UX: +* Easy to use and manage +* Maximize resource efficiency/minimize unused resource cost +* Enable fault tolerance/use-case isolation + +### Declarative vs Imperative +We should move toward a declarative API for managing clusters, where: + +Declarative is like `CREATE CLUSTER` with managed replicas and \ +Imperative is like `CREATE/DROP CLUSTER REPLICA`. + +This means deprecating manual cluster replica management. \ +We believe this is easier to use and manage. + +The primary work item for this is **graceful rehydration**. At the moment, a change in size causes downtime until the new replicas are hydrated. As such, customers still want the flexibility to create their own replicas for graceful resizing. We can avoid this by leaving a subset of the original replicas around until the new replicas are hydrated. \ +This requires us to 1) detect when hydration is complete and 2) trigger database object changes based on this event (without/based on an earlier DDL statement). + +Another consideration is internal use-cases, such as unbilled replicas. We may want to keep around an imperitive API for internal (support) use only. + +To be determined: whether replica sets fits into this model, either externally exposed or internal-only. Perhaps they are a way we could recover clusters with heterogeneous replicas while retaining a declarative API. + +### Resource usage +The very long-term goal is clusterless Materialize, where Materialize does automatic workload scheduling for the customer. + +An intermediary solution, which is also far off is autoscaling of clusters, where Materialize automatically resizes clusters based on the observed workload. + +A more achievable offering in the short-term is automatic shutdown of clusters, where Materialize can spin down a cluster to 0 replicas based on certain criteria, such as a scheduled time or amount of idle time. \ +This would reduce resource waste for development clusters. The triggering mechanism from graceful rehydration is also a requirement here. + +### Data model +We should move toward prescriptive guidance on how users should configure their clusters with respect to databases and schemas, \ +e.g. should clusters typically be scoped to a single schema. + +We should also be more prescriptive about what data should be colocated, \ +e.g. when should the user create a new cluster for their new sources/MVs/indexes versus increase the size of their existing cluster. + +We believe this will make it clearer how to achieve appropriate fault tolerance and maxmimize resource efficiency. + +### Support & testing +Support is able to create create unbilled or partially billed cluster resources for resolving customer issues. This is soon to be possible via unbilled replicas [#20317](https://github.com/MaterializeInc/materialize/issues/20317). + +Engineering is also able to create additional unbilled shadow replicas for testing new features and query plan changes, which do not serve customers' production workflows. + +### Roadmap +**Now** +* @antiguru to complete `ALTER...SET CLUSTER` [#20841](https://github.com/MaterializeInc/materialize/issues/20841), without graceful rehydration. +* @antiguru to continue in-flight work on multipurpose clusters [#17413](https://github.com/MaterializeInc/materialize/issues/17413) - TODO(@antiguru): fill in details. +* @ggnall to do discovery on the prescriptive data model as part of Blue/Green deployments project [#19748](https://github.com/MaterializeInc/materialize/issues/19748) + +**Next** +* Graceful rehydration, to support graceful manual execution of `ALTER...SET CLUSTER` and `ALTER...SET SIZE`. +* Deprecate `CREATE/DROP CLUSTER REPLICA` for users. + +**Later** +* Auto-shutdown of clusters. +* Shadow replicas. + +**Much Later** +* Autoscaling clusters / clusterless. + +## Minimal Viable Prototype + + + +## Alternatives + + + +## Open questions + + From ebf29dcbfd8fc6bf93d945787ab6d0ab2de76ab7 Mon Sep 17 00:00:00 2001 From: Cara Haas Date: Tue, 3 Oct 2023 21:34:36 -0400 Subject: [PATCH 2/5] revisions based on review feedback from moritz + chat with jessica --- .../design/20231002_cluster_vision.md | 22 +++++-------------- 1 file changed, 5 insertions(+), 17 deletions(-) diff --git a/doc/developer/design/20231002_cluster_vision.md b/doc/developer/design/20231002_cluster_vision.md index 4987b1cdd5cb..b0196198209e 100644 --- a/doc/developer/design/20231002_cluster_vision.md +++ b/doc/developer/design/20231002_cluster_vision.md @@ -2,23 +2,13 @@ - Associated: [Epic](https://github.com/MaterializeInc/materialize/issues/22120) - - ## The Problem We need a documented vision for the cluster UX in the long term which covers both the "end state" goal as well as the short and medium states in order to: +* Ensure alignment in the future that we are working toward * Make product prioritization decisions around cluster work -* Communicate to customers what to expect around cluster management -* Set expectations for other projects on how they should be interacting with clusters +* Make folks more comfortable accepting intermediate states that aren't ideal in service of a greater goal +* Come up with a narrative for customers on what to expect around cluster management Epic: https://github.com/MaterializeInc/materialize/issues/22120 @@ -29,7 +19,7 @@ and is socialized to GTM. Secondarily, a roadmap for cluster work for the next quarter. Qualitatively, positive feedback from EPD leadership and GTM folks that they -have clarity [TODO(chaas) define this more explicitly]. +have clarity on the vision and roadmap, and the reasoning behind those decisions. ## Out of Scope Designing the actual cluster API changes themselves, or proposing implementation details. @@ -49,7 +39,7 @@ Imperative is like `CREATE/DROP CLUSTER REPLICA`. This means deprecating manual cluster replica management. \ We believe this is easier to use and manage. -The primary work item for this is **graceful rehydration**. At the moment, a change in size causes downtime until the new replicas are hydrated. As such, customers still want the flexibility to create their own replicas for graceful resizing. We can avoid this by leaving a subset of the original replicas around until the new replicas are hydrated. \ +The primary work item for this is **graceful reconfiguration**. At the moment, a change in size causes downtime until the new replicas are hydrated. As such, customers still want the flexibility to create their own replicas for graceful resizing. We can avoid this by leaving a subset of the original replicas around until the new replicas are hydrated. \ This requires us to 1) detect when hydration is complete and 2) trigger database object changes based on this event (without/based on an earlier DDL statement). Another consideration is internal use-cases, such as unbilled replicas. We may want to keep around an imperitive API for internal (support) use only. @@ -76,8 +66,6 @@ We believe this will make it clearer how to achieve appropriate fault tolerance ### Support & testing Support is able to create create unbilled or partially billed cluster resources for resolving customer issues. This is soon to be possible via unbilled replicas [#20317](https://github.com/MaterializeInc/materialize/issues/20317). -Engineering is also able to create additional unbilled shadow replicas for testing new features and query plan changes, which do not serve customers' production workflows. - ### Roadmap **Now** * @antiguru to complete `ALTER...SET CLUSTER` [#20841](https://github.com/MaterializeInc/materialize/issues/20841), without graceful rehydration. From ebad11ea4dd8a97e9f53cbc8f21e95bb7f4ab7cd Mon Sep 17 00:00:00 2001 From: Cara Haas Date: Wed, 11 Oct 2023 11:09:24 -0400 Subject: [PATCH 3/5] address review feedback, add distinction btwn development and production workflows --- .../design/20231002_cluster_vision.md | 46 +++++++++++++++++-- 1 file changed, 43 insertions(+), 3 deletions(-) diff --git a/doc/developer/design/20231002_cluster_vision.md b/doc/developer/design/20231002_cluster_vision.md index b0196198209e..3837bd375705 100644 --- a/doc/developer/design/20231002_cluster_vision.md +++ b/doc/developer/design/20231002_cluster_vision.md @@ -1,6 +1,8 @@ # Cluster UX Long Term Vision -- Associated: [Epic](https://github.com/MaterializeInc/materialize/issues/22120) +Associated: [Epic](https://github.com/MaterializeInc/materialize/issues/22120) + +Authors: @chaas, @antiguru, @benesch ## The Problem We need a documented vision for the cluster UX in the long term which covers both @@ -39,6 +41,38 @@ Imperative is like `CREATE/DROP CLUSTER REPLICA`. This means deprecating manual cluster replica management. \ We believe this is easier to use and manage. +We can classify actions that users take in managing clusters into two categories: +_development workflows_ and _production workflow_. + +#### Development workflows +In a development workflow, the underlying set of objects being configured are not being used yet +in a production system, and the user is rapidly changing things.\ +In this workflow, downtime is acceptable.\ +A command like `ALTER ... SET CLUSTER` (moving an object between clusters) would fall +under this category. + +For development workflows, since downtime is acceptable, the primary work items is to +**expose rehydration status**.\ +Users need an easy way to detect that rehydration is complete and they can resume querying against +the object. + +#### Production workflows +In a production workflow, the underlying set of objects are actively depended on by a production +system.\ +In this workflow, downtime is not acceptable.\ +A command like `ALTER CLUSTER ... SET (SIZE = <>)` (resizing a cluster) would fall under this +category. + +If a user wants to do a development workflow on a production system, they must use **blue/green +deployments**. For example, if the user wants to move an object between clusters, they must use +blue/green to set up another version of the object/cluster and cutover the production system +to it once the object is rehydrated and ready.\ +Again, for this workflow, exposing hydration status is the primary work item. + +For production workflows, like resizing an active cluster, blue/green is an acceptable intermediate +solution, but is an overkill amount of work for such a simple action. + +In an ideal state, we could provide a simple declarative interface for seamlessly resizing.\ The primary work item for this is **graceful reconfiguration**. At the moment, a change in size causes downtime until the new replicas are hydrated. As such, customers still want the flexibility to create their own replicas for graceful resizing. We can avoid this by leaving a subset of the original replicas around until the new replicas are hydrated. \ This requires us to 1) detect when hydration is complete and 2) trigger database object changes based on this event (without/based on an earlier DDL statement). @@ -66,22 +100,28 @@ We believe this will make it clearer how to achieve appropriate fault tolerance ### Support & testing Support is able to create create unbilled or partially billed cluster resources for resolving customer issues. This is soon to be possible via unbilled replicas [#20317](https://github.com/MaterializeInc/materialize/issues/20317). +Engineering may also want the ability to create unbilled shadow replicas for testing new features and +query plan changes, which do not serve customers' production workflows, if they can be made safe. + ### Roadmap **Now** * @antiguru to complete `ALTER...SET CLUSTER` [#20841](https://github.com/MaterializeInc/materialize/issues/20841), without graceful rehydration. * @antiguru to continue in-flight work on multipurpose clusters [#17413](https://github.com/MaterializeInc/materialize/issues/17413) - TODO(@antiguru): fill in details. * @ggnall to do discovery on the prescriptive data model as part of Blue/Green deployments project [#19748](https://github.com/MaterializeInc/materialize/issues/19748) +* Expose rehydration status [#22166](https://github.com/MaterializeInc/materialize/issues/22166) **Next** -* Graceful rehydration, to support graceful manual execution of `ALTER...SET CLUSTER` and `ALTER...SET SIZE`. +* Graceful reconfiguration, to support graceful manual execution of `ALTER...SET CLUSTER` and +`ALTER...SET SIZE`. * Deprecate `CREATE/DROP CLUSTER REPLICA` for users. **Later** * Auto-shutdown of clusters. * Shadow replicas. +* Autoscaling clusters. **Much Later** -* Autoscaling clusters / clusterless. +* Clusterless. ## Minimal Viable Prototype From fb322ba46df62a6d61abde1af0862743ab136998 Mon Sep 17 00:00:00 2001 From: Cara Haas Date: Mon, 23 Oct 2023 12:10:25 -0400 Subject: [PATCH 4/5] remove irrelevant sections --- .../design/20231002_cluster_vision.md | 49 ------------------- 1 file changed, 49 deletions(-) diff --git a/doc/developer/design/20231002_cluster_vision.md b/doc/developer/design/20231002_cluster_vision.md index 3837bd375705..849dbcfd9b67 100644 --- a/doc/developer/design/20231002_cluster_vision.md +++ b/doc/developer/design/20231002_cluster_vision.md @@ -122,52 +122,3 @@ query plan changes, which do not serve customers' production workflows, if they **Much Later** * Clusterless. - -## Minimal Viable Prototype - - - -## Alternatives - - - -## Open questions - - From 3b80531e29ffceea663bc1b357597e275546804e Mon Sep 17 00:00:00 2001 From: Cara Haas Date: Mon, 23 Oct 2023 16:48:11 -0400 Subject: [PATCH 5/5] update roadmap NOW to reflect @antiguru 's updates --- doc/developer/design/20231002_cluster_vision.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/developer/design/20231002_cluster_vision.md b/doc/developer/design/20231002_cluster_vision.md index 849dbcfd9b67..e1324aa432b2 100644 --- a/doc/developer/design/20231002_cluster_vision.md +++ b/doc/developer/design/20231002_cluster_vision.md @@ -105,8 +105,8 @@ query plan changes, which do not serve customers' production workflows, if they ### Roadmap **Now** -* @antiguru to complete `ALTER...SET CLUSTER` [#20841](https://github.com/MaterializeInc/materialize/issues/20841), without graceful rehydration. -* @antiguru to continue in-flight work on multipurpose clusters [#17413](https://github.com/MaterializeInc/materialize/issues/17413) - TODO(@antiguru): fill in details. +* @antiguru to work on `ALTER...SET CLUSTER` [#17417](https://github.com/MaterializeInc/materialize/issues/17417), without graceful rehydration. +* @antiguru to continue in-flight work on multipurpose clusters [#17413](https://github.com/MaterializeInc/materialize/issues/17413), which is co-locating compute and storage objects [PR #21846](https://github.com/MaterializeInc/materialize/pull/21846). * @ggnall to do discovery on the prescriptive data model as part of Blue/Green deployments project [#19748](https://github.com/MaterializeInc/materialize/issues/19748) * Expose rehydration status [#22166](https://github.com/MaterializeInc/materialize/issues/22166)