From eb71f7ce5474cee6d7124e5d2faf34857fb23609 Mon Sep 17 00:00:00 2001
From: Cara Haas <cara@materialize.com>
Date: Mon, 2 Oct 2023 17:45:49 -0400
Subject: [PATCH 1/5] first draft of cluster ux long term vision document

---
 .../design/20231002_cluster_vision.md         | 145 ++++++++++++++++++
 1 file changed, 145 insertions(+)
 create mode 100644 doc/developer/design/20231002_cluster_vision.md

diff --git a/doc/developer/design/20231002_cluster_vision.md b/doc/developer/design/20231002_cluster_vision.md
new file mode 100644
index 000000000000..4987b1cdd5cb
--- /dev/null
+++ b/doc/developer/design/20231002_cluster_vision.md
@@ -0,0 +1,145 @@
+# Cluster UX Long Term Vision
+
+- Associated: [Epic](https://github.com/MaterializeInc/materialize/issues/22120)
+
+<!--
+The goal of a design document is to thoroughly discover problems and
+examine potential solutions before moving into the delivery phase of
+a project. In order to be ready to share, a design document must address
+the questions in each of the following sections. Any additional content
+is at the discretion of the author.
+
+Note: Feel free to add or remove sections as needed. However, most design
+docs should at least keep the suggested sections.
+-->
+
+## The Problem
+We need a documented vision for the cluster UX in the long term which covers both
+the "end state" goal as well as the short and medium states in order to:
+* Make product prioritization decisions around cluster work
+* Communicate to customers what to expect around cluster management
+* Set expectations for other projects on how they should be interacting with clusters
+
+Epic: https://github.com/MaterializeInc/materialize/issues/22120
+
+## Success Criteria
+Primarily, a merged design doc that is reviewed and approved by EPD leadership,
+and is socialized to GTM.
+
+Secondarily, a roadmap for cluster work for the next quarter.
+
+Qualitatively, positive feedback from EPD leadership and GTM folks that they
+have clarity [TODO(chaas) define this more explicitly].
+
+## Out of Scope
+Designing the actual cluster API changes themselves, or proposing implementation details.
+
+## Solution Proposal
+The objectives we are striving for with the cluster UX:
+* Easy to use and manage
+* Maximize resource efficiency/minimize unused resource cost
+* Enable fault tolerance/use-case isolation
+
+### Declarative vs Imperative
+We should move toward a declarative API for managing clusters, where:
+
+Declarative is like `CREATE CLUSTER` with managed replicas and \
+Imperative is like `CREATE/DROP CLUSTER REPLICA`.
+
+This means deprecating manual cluster replica management. \
+We believe this is easier to use and manage.
+
+The primary work item for this is **graceful rehydration**. At the moment, a change in size causes downtime until the new replicas are hydrated. As such, customers still want the flexibility to create their own replicas for graceful resizing. We can avoid this by leaving a subset of the original replicas around until the new replicas are hydrated. \
+This requires us to 1) detect when hydration is complete and 2) trigger database object changes based on this event (without/based on an earlier DDL statement).
+
+Another consideration is internal use-cases, such as unbilled replicas. We may want to keep around an imperitive API for internal (support) use only.
+
+To be determined: whether replica sets fit into this model, either externally exposed or internal-only. Perhaps they are a way we could recover clusters with heterogeneous replicas while retaining a declarative API.
+
+### Resource usage
+The very long-term goal is clusterless Materialize, where Materialize does automatic workload scheduling for the customer.
+
+An intermediary solution, which is also far off, is autoscaling of clusters, where Materialize automatically resizes clusters based on the observed workload.
+
+A more achievable offering in the short term is automatic shutdown of clusters, where Materialize can spin down a cluster to 0 replicas based on certain criteria, such as a scheduled time or amount of idle time. \
+This would reduce resource waste for development clusters. The triggering mechanism from graceful rehydration is also a requirement here.
+
+### Data model
+We should move toward prescriptive guidance on how users should configure their clusters with respect to databases and schemas, \
+e.g. should clusters typically be scoped to a single schema.
+
+We should also be more prescriptive about what data should be colocated, \
+e.g. when should the user create a new cluster for their new sources/MVs/indexes versus increase the size of their existing cluster.
+
+We believe this will make it clearer how to achieve appropriate fault tolerance and maximize resource efficiency.
+
+### Support & testing
+Support is able to create create unbilled or partially billed cluster resources for resolving customer issues. This is soon to be possible via unbilled replicas [#20317](https://github.com/MaterializeInc/materialize/issues/20317).
+
+Engineering is also able to create additional unbilled shadow replicas for testing new features and query plan changes, which do not serve customers' production workflows.
+
+### Roadmap
+**Now**
+* @antiguru to complete `ALTER...SET CLUSTER` [#20841](https://github.com/MaterializeInc/materialize/issues/20841), without graceful rehydration.
+* @antiguru to continue in-flight work on multipurpose clusters [#17413](https://github.com/MaterializeInc/materialize/issues/17413) - TODO(@antiguru): fill in details.
+* @ggnall to do discovery on the prescriptive data model as part of Blue/Green deployments project [#19748](https://github.com/MaterializeInc/materialize/issues/19748)
+
+**Next**
+* Graceful rehydration, to support graceful manual execution of `ALTER...SET CLUSTER` and `ALTER...SET SIZE`.
+* Deprecate `CREATE/DROP CLUSTER REPLICA` for users.
+
+**Later**
+* Auto-shutdown of clusters.
+* Shadow replicas.
+
+**Much Later**
+* Autoscaling clusters / clusterless.
+
+## Minimal Viable Prototype
+
+<!--
+Build and share the minimal viable version of your project to validate the
+design, value, and user experience. Depending on the project, your prototype
+might look like:
+
+- A Figma wireframe, or fuller prototype
+- SQL syntax that isn't actually attached to anything on the backend
+- A hacky but working live demo of a solution running on your laptop or in a
+  staging environment
+
+The best prototypes will be validated by Materialize team members as well
+as prospects and customers. If you want help getting your prototype in front
+of external folks, reach out to the Product team in #product.
+
+This step is crucial for de-risking the design as early as possible and a
+prototype is required in most cases. In _some_ cases it can be beneficial to
+get eyes on the initial proposal without a prototype. If you think that
+there is a good reason for skpiping or delaying the prototype, please
+explicitly mention it in this section and provide details on why you you'd
+like to skip or delay it.
+-->
+
+## Alternatives
+
+<!--
+What other solutions were considered, and why weren't they chosen?
+
+This is your chance to demonstrate that you've fully discovered the problem.
+Alternative solutions can come from many places, like: you or your Materialize
+team members, our customers, our prospects, academic research, prior art, or
+competitive research. One of our company values is to "do the reading" and
+to "write things down." This is your opportunity to demonstrate both!
+-->
+
+## Open questions
+
+<!--
+What is left unaddressed by this design document that needs to be
+closed out?
+
+When a design document is authored and shared, there might still be
+open questions that need to be explored. Through the design document
+process, you are responsible for getting answers to these open
+questions. All open questions should be answered by the time a design
+document is merged.
+-->

From ebf29dcbfd8fc6bf93d945787ab6d0ab2de76ab7 Mon Sep 17 00:00:00 2001
From: Cara Haas <cara@materialize.com>
Date: Tue, 3 Oct 2023 21:34:36 -0400
Subject: [PATCH 2/5] revisions based on review feedback from moritz + chat
 with jessica

---
 .../design/20231002_cluster_vision.md         | 22 +++++--------------
 1 file changed, 5 insertions(+), 17 deletions(-)

diff --git a/doc/developer/design/20231002_cluster_vision.md b/doc/developer/design/20231002_cluster_vision.md
index 4987b1cdd5cb..b0196198209e 100644
--- a/doc/developer/design/20231002_cluster_vision.md
+++ b/doc/developer/design/20231002_cluster_vision.md
@@ -2,23 +2,13 @@
 
 - Associated: [Epic](https://github.com/MaterializeInc/materialize/issues/22120)
 
-<!--
-The goal of a design document is to thoroughly discover problems and
-examine potential solutions before moving into the delivery phase of
-a project. In order to be ready to share, a design document must address
-the questions in each of the following sections. Any additional content
-is at the discretion of the author.
-
-Note: Feel free to add or remove sections as needed. However, most design
-docs should at least keep the suggested sections.
--->
-
 ## The Problem
 We need a documented vision for the cluster UX in the long term which covers both
 the "end state" goal as well as the short and medium states in order to:
+* Ensure alignment on the future that we are working toward
 * Make product prioritization decisions around cluster work
-* Communicate to customers what to expect around cluster management
-* Set expectations for other projects on how they should be interacting with clusters
+* Make folks more comfortable accepting intermediate states that aren't ideal in service of a greater goal
+* Come up with a narrative for customers on what to expect around cluster management
 
 Epic: https://github.com/MaterializeInc/materialize/issues/22120
 
@@ -29,7 +19,7 @@ and is socialized to GTM.
 Secondarily, a roadmap for cluster work for the next quarter.
 
 Qualitatively, positive feedback from EPD leadership and GTM folks that they
-have clarity [TODO(chaas) define this more explicitly].
+have clarity on the vision and roadmap, and the reasoning behind those decisions.
 
 ## Out of Scope
 Designing the actual cluster API changes themselves, or proposing implementation details.
@@ -49,7 +39,7 @@ Imperative is like `CREATE/DROP CLUSTER REPLICA`.
 This means deprecating manual cluster replica management. \
 We believe this is easier to use and manage.
 
-The primary work item for this is **graceful rehydration**. At the moment, a change in size causes downtime until the new replicas are hydrated. As such, customers still want the flexibility to create their own replicas for graceful resizing. We can avoid this by leaving a subset of the original replicas around until the new replicas are hydrated. \
+The primary work item for this is **graceful reconfiguration**. At the moment, a change in size causes downtime until the new replicas are hydrated. As such, customers still want the flexibility to create their own replicas for graceful resizing. We can avoid this by leaving a subset of the original replicas around until the new replicas are hydrated. \
 This requires us to 1) detect when hydration is complete and 2) trigger database object changes based on this event (without/based on an earlier DDL statement).
 
 Another consideration is internal use-cases, such as unbilled replicas. We may want to keep around an imperitive API for internal (support) use only.
@@ -76,8 +66,6 @@ We believe this will make it clearer how to achieve appropriate fault tolerance
 ### Support & testing
 Support is able to create create unbilled or partially billed cluster resources for resolving customer issues. This is soon to be possible via unbilled replicas [#20317](https://github.com/MaterializeInc/materialize/issues/20317).
 
-Engineering is also able to create additional unbilled shadow replicas for testing new features and query plan changes, which do not serve customers' production workflows.
-
 ### Roadmap
 **Now**
 * @antiguru to complete `ALTER...SET CLUSTER` [#20841](https://github.com/MaterializeInc/materialize/issues/20841), without graceful rehydration.

From ebad11ea4dd8a97e9f53cbc8f21e95bb7f4ab7cd Mon Sep 17 00:00:00 2001
From: Cara Haas <cara@materialize.com>
Date: Wed, 11 Oct 2023 11:09:24 -0400
Subject: [PATCH 3/5] address review feedback, add distinction btwn development
 and production workflows

---
 .../design/20231002_cluster_vision.md         | 46 +++++++++++++++++--
 1 file changed, 43 insertions(+), 3 deletions(-)

diff --git a/doc/developer/design/20231002_cluster_vision.md b/doc/developer/design/20231002_cluster_vision.md
index b0196198209e..3837bd375705 100644
--- a/doc/developer/design/20231002_cluster_vision.md
+++ b/doc/developer/design/20231002_cluster_vision.md
@@ -1,6 +1,8 @@
 # Cluster UX Long Term Vision
 
-- Associated: [Epic](https://github.com/MaterializeInc/materialize/issues/22120)
+Associated: [Epic](https://github.com/MaterializeInc/materialize/issues/22120)
+
+Authors: @chaas, @antiguru, @benesch
 
 ## The Problem
 We need a documented vision for the cluster UX in the long term which covers both
@@ -39,6 +41,38 @@ Imperative is like `CREATE/DROP CLUSTER REPLICA`.
 This means deprecating manual cluster replica management. \
 We believe this is easier to use and manage.
 
+We can classify actions that users take in managing clusters into two categories:
+_development workflows_ and _production workflows_.
+
+#### Development workflows
+In a development workflow, the underlying set of objects being configured is not yet used
+in a production system, and the user is rapidly changing things.\
+In this workflow, downtime is acceptable.\
+A command like `ALTER <object> ... SET CLUSTER` (moving an object between clusters) would fall
+under this category.
+
+For development workflows, since downtime is acceptable, the primary work item is to
+**expose rehydration status**.\
+Users need an easy way to detect that rehydration is complete and they can resume querying against
+the object.
+
+#### Production workflows
+In a production workflow, the underlying set of objects is actively depended on by a production
+system.\
+In this workflow, downtime is not acceptable.\
+A command like `ALTER CLUSTER ... SET (SIZE = <>)` (resizing a cluster) would fall under this
+category.
+
+If a user wants to do a development workflow on a production system, they must use **blue/green
+deployments**. For example, if the user wants to move an object between clusters, they must use
+blue/green to set up another version of the object/cluster and cut over the production system
+to it once the object is rehydrated and ready.\
+Again, for this workflow, exposing hydration status is the primary work item.
+
+For production workflows, like resizing an active cluster, blue/green is an acceptable intermediate
+solution, but is a disproportionate amount of work for such a simple action.
+
+In an ideal state, we could provide a simple declarative interface for seamless resizing.\
 The primary work item for this is **graceful reconfiguration**. At the moment, a change in size causes downtime until the new replicas are hydrated. As such, customers still want the flexibility to create their own replicas for graceful resizing. We can avoid this by leaving a subset of the original replicas around until the new replicas are hydrated. \
 This requires us to 1) detect when hydration is complete and 2) trigger database object changes based on this event (without/based on an earlier DDL statement).
 
@@ -66,22 +100,28 @@ We believe this will make it clearer how to achieve appropriate fault tolerance
 ### Support & testing
 Support is able to create create unbilled or partially billed cluster resources for resolving customer issues. This is soon to be possible via unbilled replicas [#20317](https://github.com/MaterializeInc/materialize/issues/20317).
 
+Engineering may also want the ability to create unbilled shadow replicas for testing new features and
+query plan changes, which do not serve customers' production workflows, if they can be made safe.
+
 ### Roadmap
 **Now**
 * @antiguru to complete `ALTER...SET CLUSTER` [#20841](https://github.com/MaterializeInc/materialize/issues/20841), without graceful rehydration.
 * @antiguru to continue in-flight work on multipurpose clusters [#17413](https://github.com/MaterializeInc/materialize/issues/17413) - TODO(@antiguru): fill in details.
 * @ggnall to do discovery on the prescriptive data model as part of Blue/Green deployments project [#19748](https://github.com/MaterializeInc/materialize/issues/19748)
+* Expose rehydration status [#22166](https://github.com/MaterializeInc/materialize/issues/22166)
 
 **Next**
-* Graceful rehydration, to support graceful manual execution of `ALTER...SET CLUSTER` and `ALTER...SET SIZE`.
+* Graceful reconfiguration, to support graceful manual execution of `ALTER...SET CLUSTER` and
+`ALTER...SET SIZE`.
 * Deprecate `CREATE/DROP CLUSTER REPLICA` for users.
 
 **Later**
 * Auto-shutdown of clusters.
 * Shadow replicas.
+* Autoscaling clusters.
 
 **Much Later**
-* Autoscaling clusters / clusterless.
+* Clusterless.
 
 ## Minimal Viable Prototype
 

From fb322ba46df62a6d61abde1af0862743ab136998 Mon Sep 17 00:00:00 2001
From: Cara Haas <cara@materialize.com>
Date: Mon, 23 Oct 2023 12:10:25 -0400
Subject: [PATCH 4/5] remove irrelevant sections

---
 .../design/20231002_cluster_vision.md         | 49 -------------------
 1 file changed, 49 deletions(-)

diff --git a/doc/developer/design/20231002_cluster_vision.md b/doc/developer/design/20231002_cluster_vision.md
index 3837bd375705..849dbcfd9b67 100644
--- a/doc/developer/design/20231002_cluster_vision.md
+++ b/doc/developer/design/20231002_cluster_vision.md
@@ -122,52 +122,3 @@ query plan changes, which do not serve customers' production workflows, if they
 
 **Much Later**
 * Clusterless.
-
-## Minimal Viable Prototype
-
-<!--
-Build and share the minimal viable version of your project to validate the
-design, value, and user experience. Depending on the project, your prototype
-might look like:
-
-- A Figma wireframe, or fuller prototype
-- SQL syntax that isn't actually attached to anything on the backend
-- A hacky but working live demo of a solution running on your laptop or in a
-  staging environment
-
-The best prototypes will be validated by Materialize team members as well
-as prospects and customers. If you want help getting your prototype in front
-of external folks, reach out to the Product team in #product.
-
-This step is crucial for de-risking the design as early as possible and a
-prototype is required in most cases. In _some_ cases it can be beneficial to
-get eyes on the initial proposal without a prototype. If you think that
-there is a good reason for skpiping or delaying the prototype, please
-explicitly mention it in this section and provide details on why you you'd
-like to skip or delay it.
--->
-
-## Alternatives
-
-<!--
-What other solutions were considered, and why weren't they chosen?
-
-This is your chance to demonstrate that you've fully discovered the problem.
-Alternative solutions can come from many places, like: you or your Materialize
-team members, our customers, our prospects, academic research, prior art, or
-competitive research. One of our company values is to "do the reading" and
-to "write things down." This is your opportunity to demonstrate both!
--->
-
-## Open questions
-
-<!--
-What is left unaddressed by this design document that needs to be
-closed out?
-
-When a design document is authored and shared, there might still be
-open questions that need to be explored. Through the design document
-process, you are responsible for getting answers to these open
-questions. All open questions should be answered by the time a design
-document is merged.
--->

From 3b80531e29ffceea663bc1b357597e275546804e Mon Sep 17 00:00:00 2001
From: Cara Haas <cara@materialize.com>
Date: Mon, 23 Oct 2023 16:48:11 -0400
Subject: [PATCH 5/5] update roadmap NOW to reflect @antiguru 's updates

---
 doc/developer/design/20231002_cluster_vision.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/developer/design/20231002_cluster_vision.md b/doc/developer/design/20231002_cluster_vision.md
index 849dbcfd9b67..e1324aa432b2 100644
--- a/doc/developer/design/20231002_cluster_vision.md
+++ b/doc/developer/design/20231002_cluster_vision.md
@@ -105,8 +105,8 @@ query plan changes, which do not serve customers' production workflows, if they
 
 ### Roadmap
 **Now**
-* @antiguru to complete `ALTER...SET CLUSTER` [#20841](https://github.com/MaterializeInc/materialize/issues/20841), without graceful rehydration.
-* @antiguru to continue in-flight work on multipurpose clusters [#17413](https://github.com/MaterializeInc/materialize/issues/17413) - TODO(@antiguru): fill in details.
+* @antiguru to work on `ALTER...SET CLUSTER` [#17417](https://github.com/MaterializeInc/materialize/issues/17417), without graceful rehydration.
+* @antiguru to continue in-flight work on multipurpose clusters [#17413](https://github.com/MaterializeInc/materialize/issues/17413), which co-locates compute and storage objects ([PR #21846](https://github.com/MaterializeInc/materialize/pull/21846)).
 * @ggnall to do discovery on the prescriptive data model as part of Blue/Green deployments project [#19748](https://github.com/MaterializeInc/materialize/issues/19748)
 * Expose rehydration status [#22166](https://github.com/MaterializeInc/materialize/issues/22166)