doc: Add design for scoped feature flags (per-cluster and per-replica) #36947
Open
antiguru wants to merge 8 commits into
Two context kinds (cluster-coherent vs replica-local), dual id/name attributes for role-vs-incarnation targeting, in-memory reconciled storage instead of ALTER SYSTEM FOR CLUSTER DDL, and server-side service-connection billing.
Cluster-scoped LD overrides manual CREATE CLUSTER FEATURES, consistent with LD overriding ALTER SYSTEM globally. Ordering: env-wide LD < manual FEATURES < cluster-scoped LD, decided per-feature via variation_detail reason.
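The ordering in this commit (env-wide LD < manual FEATURES < cluster-scoped LD) can be sketched as a per-feature fallback chain. This is a minimal illustration, not the PR's code: `FeatureLayers` and its field names are hypothetical, and each layer is modeled as an `Option<bool>` that is `None` when that layer has no opinion on the feature.

```rust
/// Hypothetical per-feature layers; names are illustrative, not from the PR.
#[derive(Debug, Clone, Copy, Default)]
pub struct FeatureLayers {
    /// Value from the environment-wide LaunchDarkly evaluation, if any.
    pub env_wide_ld: Option<bool>,
    /// Value from a manual CREATE CLUSTER FEATURES setting, if any.
    pub manual_features: Option<bool>,
    /// Value from the cluster-scoped LaunchDarkly evaluation, if any.
    pub cluster_scoped_ld: Option<bool>,
}

impl FeatureLayers {
    /// The highest-precedence layer that is set wins;
    /// `default` applies only when no layer is set.
    pub fn resolve(&self, default: bool) -> bool {
        self.cluster_scoped_ld
            .or(self.manual_features)
            .or(self.env_wide_ld)
            .unwrap_or(default)
    }
}
```

Because the decision is made per feature, a cluster can take one feature from cluster-scoped LD and another from its manual FEATURES setting at the same time.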
…taxonomy
- ParameterScope (Environment/Cluster/Replica) declared at definition; drives evaluation, resolution, validation, and docs.
- Size family taxonomy sourced from ClusterReplicaSizeMap (new per-size field).
- Mark context-list growth and sync cadence as deferrable, non-blocking.
Every scoped evaluation keeps environment/organization/build in the multi-context alongside the scope context so rules can cross axes; do not duplicate env attributes onto cluster/replica contexts.
is_builtin is a clean invariant (System id / s-prefix) so the attribute is readable sugar; replica_size_family is a curated mapping that can't be safely derived via startsWith/endsWith, so the explicit attribute from the size map is required.
Scoped per-cluster/per-replica overrides are now persisted so they survive environmentd restart and LD unavailability (serving last-known values, falling back to env-wide only on a cold cache). Keyed by object id, sole writer is the sync loop, no user DDL; non-reused ids keep stale entries inert so GC is lazy hygiene, not a correctness concern. Matches how global flags already persist.
- Record a scoped row on difference-from-env-wide (not variation_detail reason): restores sparseness and makes the rule uniform across scopes. FALLTHROUGH is the env-wide value and RULE_MATCH can't identify the matching context kind, so the reason-based rule was both dense and incorrect.
- Storage is two flat collections (cluster_/replica_system_configurations), mirroring system_configurations, not one sum-typed key.
- Working copy rides in CatalogState/Arc<Catalog> so cluster overrides apply on fast-path peeks and bootstrap re-optimization, not just sequencing.
- Clarify GC happens on first reconcile after startup, not at startup.
- Note ReplicaAllocation::family() fallback to cc/legacy.
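The difference-from-env-wide rule can be sketched as follows. This is a rough in-memory stand-in for one of the two scoped collections, under stated assumptions: `ScopedOverrides`, its methods, and the string-typed keys and values are hypothetical, and the real storage is the persisted catalog collection rather than a `BTreeMap`.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one scoped collection,
/// keyed by (object id, parameter name).
#[derive(Debug, Default)]
pub struct ScopedOverrides {
    rows: BTreeMap<(String, String), String>,
}

impl ScopedOverrides {
    /// Reconcile one scoped evaluation against the env-wide value.
    /// A row exists only while the scoped value differs, which keeps
    /// the collection sparse regardless of the evaluation reason.
    pub fn reconcile(&mut self, object_id: &str, name: &str, scoped: &str, env_wide: &str) {
        let key = (object_id.to_string(), name.to_string());
        if scoped == env_wide {
            self.rows.remove(&key);
        } else {
            self.rows.insert(key, scoped.to_string());
        }
    }

    /// Return the scoped row if one exists, otherwise the env-wide value.
    pub fn resolve<'a>(&'a self, object_id: &str, name: &str, env_wide: &'a str) -> &'a str {
        self.rows
            .get(&(object_id.to_string(), name.to_string()))
            .map(String::as_str)
            .unwrap_or(env_wide)
    }

    /// Number of stored override rows.
    pub fn row_count(&self) -> usize {
        self.rows.len()
    }
}
```

With this rule, a cluster whose scoped evaluation equals the env-wide value stores nothing, so the number of rows tracks the number of actual deviations rather than the number of clusters times flags.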
mgree reviewed Jun 10, 2026
features per cluster from LD, e.g. enabling a feature on the catalog server
(or a specific user cluster) without affecting everyone else.

2. **Per-replica overrides (replica-local flags), keyed by size family.** We are
Contributor
How does this interact with self-managed, where (as I understand it) customers can create their own replica sizes?
/// Cluster-coherent: env-wide base + per-cluster overrides. Evaluated with
/// the `cluster` context (replica-free) and resolved at plan time via
/// `OptimizerFeatureOverrides`. e.g. optimizer features.
Cluster,
Contributor
Is there some way to use this information to automate the OptimizerConfig/OptimizerFeatureOverrides? That would remove a lot of friction from making optimizer feature flags.
Member
Author
I think yes, optimizer feature flags are just cluster-scoped flags. With my proposal, we likely wouldn't need the overrides anymore. That said, I'm not very familiar with the optimizer overrides.
Motivation
This design document outlines the approach for implementing scoped feature flags in Materialize, enabling per-cluster and per-replica LaunchDarkly overrides. This addresses two concrete use cases:
- … mz_catalog_server) should be able to run with different optimizer feature sets without affecting the rest of the environment.
- … D.1, etc.) should support different configurations (e.g., legacy sizes keep lgalloc, while D.1 enables the persist pager and LZ4 compression).

Currently, feature flags are evaluated once per environment with no way to target specific clusters or replicas.
Description
This document introduces a comprehensive design for scoped system parameters that extends the existing LaunchDarkly integration to support two new scope classes:
- … cluster context kind, applied at plan time via OptimizerFeatureOverrides. Ensures all replicas of a cluster run the same optimized plans.
- … replica context kind (which includes cluster and size family attributes), resolved at the controller's per-replica dyncfg push.

Key design decisions:
- … cluster and replica) to enforce coherence boundaries: cluster-coherent flags cannot vary by replica.
- … Environment, Cluster, or Replica), enabling documentation, validation, and efficient evaluation.
- … OptimizerFeatureOverrides at plan time.

The design includes a minimal viable prototype roadmap, worked examples, and discussion of alternatives and open questions (deferred operational tuning concerns).
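To make the coherence boundary concrete, here is a minimal sketch of scope-gated resolution. The `ParameterScope` variants come from the design; the `Resolver` struct, its fields, and the string-typed ids and values are assumptions for illustration. In particular, the sketch assumes a replica-scoped parameter consults only the replica collection before falling back to env-wide, which the description above does not spell out.

```rust
use std::collections::BTreeMap;

/// Scope declared at parameter definition (variant names from the design).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ParameterScope {
    Environment,
    Cluster,
    Replica,
}

/// Hypothetical resolver: env-wide values plus two flat override maps,
/// each keyed by (object id, parameter name).
#[derive(Debug, Default)]
pub struct Resolver {
    pub env: BTreeMap<String, String>,
    pub cluster: BTreeMap<(String, String), String>,
    pub replica: BTreeMap<(String, String), String>,
}

impl Resolver {
    /// Resolve a parameter according to its declared scope.
    /// A Cluster-scoped parameter never reads the replica map, so it
    /// cannot vary between replicas of the same cluster.
    pub fn resolve(
        &self,
        scope: ParameterScope,
        name: &str,
        cluster_id: &str,
        replica_id: &str,
    ) -> Option<&str> {
        let env = self.env.get(name);
        let cluster = self.cluster.get(&(cluster_id.to_string(), name.to_string()));
        let replica = self.replica.get(&(replica_id.to_string(), name.to_string()));
        let chosen = match scope {
            ParameterScope::Environment => env,
            ParameterScope::Cluster => cluster.or(env),
            // Assumption: replica scope falls back directly to env-wide.
            ParameterScope::Replica => replica.or(env),
        };
        chosen.map(String::as_str)
    }
}
```

Declaring the scope on the parameter, rather than inferring it from which overrides happen to exist, is what lets a stray replica-level row be ignored for a cluster-coherent flag.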
Verification
This is a design document with no code changes to verify. The document is self-contained and ready for review and discussion before implementation begins.
https://claude.ai/code/session_01S9fiehWEbC4BEXEq8p7LP9