doc: Add design for scoped feature flags (per-cluster and per-replica)#36947

Open
antiguru wants to merge 8 commits into main from claude/gifted-mccarthy-owiuqv

@antiguru antiguru commented Jun 9, 2026

Member

Motivation

This design document outlines the approach for implementing scoped feature flags in Materialize, enabling per-cluster and per-replica LaunchDarkly overrides. This addresses two concrete use cases:

  1. Per-cluster optimizer flags: Different clusters (e.g., mz_catalog_server) should be able to run with different optimizer feature sets without affecting the rest of the environment.
  2. Per-replica flags by size family: Different replica size families (legacy t-shirt sizes, D.1, etc.) should support different configurations (e.g., legacy sizes keep lgalloc, while D.1 enables the persist pager and LZ4 compression).

Currently, feature flags are evaluated once per environment with no way to target specific clusters or replicas.

Description

This document introduces a comprehensive design for scoped system parameters that extends the existing LaunchDarkly integration to support two new scope classes:

  • Cluster-coherent flags: Evaluated replica-free using a cluster context kind, applied at plan time via OptimizerFeatureOverrides. Ensures all replicas of a cluster run the same optimized plans.
  • Replica-local flags: Evaluated per-replica using a replica context kind (which includes cluster and size family attributes), resolved at the controller's per-replica dyncfg push.

Key design decisions:

  1. Dual context kinds (cluster and replica) to enforce coherence boundaries — cluster-coherent flags cannot vary by replica.
  2. Required scope declaration on every synced parameter (Environment, Cluster, or Replica), enabling documentation, validation, and efficient evaluation.
  3. In-memory reconciliation from LD (not durable DDL) — scoped overrides are cached from continuous LD evaluation, avoiding recreate ambiguity and simplifying fallback behavior.
  4. Dual id/name attributes on contexts to support both role-based predicates (survive recreate) and incarnation pins (die with the object).
  5. Resolution at existing boundaries — replica-local flags resolve at the controller's dyncfg push; cluster-coherent flags feed into OptimizerFeatureOverrides at plan time.
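
As an illustration of decisions 1, 2, and 5, here is a minimal sketch of how a declared scope could drive lookup. Only the scope names (Environment, Cluster, Replica) come from the description above; the `ScopedValues` struct, its fields, and the `resolve` signature are hypothetical and are not the API proposed in the design document.

```rust
use std::collections::HashMap;

/// Scope declared on every synced parameter (names from the PR description).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ParameterScope {
    Environment,
    Cluster,
    Replica,
}

/// Hypothetical in-memory view of values reconciled from LaunchDarkly.
/// Override maps are keyed by (object id, parameter name).
#[derive(Default)]
pub struct ScopedValues {
    pub env: HashMap<String, String>,
    pub cluster: HashMap<(String, String), String>,
    pub replica: HashMap<(String, String), String>,
}

impl ScopedValues {
    /// Returns the scoped override if one exists, else the env-wide value.
    /// A Cluster-scoped parameter never consults replica overrides, so all
    /// replicas of a cluster observe the same value.
    pub fn resolve(
        &self,
        scope: ParameterScope,
        param: &str,
        cluster_id: &str,
        replica_id: Option<&str>,
    ) -> Option<&str> {
        let key = |id: &str| (id.to_string(), param.to_string());
        let scoped = match scope {
            ParameterScope::Environment => None,
            ParameterScope::Cluster => self.cluster.get(&key(cluster_id)),
            ParameterScope::Replica => {
                replica_id.and_then(|r| self.replica.get(&key(r)))
            }
        };
        scoped.or_else(|| self.env.get(param)).map(String::as_str)
    }
}
```

In this sketch the coherence boundary is enforced by construction: the `Cluster` arm has no path to the replica map, so a misconfigured replica rule cannot split a cluster's plans.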

The design includes a minimal viable prototype roadmap, worked examples, and discussion of alternatives and open questions (deferred operational tuning concerns).

Verification

This is a design document with no code changes to verify. The document is self-contained and ready for review and discussion before implementation begins.

https://claude.ai/code/session_01S9fiehWEbC4BEXEq8p7LP9

claude added 7 commits June 9, 2026 18:19
Two context kinds (cluster-coherent vs replica-local), dual id/name attributes
for role-vs-incarnation targeting, in-memory reconciled storage instead of
ALTER SYSTEM FOR CLUSTER DDL, and server-side service-connection billing.

Cluster-scoped LD overrides manual CREATE CLUSTER FEATURES, consistent with LD
overriding ALTER SYSTEM globally. Ordering: env-wide LD < manual FEATURES <
cluster-scoped LD, decided per-feature via variation_detail reason.

…taxonomy

- ParameterScope (Environment/Cluster/Replica) declared at definition; drives
  evaluation, resolution, validation, and docs.
- Size family taxonomy sourced from ClusterReplicaSizeMap (new per-size field).
- Mark context-list growth and sync cadence as deferrable, non-blocking.

Every scoped evaluation keeps environment/organization/build in the
multi-context alongside the scope context so rules can cross axes; do not
duplicate env attributes onto cluster/replica contexts.

is_builtin is a clean invariant (System id / s-prefix) so the attribute is
readable sugar; replica_size_family is a curated mapping that can't be safely
derived via startsWith/endsWith, so the explicit attribute from the size map is
required.

Scoped per-cluster/per-replica overrides are now persisted so they survive
environmentd restart and LD unavailability (serving last-known values, falling
back to env-wide only on a cold cache). Keyed by object id, sole writer is the
sync loop, no user DDL; non-reused ids keep stale entries inert so GC is lazy
hygiene, not a correctness concern. Matches how global flags already persist.

- Record a scoped row on difference-from-env-wide (not variation_detail
  reason): restores sparseness and makes the rule uniform across scopes.
  FALLTHROUGH is the env-wide value and RULE_MATCH can't identify the matching
  context kind, so the reason-based rule was both dense and incorrect.
- Storage is two flat collections (cluster_/replica_system_configurations),
  mirroring system_configurations, not one sum-typed key.
- Working copy rides in CatalogState/Arc<Catalog> so cluster overrides apply on
  fast-path peeks and bootstrap re-optimization, not just sequencing.
- Clarify GC happens on first reconcile after startup, not at startup.
- Note ReplicaAllocation::family() fallback to cc/legacy.
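
The difference-from-env-wide rule in the last commit can be sketched roughly as follows. This is an illustration under assumptions, not the code in the design document: the function names, the string-typed values, and the use of a `BTreeMap` keyed by object id are all hypothetical.

```rust
use std::collections::BTreeMap;

/// Keep a scoped row only where the value evaluated for an object differs
/// from the env-wide value, so the stored collection stays sparse.
/// `evaluated` maps a (hypothetical) object id to its evaluated value.
pub fn sparse_overrides(
    env_wide: &str,
    evaluated: &BTreeMap<String, String>,
) -> BTreeMap<String, String> {
    evaluated
        .iter()
        .filter(|(_, v)| v.as_str() != env_wide)
        .map(|(id, v)| (id.clone(), v.clone()))
        .collect()
}

/// Look up the effective value for an object: its scoped row if one was
/// recorded, otherwise the env-wide value.
pub fn effective<'a>(
    env_wide: &'a str,
    rows: &'a BTreeMap<String, String>,
    id: &str,
) -> &'a str {
    rows.get(id).map(String::as_str).unwrap_or(env_wide)
}
```

Because an absent row means "same as env-wide", the rule does not need to inspect the evaluation reason at all, which is what makes it uniform across the cluster and replica scopes.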
@antiguru antiguru marked this pull request as ready for review June 10, 2026 16:40
@antiguru antiguru requested review from aljoscha, def- and ggevay June 10, 2026 16:41
features per cluster from LD, e.g. enabling a feature on the catalog server
(or a specific user cluster) without affecting everyone else.

2. **Per-replica overrides (replica-local flags), keyed by size family.** We are

Contributor

How does this interact with self-managed, where (as I understand it) customers can create their own replica sizes?

/// Cluster-coherent: env-wide base + per-cluster overrides. Evaluated with
/// the `cluster` context (replica-free) and resolved at plan time via
/// `OptimizerFeatureOverrides`. e.g. optimizer features.
Cluster,

Contributor

Is there some way to use this information to automate the OptimizerConfig/OptimizerFeatureOverrides? That would remove a lot of friction from making optimizer feature flags.

Member Author

I think yes, optimizer feature flags are just cluster-scoped flags. With my proposal, we likely wouldn't need the overrides anymore. That said, I'm not very familiar with the optimizer overrides.
