Changing redundancy_mode to three_data_hall on a running cluster causes the cluster to become unavailable #2377

@hxu

Description

What happened?

We were performing a disaster recovery exercise where we start a cluster in single redundancy mode, restore from backup, and then reconfigure the cluster to three_data_hall redundancy mode. Upon changing the redundancy mode, the cluster became unavailable.

We think this is because the locality_data_hall argument for the server processes was not yet set when the database configuration change was executed, so the database was unable to recruit enough roles for the three_data_hall configuration.
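To make the failure mode concrete: three_data_hall replication needs processes spread across at least three distinct data halls, which in turn requires every process to report a data_hall locality. Below is a minimal pre-flight sketch (illustrative only, not operator code) using the FoundationDB Go bindings; it reads the \xff\xff/status/json special key and checks exactly that condition. The struct models only the fields it reads.

package main

import (
    "encoding/json"
    "fmt"
    "log"

    "github.com/apple/foundationdb/bindings/go/src/fdb"
)

func main() {
    fdb.MustAPIVersion(730)
    db := fdb.MustOpenDefault()

    // Read the machine-readable cluster status through the special key space.
    raw, err := db.ReadTransact(func(tr fdb.ReadTransaction) (interface{}, error) {
        return tr.Get(fdb.Key("\xff\xff/status/json")).Get()
    })
    if err != nil {
        log.Fatal(err)
    }

    // Model only the fields we need: per-process locality.
    var status struct {
        Cluster struct {
            Processes map[string]struct {
                Locality map[string]string `json:"locality"`
            } `json:"processes"`
        } `json:"cluster"`
    }
    if err := json.Unmarshal(raw.([]byte), &status); err != nil {
        log.Fatal(err)
    }

    halls := map[string]struct{}{}
    for id, process := range status.Cluster.Processes {
        hall := process.Locality["data_hall"]
        if hall == "" {
            log.Fatalf("process %s reports no data_hall locality; configuring three_data_hall now would stall recruitment", id)
        }
        halls[hall] = struct{}{}
    }
    if len(halls) < 3 {
        log.Fatalf("only %d distinct data hall(s) reported; three_data_hall needs at least 3", len(halls))
    }
    fmt.Printf("ok: %d distinct data halls reported\n", len(halls))
}

In our case this check would have failed before the configure command ran, because the pods had not yet been restarted with the new locality argument.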

Here are some selected logs:

  1. We see the operator update the ConfigMap and then start updating pods:
Waiting for Kubernetes monitor config update
{
    "cluster": "foundationdb-cluster-main",
    "current": "<omitted>",
    "desired": "<omitted>",
    "logger": "controller",
    "namespace": "foundationdb",
    "pod": "foundationdb-cluster-main-log-16283",
    "time": "2025-10-07T16:18:12Z"
}

(The only difference between the current and desired configs is the addition of the locality_data_hall argument:

map[type:Concatenate values:[map[type:Literal value:--locality_data_hall=] map[source:NODE_LABEL_TOPOLOGY_KUBERNETES_IO_ZONE type:Environment]]])
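For readability, that Go map dump corresponds to the following fdb-kubernetes-monitor argument, rendered here as JSON (the same content, just reformatted):

{
    "type": "Concatenate",
    "values": [
        {"type": "Literal", "value": "--locality_data_hall="},
        {"type": "Environment", "source": "NODE_LABEL_TOPOLOGY_KUBERNETES_IO_ZONE"}
    ]
}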

Updating pod
{
    "FoundationDBCluster": "map[name:foundationdb-cluster-main namespace:foundationdb]",
    "controller": "foundationdbcluster",
    "controllerGroup": "apps.foundationdb.org",
    "controllerKind": "FoundationDBCluster",
    "name": "foundationdb-cluster-main-log-16283",
    "namespace": "foundationdb",
    "reconcileID": "b0f530e9-e28d-4ed2-aaf2-cf85aad77db6",
    "time": "2025-10-07T16:18:12Z",
    "updateMethod": "patch"
}
  2. The operator runs the fdbcli configure command:
Configuring database
{
    "cluster": "foundationdb-cluster-main",
    "current configuration": "map[commit_proxies:8 grv_proxies:8 log_routers:-1 logs:9 perpetual_storage_wiggle:1 redundancy_mode:single remote_logs:-1 resolvers:1 storage_engine:ssd-redwood-1 storage_migration_type:gradual usable_regions:1]",
    "desired configuration": "map[commit_proxies:2 grv_proxies:1 log_routers:-1 logs:4 perpetual_storage_wiggle:1 redundancy_mode:three_data_hall remote_logs:-1 resolvers:1 storage_engine:ssd-redwood-1 storage_migration_type:gradual usable_regions:1]",
    "logger": "controller",
    "namespace": "foundationdb",
    "reconciler": "controllers.updateDatabaseConfiguration",
    "time": "2025-10-07T16:18:14Z",
    "traceID": "cd51da45-f3bb-481b-bff6-a3e47f91a2a4"
}

(We also changed some of the role counts in this case, but I don't think that's relevant: we reproduced the issue by changing only the redundancy_mode.)

Running command
{
    "args": "[/usr/bin/fdb/7.3/fdbcli --exec configure three_data_hall ssd-redwood-1 usable_regions=1 logs=4 resolvers=1 log_routers=-1 remote_logs=-1 commit_proxies=2 grv_proxies=1 regions=[] storage_migration_type=gradual perpetual_storage_wiggle=1 -C /tmp/adc20047-f7a4-4dce-a203-1128ca9a3a59-cli/1735726298 --log --trace_format xml --log-dir /var/log/fdb --timeout 10]",
    "cluster": "foundationdb-cluster-main",
    "logger": "controller.fdbclient",
    "namespace": "foundationdb",
    "path": "/usr/bin/fdb/7.3/fdbcli",
    "reconciler": "controllers.updateDatabaseConfiguration",
    "time": "2025-10-07T16:18:14Z",
    "traceID": "cd51da45-f3bb-481b-bff6-a3e47f91a2a4"
}
  3. The operator is then unable to make further progress: operations such as taking a lock or fetching the status fail because the database is no longer available.

We were able to remediate the situation by manually execing into each pod and killing the fdbserver process. fdbmonitor then restarted the processes with the correct arguments, and the database recovered. Running kill; kill all in fdbcli did not work, since the database was unavailable.

What did you expect to happen?

Note that there is no process restart after the pods receive the locality_data_hall configuration. Looking at the reconciliation loop, the bounceProcesses reconciler runs after the updateDatabaseConfiguration reconciler, but by the time it gets there the database is already unavailable.

I think what should happen is:

  1. The pod configs get updated with the locality_data_hall argument
  2. The operator bounces all the processes
  3. The operator then runs updateDatabaseConfiguration

Perhaps there should be an additional check in ConfigurationChangeAllowed so that the database configuration change is skipped until the cluster is otherwise fully reconciled? A sketch of the kind of guard I mean follows.
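A minimal sketch, assuming the operator's v1beta2 API types; configurationChangeAllowed is a hypothetical stand-in for the real ConfigurationChangeAllowed logic, and the import path may differ between operator versions:

package controllers

import (
    fdbv1beta2 "github.com/FoundationDB/fdb-kubernetes-operator/v2/api/v1beta2"
)

// configurationChangeAllowed is a hypothetical stand-in for the real
// ConfigurationChangeAllowed check, sketching the extra guard suggested
// above: if the redundancy mode is about to change, skip the database
// configuration change while any process group is still running with a
// stale command line (e.g. one missing --locality_data_hall), because
// recruitment under the new mode depends on the new locality arguments
// already being in effect.
func configurationChangeAllowed(cluster *fdbv1beta2.FoundationDBCluster) bool {
    if cluster.Spec.DatabaseConfiguration.RedundancyMode == cluster.Status.DatabaseConfiguration.RedundancyMode {
        return true
    }

    for _, processGroup := range cluster.Status.ProcessGroups {
        // IncorrectConfigMap: the pod has not picked up the new monitor
        // config yet. IncorrectCommandLine: the fdbserver process has not
        // been bounced onto the new arguments yet.
        if processGroup.GetConditionTime(fdbv1beta2.IncorrectConfigMap) != nil ||
            processGroup.GetConditionTime(fdbv1beta2.IncorrectCommandLine) != nil {
            return false
        }
    }

    return true
}

As I understand it, the IncorrectConfigMap and IncorrectCommandLine process group conditions are how the operator already tracks pods that have not picked up new settings, so gating on them would keep the configure command from running before the bounce.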

How can we reproduce it (as minimally and precisely as possible)?

  1. Start a cluster in single redundancy mode (or any other mode that is not three_data_hall)
  2. Update the cluster spec to three_data_hall redundancy mode
  3. The operator updates the database configuration and the cluster becomes unavailable

Anything else we need to know?

No response

FDB Kubernetes operator

kubectl fdb version -n foundationdb -o foundationdb-operator-helm
foundationdb-operator: 2.9.011
kubectl-fdb: latest

We are on a fork; the upstream open-source operator version is 2.9.0.

Kubernetes version

kubectl version
Client Version: v1.32.1
Kustomize Version: v5.5.0
Server Version: v1.32.801

Cloud provider

AWS
