Description
What happened?
We were performing a disaster recovery exercise where we start a cluster in single redundancy mode, restore from backup, and then reconfigure the cluster to three_data_hall redundancy mode. Upon changing the redundancy mode, the cluster became unavailable.
We think this is because the locality_data_hall argument for the server processes was not yet set when the database configuration change was executed, so the database was unable to recruit enough roles for the three_data_hall configuration.
Here are some selected logs:
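For context, three_data_hall replication can only recruit a working transaction subsystem once the processes advertise data hall localities spanning three distinct halls. Roughly, every fdbserver in a given zone needs to be started with the flag the operator is adding in the logs below (a sketch; the binary path and zone values are illustrative):

/usr/bin/fdb/7.3/fdbserver ... --locality_data_hall=az1   # processes on nodes in zone az1
/usr/bin/fdb/7.3/fdbserver ... --locality_data_hall=az2   # processes on nodes in zone az2
/usr/bin/fdb/7.3/fdbserver ... --locality_data_hall=az3   # processes on nodes in zone az3

Without that locality on the running processes, the configure three_data_hall command commits a configuration the existing processes cannot satisfy.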
- We see the operator update the ConfigMap and then start updating pods:
Waiting for Kubernetes monitor config update
{
"cluster": "foundationdb-cluster-main",
"current": "<omitted>",
"desired": "<omitted>",
"logger": "controller",
"namespace": "foundationdb",
"pod": "foundationdb-cluster-main-log-16283",
"time": "2025-10-07T16:18:12Z"
}
(The only difference between the current and desired configs is the addition of the locality_data_hall argument:
map[type:Concatenate values:[map[type:Literal value:--locality_data_hall=] map[source:NODE_LABEL_TOPOLOGY_KUBERNETES_IO_ZONE type:Environment]]]
See the sketch after these log excerpts for how to check whether a process has actually picked this up.)
Updating pod
{
"FoundationDBCluster": "map[name:foundationdb-cluster-main namespace:foundationdb]",
"controller": "foundationdbcluster",
"controllerGroup": "apps.foundationdb.org",
"controllerKind": "FoundationDBCluster",
"name": "foundationdb-cluster-main-log-16283",
"namespace": "foundationdb",
"reconcileID": "b0f530e9-e28d-4ed2-aaf2-cf85aad77db6",
"time": "2025-10-07T16:18:12Z",
"updateMethod": "patch"
}
- The operator runs the configure database fdbcli command:
Configuring database
{
"cluster": "foundationdb-cluster-main",
"current configuration": "map[commit_proxies:8 grv_proxies:8 log_routers:-1 logs:9 perpetual_storage_wiggle:1 redundancy_mode:single remote_logs:-1 resolvers:1 storage_engine:ssd-redwood-1 storage_migration_type:gradual usable_regions:1]",
"desired configuration": "map[commit_proxies:2 grv_proxies:1 log_routers:-1 logs:4 perpetual_storage_wiggle:1 redundancy_mode:three_data_hall remote_logs:-1 resolvers:1 storage_engine:ssd-redwood-1 storage_migration_type:gradual usable_regions:1]",
"logger": "controller",
"namespace": "foundationdb",
"reconciler": "controllers.updateDatabaseConfiguration",
"time": "2025-10-07T16:18:14Z",
"traceID": "cd51da45-f3bb-481b-bff6-a3e47f91a2a4"
}
(we also changed some of the role counts in this case, but I don't think it's relevant because we replicated the issue by just changing the redundancy_mode)
Running command
{
"args": "[/usr/bin/fdb/7.3/fdbcli --exec configure three_data_hall ssd-redwood-1 usable_regions=1 logs=4 resolvers=1 log_routers=-1 remote_logs=-1 commit_proxies=2 grv_proxies=1 regions=[] storage_migration_type=gradual perpetual_storage_wiggle=1 -C /tmp/adc20047-f7a4-4dce-a203-1128ca9a3a59-cli/1735726298 --log --trace_format xml --log-dir /var/log/fdb --timeout 10]",
"cluster": "foundationdb-cluster-main",
"logger": "controller.fdbclient",
"namespace": "foundationdb",
"path": "/usr/bin/fdb/7.3/fdbcli",
"reconciler": "controllers.updateDatabaseConfiguration",
"time": "2025-10-07T16:18:14Z",
"traceID": "cd51da45-f3bb-481b-bff6-a3e47f91a2a4"
}
- The operator is then unable to do anything further because any operations like taking a lock or trying to get the status fail since the database is no longer available.
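(One way to confirm the problem, assuming ps is available in the image and the main container is named foundationdb, is to check whether a given process has actually picked up the new argument, e.g.:

❯ kubectl exec -n foundationdb foundationdb-cluster-main-log-16283 -c foundationdb -- ps -ef | grep locality_data_hall

In our case the running fdbserver processes had not been restarted, so they did not yet have --locality_data_hall on their command lines.)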
We were able to remediate the situation by manually exec-ing into each pod and killing the fdbserver process. This caused fdbmonitor to restart the process with the correct arguments, and the database then recovered. Running kill; kill all in fdbcli did not work, since the database was unavailable.
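Per pod, the manual remediation looked roughly like this (a sketch; the container name and the availability of pkill inside the image are assumptions, and the fdbserver process can also be signalled by PID from a shell in the container):

❯ kubectl exec -n foundationdb foundationdb-cluster-main-log-16283 -c foundationdb -- pkill -x fdbserver

fdbmonitor then restarts fdbserver with the arguments from the updated config, including --locality_data_hall.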
What did you expect to happen?
Note that there is no process restart after the pods get the locality_data_hall configuration. Looking at the reconciliation loop, it seems like the bounceProcesses reconciler is after the updateDatabaseConfiguration reconciler, but by the time it gets there, the database is already unavailable.
I think what should happen is:
- The pod configs get updated with the locality_data_hall
- The operator bounces all the processes
- Then it runs updateDatabaseConfiguration
Perhaps there should be an additional check in ConfigurationChangeAllowed so that the database configuration change is skipped until the cluster is otherwise fully reconciled?
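To illustrate the suggestion, here is a minimal sketch of such a guard. This is not the operator's actual ConfigurationChangeAllowed signature; the type and field names are hypothetical and only show the shape of the check:

// Hypothetical sketch: defer a redundancy-mode change until every process
// group has picked up and been restarted with its new monitor configuration
// (e.g. new locality flags). Not the operator's real API.
package main

import "fmt"

// processGroupStatus is a simplified stand-in for the operator's process
// group status fields.
type processGroupStatus struct {
	ID               string
	ConfigUpToDate   bool // monitor/pod config matches the desired spec
	RestartedWithNew bool // the fdbserver process was bounced after the config change
}

// configurationChangeAllowed returns false (with a reason) if any process
// group would still be missing the new command-line arguments, so the
// configure call is deferred to a later reconciliation.
func configurationChangeAllowed(redundancyModeChanged bool, groups []processGroupStatus) (bool, string) {
	if !redundancyModeChanged {
		return true, ""
	}
	for _, g := range groups {
		if !g.ConfigUpToDate || !g.RestartedWithNew {
			return false, fmt.Sprintf("process group %s has not picked up the new locality configuration yet", g.ID)
		}
	}
	return true, ""
}

func main() {
	groups := []processGroupStatus{
		{ID: "log-16283", ConfigUpToDate: true, RestartedWithNew: false},
	}
	if ok, reason := configurationChangeAllowed(true, groups); !ok {
		fmt.Println("deferring configure:", reason)
	}
}

The idea is simply that a redundancy-mode upgrade gets deferred to a later reconciliation until every process group has been bounced with its new locality arguments.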
How can we reproduce it (as minimally and precisely as possible)?
- Start a cluster in single redundancy mode (or any other mode that is not three_data_hall)
- Update the cluster spec to three_data_hall redundancy mode (a sketch of this change is below)
- The operator should update the database configuration, and the cluster should then become unavailable
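For reference, the spec change in the second step can be as small as flipping the redundancy mode (a sketch; it assumes the redundancy mode is set under spec.databaseConfiguration as in the operator's sample clusters, and that the zone-based locality_data_hall argument is already part of the process configuration as shown above):

❯ kubectl -n foundationdb patch foundationdbcluster foundationdb-cluster-main --type merge \
    -p '{"spec":{"databaseConfiguration":{"redundancy_mode":"three_data_hall"}}}'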
Anything else we need to know?
No response
FDB Kubernetes operator
❯ kubectl fdb version -n foundationdb -o foundationdb-operator-helm
foundationdb-operator: 2.9.011
kubectl-fdb: latest

We are on a fork, so the open source operator version is 2.9.0.
Kubernetes version
❯ kubectl version
Client Version: v1.32.1
Kustomize Version: v5.5.0
Server Version: v1.32.801