Changing redundancy_mode to three_data_hall on a running cluster causes the cluster to become unavailable #2377

@hxu

Description

What happened?

We were performing a disaster recovery exercise where we start a cluster in single redundancy mode, restore from backup, and then reconfigure the cluster to three_data_hall redundancy mode. Upon changing the redundancy mode, the cluster became unavailable.

We think this is because the locality_data_hall argument for the server processes was not yet set when the database configuration change was executed, so the database was unable to recruit enough roles for the three_data_hall configuration.
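To make the failure mode concrete: three_data_hall replication needs processes spread across at least three distinct data halls, which in turn requires every process to report a data_hall locality. Below is a minimal pre-flight sketch (illustrative only, not operator code) using the FoundationDB Go bindings; it reads the \xff\xff/status/json special key and checks exactly that condition. The struct models only the fields it reads.

package main

import (
    "encoding/json"
    "fmt"
    "log"

    "github.com/apple/foundationdb/bindings/go/src/fdb"
)

func main() {
    fdb.MustAPIVersion(730)
    db := fdb.MustOpenDefault()

    // Read the machine-readable cluster status through the special key space.
    raw, err := db.ReadTransact(func(tr fdb.ReadTransaction) (interface{}, error) {
        return tr.Get(fdb.Key("\xff\xff/status/json")).Get()
    })
    if err != nil {
        log.Fatal(err)
    }

    // Model only the fields we need: per-process locality.
    var status struct {
        Cluster struct {
            Processes map[string]struct {
                Locality map[string]string `json:"locality"`
            } `json:"processes"`
        } `json:"cluster"`
    }
    if err := json.Unmarshal(raw.([]byte), &status); err != nil {
        log.Fatal(err)
    }

    halls := map[string]struct{}{}
    for id, process := range status.Cluster.Processes {
        hall := process.Locality["data_hall"]
        if hall == "" {
            log.Fatalf("process %s reports no data_hall locality; configuring three_data_hall now would stall recruitment", id)
        }
        halls[hall] = struct{}{}
    }
    if len(halls) < 3 {
        log.Fatalf("only %d distinct data hall(s) reported; three_data_hall needs at least 3", len(halls))
    }
    fmt.Printf("ok: %d distinct data halls reported\n", len(halls))
}

In our case this check would have failed before the configure command ran, because the pods had not yet been restarted with the new locality argument.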

Here are some selected logs:

  1. We see the operator update the ConfigMap and then start updating pods:
Waiting for Kubernetes monitor config update
{
    "cluster": "foundationdb-cluster-main",
    "current": "<omitted>",
    "desired": "<omitted>",
    "logger": "controller",
    "namespace": "foundationdb",
    "pod": "foundationdb-cluster-main-log-16283",
    "time": "2025-10-07T16:18:12Z"
}

(The only difference between the current and desired configs is the addition of the locality_data_hall argument:

map[type:Concatenate values:[map[type:Literal value:--locality_data_hall=] map[source:NODE_LABEL_TOPOLOGY_KUBERNETES_IO_ZONE type:Environment]]])
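For readability, that Go map dump corresponds to the following fdb-kubernetes-monitor argument, rendered here as JSON (the same content, just reformatted):

{
    "type": "Concatenate",
    "values": [
        {"type": "Literal", "value": "--locality_data_hall="},
        {"type": "Environment", "source": "NODE_LABEL_TOPOLOGY_KUBERNETES_IO_ZONE"}
    ]
}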

Updating pod
{
    "FoundationDBCluster": "map[name:foundationdb-cluster-main namespace:foundationdb]",
    "controller": "foundationdbcluster",
    "controllerGroup": "apps.foundationdb.org",
    "controllerKind": "FoundationDBCluster",
    "name": "foundationdb-cluster-main-log-16283",
    "namespace": "foundationdb",
    "reconcileID": "b0f530e9-e28d-4ed2-aaf2-cf85aad77db6",
    "time": "2025-10-07T16:18:12Z",
    "updateMethod": "patch"
}
  2. The operator runs the fdbcli configure command:
Configuring database
{
    "cluster": "foundationdb-cluster-main",
    "current configuration": "map[commit_proxies:8 grv_proxies:8 log_routers:-1 logs:9 perpetual_storage_wiggle:1 redundancy_mode:single remote_logs:-1 resolvers:1 storage_engine:ssd-redwood-1 storage_migration_type:gradual usable_regions:1]",
    "desired configuration": "map[commit_proxies:2 grv_proxies:1 log_routers:-1 logs:4 perpetual_storage_wiggle:1 redundancy_mode:three_data_hall remote_logs:-1 resolvers:1 storage_engine:ssd-redwood-1 storage_migration_type:gradual usable_regions:1]",
    "logger": "controller",
    "namespace": "foundationdb",
    "reconciler": "controllers.updateDatabaseConfiguration",
    "time": "2025-10-07T16:18:14Z",
    "traceID": "cd51da45-f3bb-481b-bff6-a3e47f91a2a4"
}

(We also changed some of the role counts in this case, but I don't think that's relevant: we reproduced the issue by changing only the redundancy_mode.)

Running command
{
    "args": "[/usr/bin/fdb/7.3/fdbcli --exec configure three_data_hall ssd-redwood-1 usable_regions=1 logs=4 resolvers=1 log_routers=-1 remote_logs=-1 commit_proxies=2 grv_proxies=1 regions=[] storage_migration_type=gradual perpetual_storage_wiggle=1 -C /tmp/adc20047-f7a4-4dce-a203-1128ca9a3a59-cli/1735726298 --log --trace_format xml --log-dir /var/log/fdb --timeout 10]",
    "cluster": "foundationdb-cluster-main",
    "logger": "controller.fdbclient",
    "namespace": "foundationdb",
    "path": "/usr/bin/fdb/7.3/fdbcli",
    "reconciler": "controllers.updateDatabaseConfiguration",
    "time": "2025-10-07T16:18:14Z",
    "traceID": "cd51da45-f3bb-481b-bff6-a3e47f91a2a4"
}
  3. The operator is then unable to make further progress: operations such as taking a lock or fetching the status fail because the database is no longer available.

We were able to remediate the situation by manually execing into each pod and killing the fdbserver process. fdbmonitor then restarted the processes with the correct arguments, and the database recovered. Running kill; kill all in fdbcli did not work, since the database was unavailable.

What did you expect to happen?

Note that there is no process restart after the pods receive the locality_data_hall configuration. Looking at the reconciliation loop, the bounceProcesses reconciler runs after the updateDatabaseConfiguration reconciler, but by the time it gets there the database is already unavailable.

I think what should happen is:

  1. The pod configs get updated with the locality_data_hall argument
  2. The operator bounces all the processes
  3. The operator then runs updateDatabaseConfiguration

Perhaps there should be an additional check in ConfigurationChangeAllowed so that the database configuration change is skipped until the cluster is otherwise fully reconciled? A sketch of the kind of guard I mean follows.
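A minimal sketch, assuming the operator's v1beta2 API types; configurationChangeAllowed is a hypothetical stand-in for the real ConfigurationChangeAllowed logic, and the import path may differ between operator versions:

package controllers

import (
    fdbv1beta2 "github.com/FoundationDB/fdb-kubernetes-operator/v2/api/v1beta2"
)

// configurationChangeAllowed is a hypothetical stand-in for the real
// ConfigurationChangeAllowed check, sketching the extra guard suggested
// above: if the redundancy mode is about to change, skip the database
// configuration change while any process group is still running with a
// stale command line (e.g. one missing --locality_data_hall), because
// recruitment under the new mode depends on the new locality arguments
// already being in effect.
func configurationChangeAllowed(cluster *fdbv1beta2.FoundationDBCluster) bool {
    if cluster.Spec.DatabaseConfiguration.RedundancyMode == cluster.Status.DatabaseConfiguration.RedundancyMode {
        return true
    }

    for _, processGroup := range cluster.Status.ProcessGroups {
        // IncorrectConfigMap: the pod has not picked up the new monitor
        // config yet. IncorrectCommandLine: the fdbserver process has not
        // been bounced onto the new arguments yet.
        if processGroup.GetConditionTime(fdbv1beta2.IncorrectConfigMap) != nil ||
            processGroup.GetConditionTime(fdbv1beta2.IncorrectCommandLine) != nil {
            return false
        }
    }

    return true
}

As I understand it, the IncorrectConfigMap and IncorrectCommandLine process group conditions are how the operator already tracks pods that have not picked up new settings, so gating on them would keep the configure command from running before the bounce.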

How can we reproduce it (as minimally and precisely as possible)?

  1. Start a cluster in single redundancy mode (or any other mode that is not three_data_hall)
  2. Update the cluster spec to three_data_hall redundancy mode
  3. The operator updates the database configuration and the cluster becomes unavailable

Anything else we need to know?

No response

FDB Kubernetes operator

kubectl fdb version -n foundationdb -o foundationdb-operator-helm
foundationdb-operator: 2.9.011
kubectl-fdb: latest

We are on a fork; the upstream open-source operator version is 2.9.0.

Kubernetes version

kubectl version
Client Version: v1.32.1
Kustomize Version: v5.5.0
Server Version: v1.32.801

Cloud provider

AWS
