Cluster recovery improvements #13754

MggMuggins · 2024-07-12T23:01:15Z

A detailed description of the problem this resolves is in the issue. This PR:

lxd cluster edit:
- Prompts the user with a warning & link to the docs prior to cluster edit
- Includes more information about member roles in the yaml during editing
- Generates a tarball (/var/snap/lxd/common/lxd/database/recovery_db.tar.gz) with the contents of the database & the new raft configuration as yaml
- Creates a patch.global.sql to update the addresses of any nodes that were changed in the global nodes table

On daemon startup, LXD looks for and loads a recovery tarball at /var/snap/lxd/common/lxd/database/recovery_db.tar.gz and:
- Replaces the existing database contents with the incoming contents
- Updates cluster member addresses in the local DB's raft_nodes table

Currently the unpack code also creates the global SQL patch on each node; in general this should only be done on one node. Since the patch is idempotent (it's just a couple of UPDATE statements), this works fine, but it's not ideal. See my comment below.
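
To make the idempotency point concrete, here is a rough sketch of how such a patch could be generated. It is an assumption-laden illustration rather than the code in this PR: `buildAddressPatch` and `sqlQuote` are hypothetical names, the statements match rows by old address (the real patch may key on something else), and the old addresses in `main` are invented. Under these assumptions, applying the output twice is harmless as long as no new address equals another member's old address, because the second run matches no rows.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sqlQuote renders s as a single-quoted SQL string literal.
func sqlQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

// buildAddressPatch returns one UPDATE statement per changed address.
// The map goes from old address to new address; unchanged entries are skipped.
// Hypothetical helper; the "nodes" table and "address" column follow the PR text.
func buildAddressPatch(changed map[string]string) string {
	olds := make([]string, 0, len(changed))
	for old, updated := range changed {
		if old != updated {
			olds = append(olds, old)
		}
	}
	sort.Strings(olds) // Deterministic statement order.

	var b strings.Builder
	for _, old := range olds {
		fmt.Fprintf(&b, "UPDATE nodes SET address = %s WHERE address = %s;\n",
			sqlQuote(changed[old]), sqlQuote(old))
	}
	return b.String()
}

func main() {
	// Illustrative addresses only.
	fmt.Print(buildAddressPatch(map[string]string{
		"10.1.1.101:8443": "10.1.1.101:9393",
		"10.1.1.102:8443": "10.1.1.102:9393",
	}))
}
```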

LXD-1194

github-actions · 2024-07-12T23:01:30Z

Heads up @mionaalex - the "Documentation" label was applied to this issue.

doc/howto/cluster_recover.md

ru-fu

Thanks, the docs look good now!

MggMuggins · 2024-07-16T14:11:07Z

When patch.global.sql is applied on only one cluster member, database startup occasionally fails for one or (usually) more members (patch applied):

Error: Failed to initialize global database: failed to ensure schema: Failed to ensure schema: Failed to update cluster member version info: updated 0 rows instead of 1 with address "10.1.1.102:9393"

That error tracks back to https://github.com/canonical/lxd/blob/main/lxd/db/cluster/query.go#L15-L17

That query matches rows with WHERE address=?; when we load the recovery tarball, we change the value of cluster.https_address, which is used as the address parameter in that query. If patch.global.sql has not been committed yet, the query affects zero rows instead of one (the nodes table still holds the old address, while the query looks for the new one). I inserted a select * from nodes before that query to test, and it returns the old addresses when the failure happens.

While schema.Ensure (which calls updateNodeVersion) is wrapped in query.Retry, somehow this failure sometimes persists through the retry and causes the daemon to fail to start.

This isn't an issue with the current cluster edit because patch.global.sql is created on every node (Reconfigure is called on each cluster member per the docs). This ensures that the address changes are visible to each schema.Ensure transaction before updateNodeVersion is called.
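
The ordering problem can be reproduced with a toy in-memory model. This is only a sketch of the reasoning above, not LXD's database code: the `node` type and both functions are stand-ins, and the old address is invented for illustration.

```go
package main

import "fmt"

// node is a toy stand-in for a row of the global nodes table.
type node struct {
	address string
	schema  int
}

// updateNodeVersion mimics an UPDATE ... WHERE address=? followed by a
// rows-affected check, as described in the comment above.
func updateNodeVersion(nodes []node, address string, schema int) error {
	updated := 0
	for i := range nodes {
		if nodes[i].address == address {
			nodes[i].schema = schema
			updated++
		}
	}
	if updated != 1 {
		return fmt.Errorf("updated %d rows instead of 1 with address %q", updated, address)
	}
	return nil
}

// applyAddressPatch mimics the effect of patch.global.sql for one member.
func applyAddressPatch(nodes []node, old, updated string) {
	for i := range nodes {
		if nodes[i].address == old {
			nodes[i].address = updated
		}
	}
}

func main() {
	nodes := []node{{address: "10.1.1.102:8443", schema: 1}}

	// The local config already carries the new address, but the global
	// table has not been patched yet: no row matches.
	fmt.Println(updateNodeVersion(nodes, "10.1.1.102:9393", 2))

	// Once the patch is visible, the same update matches exactly one row.
	applyAddressPatch(nodes, "10.1.1.102:8443", "10.1.1.102:9393")
	fmt.Println(updateNodeVersion(nodes, "10.1.1.102:9393", 2))
}
```

Writing the patch on every member guarantees the second ordering locally, which is why the current cluster edit does not hit this failure.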

The easy solution/workaround here is simply to create patch.global.sql on each cluster member. The patch is idempotent so there's no harm in it beyond doing more work than needed on startup.

Alternatively, it may be possible to use the node ID instead of address to update the schema/api_extensions fields in the nodes table. Since we don't have access to the global database before ensuring the schema, this would require getting the ID from the local DB. I'm not sure if we can rely on the raft_nodes.id in the local DB being the same as nodes.id in the global DB.

@tomponline Let me know what you think is the more reasonable approach or if this doesn't sound right. Thanks!

lxd/cluster/recover.go

test/suites/clustering.sh

MggMuggins · 2024-07-19T21:23:21Z

I renamed the recovery tarball to lxd_recovery_db.tar.gz to avoid confusion with recovery tarballs for other Micro* daemons. I assume that LXD would not be the only service to lose quorum in a cluster recovery situation.

A fix for the same issue as in LXD: canonical/lxd#13754 (comment) Signed-off-by: Wesley Hershberger <[email protected]>

lxd/cluster/recover.go

lxd/main_cluster.go

doc/howto/cluster_recover.md

ReconfigureMembershipExt takes into consideration a node's dqlite role. Signed-off-by: Wesley Hershberger <[email protected]>

...for cluster recovery Signed-off-by: Wesley Hershberger <[email protected]>

Signed-off-by: Wesley Hershberger <[email protected]>

Since this is a somewhat arbitrary check, we should make sure to do it before we've mutated the dqlite dir state. Signed-off-by: Wesley Hershberger <[email protected]>

Signed-off-by: Wesley Hershberger <[email protected]>

`lxd cluster edit` on each node is no longer supported. Signed-off-by: Wesley Hershberger <[email protected]>

Signed-off-by: Wesley Hershberger <[email protected]>

MggMuggins added the Documentation label (Documentation needs updating) Jul 12, 2024

MggMuggins force-pushed the cluster-recovery branch 2 times, most recently from c92f17e to 7cf25f1 Compare July 13, 2024 04:51

ru-fu reviewed Jul 15, 2024

MggMuggins force-pushed the cluster-recovery branch 2 times, most recently from f34fd5c to c8f2d3d Compare July 15, 2024 20:35

MggMuggins requested a review from ru-fu July 15, 2024 20:35

MggMuggins force-pushed the cluster-recovery branch 2 times, most recently from 6743b4c to 1d49286 Compare July 15, 2024 22:28

ru-fu previously approved these changes Jul 16, 2024

MggMuggins dismissed ru-fu’s stale review via c5b3187 July 16, 2024 14:13

MggMuggins marked this pull request as ready for review July 16, 2024 14:56

tomponline reviewed Jul 19, 2024

tomponline reviewed Jul 19, 2024

MggMuggins commented Jul 19, 2024

MggMuggins commented Jul 19, 2024

MggMuggins commented Jul 19, 2024

MggMuggins force-pushed the cluster-recovery branch from c5b3187 to e5b3076 Compare July 19, 2024 21:20

MggMuggins added a commit to MggMuggins/microcluster that referenced this pull request Jul 29, 2024

internal/recover: Write the global DB patch on all nodes

a0a11ee

A fix for the same issue as in LXD: canonical/lxd#13754 (comment) Signed-off-by: Wesley Hershberger <[email protected]>

MggMuggins mentioned this pull request Jul 29, 2024

internal/recover: Write the global DB patch on all members canonical/microcluster#208

Closed