Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cluster recovery improvements #13754

Merged
merged 16 commits into from
Sep 4, 2024
Merged

Conversation

MggMuggins
Copy link
Contributor

@MggMuggins MggMuggins commented Jul 12, 2024

Fixes #13524

A detailed description of the problem this resolves is in the issue. This PR:

  • lxd cluster edit:
    • Prompts the user with a warning & link to the docs prior to cluster edit
    • Includes more information about member roles in the yaml during editing
    • Generates a tarball (/var/snap/lxd/common/lxd/database/recovery_db.tar.gz) with the contents of the database & the new raft configuration as yaml
    • Creates a patch.global.sql to update the addresses of any nodes that were changed in the global nodes table
  • On Daemon startup, looks for & loads a recovery tarball at /var/snap/lxd/common/lxd/database/recovery_db.tar.gz and:
    • Replaces the existing database contents with the incoming contents
    • Updates cluster member addresses in the local DB's raft_nodes table

Currently the unpack code also creates the global sql patch on each node; in general this should only be done on one node. Since the patch is idempotent (it's just a couple of UPDATE), it works fine, but it's unideal. See my comment below.

LXD-1194

@MggMuggins MggMuggins added the Documentation Documentation needs updating label Jul 12, 2024
Copy link

github-actions bot commented Jul 12, 2024

Heads up @mionaalex - the "Documentation" label was applied to this issue.

@MggMuggins MggMuggins force-pushed the cluster-recovery branch 2 times, most recently from c92f17e to 7cf25f1 Compare July 13, 2024 04:51
doc/howto/cluster_recover.md Outdated Show resolved Hide resolved
doc/howto/cluster_recover.md Outdated Show resolved Hide resolved
doc/howto/cluster_recover.md Outdated Show resolved Hide resolved
@MggMuggins MggMuggins force-pushed the cluster-recovery branch 2 times, most recently from f34fd5c to c8f2d3d Compare July 15, 2024 20:35
@MggMuggins MggMuggins requested a review from ru-fu July 15, 2024 20:35
@MggMuggins MggMuggins force-pushed the cluster-recovery branch 2 times, most recently from 6743b4c to 1d49286 Compare July 15, 2024 22:28
ru-fu
ru-fu previously approved these changes Jul 16, 2024
Copy link
Contributor

@ru-fu ru-fu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, the docs look good now!

@MggMuggins
Copy link
Contributor Author

When applying patch.global.sql on only one cluster member, occasionally database startup fails for one or (usually) more members (patch applied).

Error: Failed to initialize global database: failed to ensure schema: Failed to ensure schema: Failed to update cluster member version info: updated 0 rows instead of 1 with address "10.1.1.102:9393"

That error tracks back to https://github.com/canonical/lxd/blob/main/lxd/db/cluster/query.go#L15-L17

That query selects WHERE address=?; when we load the recovery tarball, we change the value of cluster.https_address, which is used as the address parameter in that query. If patch.global.sql has not been committed yet, the query affects zero rows instead of one (as it is looking for the old address). I inserted a select * from nodes before that query to test and it returns the old addresses when the failure happens.

While schema.Ensure (which calls updateNodeVersion) is wrapped in query.Retry, somehow this failure sometimes persists through the retry and causes the daemon to fail to start.

This isn't an issue with the current cluster edit because patch.global.sql is created on every node (Reconfigure is called on each cluster member per the docs). This ensures that the address changes are visible to each schema.Ensure transaction before updateNodeVersion is called.

The easy solution/workaround here is simply to create patch.global.sql on each cluster member. The patch is idempotent so there's no harm in it beyond doing more work than needed on startup.

Alternatively, it may be possible to use the node ID instead of address to update the schema/api_extensions fields in the nodes table. Since we don't have access to the global database before ensuring the schema, this would require getting the ID from the local DB. I'm not sure if we can rely on the raft_nodes.id in the local DB being the same as nodes.id in the global DB.

@tomponline Let me know what you think is the more reasonable approach or if this doesn't sound right. Thanks!

@MggMuggins MggMuggins marked this pull request as ready for review July 16, 2024 14:56
lxd/cluster/recover.go Outdated Show resolved Hide resolved
lxd/cluster/recover.go Outdated Show resolved Hide resolved
lxd/cluster/recover.go Outdated Show resolved Hide resolved
@MggMuggins
Copy link
Contributor Author

I renamed the recovery tarball to lxd_recovery_db.tar.gz so as to prevent confusion with other recovery tarballs for other Micro* daemons. I assume that LXD would not be the only service to lose quorum in a cluster recovery situation.

MggMuggins added a commit to MggMuggins/microcluster that referenced this pull request Jul 29, 2024
A fix for the same issue as in LXD: canonical/lxd#13754 (comment)

Signed-off-by: Wesley Hershberger <[email protected]>
lxd/cluster/recover.go Outdated Show resolved Hide resolved
lxd/cluster/recover.go Outdated Show resolved Hide resolved
lxd/main_cluster.go Outdated Show resolved Hide resolved
lxd/main_cluster.go Outdated Show resolved Hide resolved
lxd/main_cluster.go Outdated Show resolved Hide resolved
@MggMuggins MggMuggins force-pushed the cluster-recovery branch 2 times, most recently from 394ba76 to 69c9a14 Compare August 29, 2024 22:24
ReconfigureMembershipExt takes into consideration a node's dqlite role.

Signed-off-by: Wesley Hershberger <[email protected]>
...for cluster recovery

Signed-off-by: Wesley Hershberger <[email protected]>
Since this is a somewhat arbitrary check, we should make sure to do it
before we've mutated the dqlite dir state.

Signed-off-by: Wesley Hershberger <[email protected]>
`lxd cluster edit` on each node is no longer supported.

Signed-off-by: Wesley Hershberger <[email protected]>
@tomponline tomponline changed the title Cluster recovery Cluster recovery improvements Sep 4, 2024
@tomponline tomponline merged commit 78e006e into canonical:main Sep 4, 2024
29 checks passed
@MggMuggins MggMuggins deleted the cluster-recovery branch September 4, 2024 14:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation Documentation needs updating
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cluster Recovery Process misuses dqlite
3 participants