[BUG] After rebooting Harvester nodes they forget their identity #7006
Is it in any way possible those other nodes somehow booted back into the installer? Just asking because the system hostname is `rancher`.
I tried that, and it isn't the installer. The installer has a different screen: it shows 'create/join/etc cluster', not 'node status: NotReady, management status: NotReady'.
Hey, I don't suppose anyone has any idea what might be wrong, or how I can recover the cluster?
Hey, I don't suppose anyone has any ideas?
@matj-sag Sorry to hear about this. We have the following KB, which describes a safe way to shut down a running cluster and power it on in another physical location, but unfortunately your cluster has already been powered off. For the current situation, the first step is to check whether the IPs on each node are truly the same as before: (1) Is the management NIC correctly connected to the TOR (switch)? Is the link up?
OK, I'm on a liveCD on a broken node, comparing it to the working node. oem/harvester.config is correct on the broken node, and harvester/defaultdisk still contains all my images. I'm not sure how to check the other things on /dev/sda5, though, because on the working node it appears to be mounted multiple times at different paths with different content, and I don't know how it does that.
Hmm, I wasn't expecting that to be necessary; I did a clean soft power-off on each node. But surely the cluster is resilient against power going out in the datacenter and uncleanly powering them all off anyway? To answer your questions:
As I said in the other message that crossed with yours, the configuration of the cluster and (at least some of) the cluster data is clearly still there. I tried booting a node with the recovery option, and it just dropped me into the installer (I definitely don't want to do a clean node install and wipe out my cluster data). Booting with the debug option just got me back to the same unhelpful NotReady screen with the wrong node name.
OK, I've found the other bind mounts under /usr/local/.state/, and they still seem to contain my data. I think whatever magic takes the harvester config from oem/harvester.config and applies it to the /etc overlay is just not doing it right.
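For anyone hitting the same layout, the multiple mounts of the data partition can be listed with standard tools. This is only a sketch: `/dev/sda5` and the paths are taken from this thread and will differ per machine.

```shell
# Show every mountpoint backed by the data partition mentioned above
# (sda5 is from this thread; substitute your own device). "|| true"
# keeps the check from aborting when grep finds nothing.
findmnt -no TARGET,SOURCE | grep 'sda5' || true

# The persistent state that gets bind-mounted over /etc and friends
# lives under /usr/local/.state on these nodes.
ls /usr/local/.state 2>/dev/null || true
```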
Did you try to run ...? If you can log in, check `ip link` and `ip addr` on those nodes; for the nodes without an IP, are the links up? When possible, please post some screen output for us to check; at the moment we can only guess.
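The `ip link` / `ip addr` checks suggested here can be run from the node console or a liveCD, for example:

```shell
# Brief one-line-per-interface view. The interface name to look for
# comes from your harvester.config; it varies per machine.
ip -br link || true     # is the management NIC state UP?
ip -br addr || true     # was the static IP actually applied?
```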
The node password did not work. I thought I had tried rancher, but now I can get in with rancher as the password. Looking at that now:
As I surmised - whatever is meant to copy from harvester.config on boot has not done any of that |
Check ...
The first three lines of console.log already tell us something. You may check the nodes; there should be one node with ...
If I do locate the firstHost, what does that give me? It's in the same state.
Let's check: if you run ..., a normal running cluster has output like this: ...
Yes, 90_custom.yaml is correct on the node; yq reports no errors and can return the appropriate values for those queries.
What actually is the process that should have applied those?
In short, the ...
Take the first two lines of the dmesg log from your nodes:
|
OK, I think the issue is that a different /oem file (one we added ourselves) isn't valid YAML according to yq. I'm investigating it now.
We followed the instructions at https://harvesterhci.io/kb/install_netapp_trident_csi/ and the /oem/99_multipathd.yaml file it asked me to create isn't valid. I've just tried again, copying and pasting from the page, and yq still fails. I'm going to remove the file for now to get the nodes booting.
Remove any invalid YAML files under ...
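A minimal sketch of that check, so a bad drop-in file is caught before the next reboot. The function name is mine, not Harvester's; yq is assumed to be available, as it is on the nodes in this thread.

```shell
# Report every YAML file under the given directory (default /oem)
# that yq cannot parse. An unparseable file there is what stopped
# the boot-time config from being applied in this thread.
check_oem_yaml() {
  dir="${1:-/oem}"
  for f in "$dir"/*.yaml; do
    [ -e "$f" ] || continue          # glob matched nothing
    if ! yq '.' "$f" >/dev/null 2>&1; then
      echo "invalid YAML: $f"
    fi
  done
}

# On a node you would run: check_oem_yaml /oem
```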
Yes, I've fixed that now; everything has booted and all the VMs are running again. Thanks a lot for your help. Having figured this out, I have three requests for improvement to hopefully avoid this situation again:
|
Good to hear the cluster is back. Thanks for reporting this issue; I agree that we need to add some enhancements to avoid such an awkward state.
To follow up on 'fix the CSI article': several of the nodes didn't come up with multipathd running, which is what the invalid file in /oem was meant to enable, so it would be good to know how to do that correctly.
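In the meantime, a way to confirm the service state on each node (the service name "multipathd" is taken from the KB article; how Harvester should enable it correctly is exactly what this follow-up asks):

```shell
# Is multipathd running and enabled on this node? "|| true" keeps
# the checks from aborting a script when the unit is stopped/absent.
systemctl is-active multipathd || true
systemctl is-enabled multipathd || true

# Recent log lines often show why the unit failed to start.
journalctl -u multipathd --no-pager -n 20 || true
```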
I have a cluster of 7 identical nodes. We just moved them to another datacenter, which entailed powering them all down and back up again. After powering them up, one node came up correctly with the right hostname, but the other six show "Hostname: rancher", have no IP address, report everything as Status: NotReady, and their consoles don't accept my node password.
The machines are all configured with a static IP address and link is up on that network interface.
Please can someone help; this is urgent. Everything is down, and the one healthy node is not enough to form a cluster, so even that node does not have management ready yet.