
[BUG] After rebooting Harvester nodes they forget their identity #7006

Open
matj-sag opened this issue Nov 14, 2024 · 27 comments
Labels
  • area/installer (Harvester installer)
  • candidate/v1.5.0
  • kind/bug (Issues that are defects reported by users or that we know have reached a real release)
  • reproduce/always (Reproducible 100% of the time)
  • reproduce/needed (Reminder to add a reproduce label and to remove this one)
  • severity/needed (Reminder to add a severity label and to remove this one)
  • severity/2 (Function working but has a major issue w/o workaround (a major incident with significant impact))

Comments

@matj-sag

I have a cluster of 7 identical nodes. We just moved them to another datacenter, which entailed powering them all down and back up again. Now I power them up, 1 of them has come up correctly with the right hostname, but all the other 6 are showing "Hostname: rancher", no IP address and everything in Status: NotReady, and when I go to a console it doesn't accept my node password.

The machines are all configured with a static IP address and link is up on that network interface.

Please can someone help, this is urgent, everything is down, the one node is not enough to form a cluster, so even that does not have management ready yet.

@matj-sag matj-sag added the kind/bug, reproduce/needed, and severity/needed labels Nov 14, 2024
@tserong
Contributor

tserong commented Nov 15, 2024

Is it in any way possible those other nodes somehow booted back into the installer? Just asking because the system hostname is rancher when the installer is running. If you are somehow in the installer environment, you should be able to login with username rancher and password rancher.

@matj-sag
Author

I tried that, and it doesn't work. Plus the installer has a different screen: it shows 'create/join/etc cluster' options, not 'node status: NotReady, management status: NotReady'.

@matj-sag
Author

Hey, I don't suppose anyone has any idea what might be wrong, or how I can recover the cluster?

@matj-sag
Author

Hey, I don't suppose anyone has any ideas?

@w13915984028
Member

@matj-sag Sorry to hear about this. We have the following KB, which describes a safe way to shut down a running cluster and power it on at another physical location, but unfortunately your cluster has already been powered off.

harvester/harvesterhci.io#68

For the current situation, the first step is to check whether the IPs on each node are truly the same as before:

(1) Is the management NIC correctly connected to the ToR (switch), and is the link up?
(2) If you can't log in to the node, you can do some debugging on the switch: is ARP/IP working correctly? (See the command sketch below.)
(3) The original last 3 management nodes need to come back first so that etcd can start.
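For example (a rough, generic sketch, not Harvester-specific; interface names differ per node), from a node console:

ip link show   # is the management NIC present, and is its link up (UP, LOWER_UP)?
ip addr show   # does any interface carry the expected static IP?
ip neigh show  # are ARP entries being learned for the gateway/peers?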

@matj-sag
Author

OK, I'm on a liveCD on a broken node and comparing it to the working node.

oem/harvester.config is correct on the broken node, and harvester/defaultdisk still contains all my images.

I'm not sure how to check the other things that are on /dev/sda5 though, because on the working node it appears to be mounted multiple times on different paths with different content, and I don't know how it's doing that.

@matj-sag
Author

@matj-sag Sorry to hear about this. We have the following KB, which describes a safe way to shut down a running cluster and power it on at another physical location, but unfortunately your cluster has already been powered off.

harvester/harvesterhci.io#68

For the current situation, the first step is to check whether the IPs on each node are truly the same as before:

(1) Is the management NIC correctly connected to the ToR (switch), and is the link up? (2) If you can't log in to the node, you can do some debugging on the switch: is ARP/IP working correctly? (3) The original last 3 management nodes need to come back first so that etcd can start.

Hmm, I wasn't expecting that to be necessary; I did a clean soft power-off on each node. But surely the cluster is resilient against power going out in the datacenter and uncleanly powering them all off anyway?

To confirm your questions:

  • yes, the iDRAC NIC is connected and I can get to the iDRAC on all nodes
  • yes, the same physical connection is made and link is up on all the nodes
  • the IP addresses are statically configured, but it seems the nodes aren't picking up the configuration they had before, so they are not coming up on their static IPs, nor are they recognizing their own hostnames (the new location has the same network addresses as the old location)
  • I agree we need at least 2 (presumably) of the previous management nodes to come up. Unfortunately, 6 of my 7 nodes have suffered in the same way, so I definitely don't have enough up

As I said in the other message that crossed with yours, the cluster configuration and (at least some of) the cluster data is clearly still there. I tried booting a node with the recovery option and it just dropped me into the installer (I definitely don't want to do a clean node install and wipe out my cluster data). Booting with the debug option just got me back to the same unhelpful NotReady screen with the wrong node name.

@matj-sag
Author

OK, I've found the other bind mounts under /usr/local/.state/ and they seem to contain my data still.

I think whatever magic takes the harvester config from /oem/harvester.config and applies it to the /etc overlay is just not doing its job.
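(For anyone else digging around, a quick sketch to list those bind mounts:

findmnt -rn -o TARGET,SOURCE | grep '\.state'

which shows where the persistent partition's .state directories are bind-mounted.)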

@w13915984028
Member

w13915984028 commented Nov 18, 2024

Did you try pressing F12 or CTRL+ALT+F2 on the console screen of the nodes and then logging in to the node shell? Did the password you set, or the default password, work?

If you can log in, check ip link and ip addr on those nodes; for the nodes without an IP, are the links up?

When possible, please post some screen output for us to check; at the moment we can only guess.

@matj-sag
Author

Did you try pressing F12 or CTRL+ALT+F2 on the console screen of the nodes and then logging in to the node shell? Did the password you set, or the default password, work?

If you can log in, check ip link and ip addr on those nodes; for the nodes without an IP, are the links up?

When possible, please post some screen output for us to check; at the moment we can only guess.

The node password did not work. I thought I had tried rancher, but now I can get in with rancher as the password. Looking at that now:

  • /etc/HOSTNAME is rancher (wrong), and the password for rancher in /etc/shadow doesn't match the one in harvester.config (as expected, since it let me log in with rancher)
  • /etc/sysconfig/network/ifcfg-mgmt-br doesn't exist (on my working node it contains the IPADDR)
  • /oem/harvester.config is correct
  • as a result of the IP and hostname issues, no interface is up and there is no address at all
  • the interface exists and has link (but no IP), with the same name and MAC address as listed in harvester.config

As I surmised, whatever is meant to copy the settings from harvester.config on boot has not done any of that.
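For reference, roughly the commands behind those checks (a sketch):

cat /etc/HOSTNAME
ls -l /etc/sysconfig/network/
sudo grep '^rancher:' /etc/shadow    # compare against the hash in harvester.config
sudo cat /oem/harvester.config
ip link show
ip addr show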

@matj-sag
Author

Some screenshots. The broken nodes all boot like this:
[screenshot]
The good node looks like this:
[screenshot]

@matj-sag
Author

On the broken node, when logging in as rancher, you can see from /oem/harvester.config:
[screenshot]
[screenshot]

@w13915984028
Member

Check /var/log/console.log and dmesg on each node to see whether there are early-stage errors/warnings.
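For example (a generic sketch):

dmesg | grep -iE 'error|fail|warn' | head -n 20
grep -iE 'error|fail' /var/log/console.log | head -n 20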

@matj-sag
Author

matj-sag commented Nov 18, 2024

Nothing in dmesg (already checked that one). console.log just looks like it hasn't been tailored correctly:
[screenshot]
(this is the result of grep 2024-11-14, which is when they were powered back on; those are the first lines matching that, with nothing earlier)

@w13915984028
Member

The first 3 lines of console.log have already told us something. You may check the nodes: there should be one node with firstHost:true; it is the first-installed node of the cluster.

... msg="state: {installed:false firstHost:true managementURL:}"
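For example, on each node (a sketch):

grep 'firstHost' /var/log/console.log | head -n 3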

@matj-sag
Author

If I do locate the firstHost node, what does that give me? It's in the same state.

@w13915984028
Member

w13915984028 commented Nov 18, 2024

Let's check if /oem/90_custom.yaml is still valid on each node:

run:

lsblk
ls /oem -alth
grep "mgmt-b" /oem/90_custom.yaml -2

a normally running cluster has output like this:


            - path: /etc/sysconfig/network/ifcfg-mgmt-bo
              permissions: 384
              owner: 0
...
            - path: /etc/sysconfig/network/ifcfg-mgmt-br
              permissions: 384
              owner: 0
...

yq  /oem/90_custom.yaml 
it should not have any errors

yq ".stages.initramfs[0].hostname" /oem/90_custom.yaml

yq ".stages.initramfs[0].files[-1]" /oem/90_custom.yaml 
path: /etc/sysconfig/network/ifcfg-mgmt-br
permissions: 384
owner: 0
group: 0
content: |+
  STARTMODE='onboot'
  BOOTPROTO='dhcp'
  BRIDGE='yes'
  BRIDGE_STP='off'
  BRIDGE_FORWARDDELAY='0'
  BRIDGE_PORTS='mgmt-bo'
  PRE_UP_SCRIPT="wicked:setup_bridge.sh"
  POST_UP_SCRIPT="wicked:setup_bridge.sh"




  DHCLIENT_SET_DEFAULT_ROUTE='yes'


encoding: ""
ownerstring: ""

@matj-sag
Author

Yes, 90_custom.yaml is correct on the node; yq reports no errors and returns the appropriate values for those queries.

@matj-sag
Author

What actually is the process that should have applied those?

@w13915984028
Member

In short, /oem/90_custom.yaml guides the OS to start up with that customized configuration.
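For instance, with the yq v4 syntax used above, a quick sketch to see which boot stages a file defines:

yq '.stages | keys' /oem/90_custom.yaml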

@w13915984028
Member

Take the first 2 lines of the dmesg log from your nodes:

dmesg | head -2
[ 0.000000] Linux version 5.14.21-150500.55.83-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.43.1.20240828-150100.7.49) #1 SMP PREEMPT_DYNAMIC Wed Oct 2 08:09:07 UTC 2024 (0d53847)
[ 0.000000] Command line: BOOT_IMAGE=(loop0)/boot/vmlinuz console=tty1 root=LABEL=COS_STATE cos-img/filename=/cOS/active.img panic=0 net.ifnames=1 rd.cos.oemlabel=COS_OEM rd.cos.mount=LABEL=COS_OEM:/oem rd.cos.mount=LABEL=COS_PERSISTENT:/usr/local rd.cos.oemtimeout=120 audit=1 audit_backlog_limit=8192 intel_iommu=on amd_iommu=on iommu=pt multipath=off rd.emergency=reboot rd.shell=0 panic=5 systemd.crash_reboot systemd.crash_shell=0

@matj-sag
Author

OK, I think the issue is that a different /oem file (one we added ourselves) isn't valid YAML according to yq. I'm investigating it now.

@matj-sag
Author

We followed the instructions on https://harvesterhci.io/kb/install_netapp_trident_csi/ and the /oem/99_multipathd.yaml file it asked me to create isn't valid. I've just tried it again, copying and pasting from the page, and yq still fails. I'm going to remove the file for now to get the nodes booting.

@w13915984028
Member

Remove any invalid YAML files under the /oem path.
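For example, a quick sketch to find them (assuming yq v4 as above):

for f in /oem/*.yaml; do
  yq '.' "$f" > /dev/null || echo "invalid YAML: $f"
done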

@matj-sag
Author

Yes, I've fixed that now, everything has booted and all the VMs are running again. Thanks a lot for your help.

Having figured this out, I have three requests for improvement to hopefully avoid this situation again:

  • Fix the KB article about the Trident CSI so it doesn't contain an invalid file that will stop the cluster booting
  • Provide REALLY OBVIOUS ERROR MESSAGES SOMEWHERE when the oem files are invalid, to explain what's going on so that people can fix it
  • Potentially test and ignore invalid files so that the valid ones can still apply and the cluster can come up

@w13915984028
Member

Good to hear the cluster is back.

Thanks for reporting this issue; I agree that we need to add some enhancements to avoid such an awkward state.

@w13915984028 w13915984028 added the severity/2, reproduce/always, and area/installer labels Nov 18, 2024
@matj-sag
Author

To follow up on 'fix the CSI article': several of the nodes didn't come up with multipathd running, which is what the invalid file in /oem was meant to enable, so it would be good to know how to do that correctly.
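For what it's worth, a minimal sketch of what such a drop-in could look like, modelled on the 90_custom.yaml structure shown above; the stage name and fields here are assumptions, not the KB's verified content, so re-check against the corrected article before using it:

cat > /oem/99_multipathd.yaml <<'EOF'
name: "enable multipathd"
stages:
  initramfs:            # assumed stage; the KB may use a different one
    - name: "multipathd"
      systemctl:
        enable:
          - multipathd
EOF
yq '.' /oem/99_multipathd.yaml   # confirm it parses before rebooting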
