
[BUG] After rebooting Harvester nodes they forget their identity #7006

Open
matj-sag opened this issue Nov 14, 2024 · 27 comments
Labels
  • area/installer (Harvester installer)
  • candidate/v1.5.0
  • kind/bug (Issues that are defects reported by users or that we know have reached a real release)
  • reproduce/always (Reproducible 100% of the time)
  • reproduce/needed (Reminder to add a reproduce label and to remove this one)
  • severity/needed (Reminder to add a severity label and to remove this one)
  • severity/2 (Function working but has a major issue w/o workaround (a major incident with significant impact))

Comments

@matj-sag

I have a cluster of 7 identical nodes. We just moved them to another datacenter, which entailed powering them all down and back up again. Now I power them up, 1 of them has come up correctly with the right hostname, but all the other 6 are showing "Hostname: rancher", no IP address and everything in Status: NotReady, and when I go to a console it doesn't accept my node password.

The machines are all configured with a static IP address and link is up on that network interface.

Please can someone help, this is urgent, everything is down, the one node is not enough to form a cluster, so even that does not have management ready yet.

@matj-sag matj-sag added the kind/bug, reproduce/needed, and severity/needed labels Nov 14, 2024
@tserong
Contributor

tserong commented Nov 15, 2024

Is it in any way possible those other nodes somehow booted back into the installer? Just asking because the system hostname is rancher when the installer is running. If you are somehow in the installer environment, you should be able to login with username rancher and password rancher.

@matj-sag
Author

I tried that, and it doesn't work. Plus the installer has a different screen: it shows 'create/join/etc cluster' options, not 'node status: NotReady, management status: NotReady'.

@matj-sag
Author

Hey, I don't suppose anyone has any idea what might be wrong, or how I can recover the cluster?

@matj-sag
Author

Hey, I don't suppose anyone has any ideas?

@w13915984028
Member

@matj-sag Sorry to hear about this. We have the following KB, which describes a safe way to shut down a running cluster and power it on at another physical location, but unfortunately your cluster has already been powered off.

harvester/harvesterhci.io#68

For the current situation, the first step is to check whether the IPs on each node are truly the same as before:

(1) Is the management NIC correctly connected to the ToR (switch), and is the link up?
(2) If you can't log in to the node, you can do some debugging on the switch: is ARP/IP working correctly? (See the command sketch below.)
(3) The original last 3 management nodes need to come back first so that etcd can start.
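For example (a rough, generic sketch, not Harvester-specific; interface names differ per node), from a node console:

ip link show   # is the management NIC present, and is its link up (UP, LOWER_UP)?
ip addr show   # does any interface carry the expected static IP?
ip neigh show  # are ARP entries being learned for the gateway/peers?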

@matj-sag
Author

OK, I'm on a liveCD on a broken node and comparing it to the working node.

oem/harvester.config is correct on the broken node, and harvester/defaultdisk still contains all my images.

I'm not sure how to check the other things that are on /dev/sda5 though, because on the working node it appears to be mounted multiple times on different paths with different content, and I don't know how it's doing that.

@matj-sag
Author

@matj-sag Sorry to hear about this. We have the following KB, which describes a safe way to shut down a running cluster and power it on at another physical location, but unfortunately your cluster has already been powered off.

harvester/harvesterhci.io#68

For the current situation, the first step is to check whether the IPs on each node are truly the same as before:

(1) Is the management NIC correctly connected to the ToR (switch), and is the link up? (2) If you can't log in to the node, you can do some debugging on the switch: is ARP/IP working correctly? (3) The original last 3 management nodes need to come back first so that etcd can start.

Hmm, I wasn't expecting that to be necessary; I did a clean soft power-off on each node. But surely the cluster is resilient against power going out in the datacenter and uncleanly powering them all off anyway?

To confirm your questions:

  • yes, the iDRAC NIC is connected and I can get to the iDRAC on all nodes
  • yes, the same physical connection is made and link is up on all the nodes
  • the IP addresses are statically configured, but it seems the nodes aren't picking up the configuration they had before, so they are not coming up on their static IPs, nor are they recognizing their own hostnames (the new location has the same network addresses as the old location)
  • I agree we need at least 2 (presumably) of the previous management nodes to come up. Unfortunately, 6 of my 7 nodes have suffered in the same way, so I definitely don't have enough up

As I said in the other message that crossed with yours, the cluster configuration and (at least some of) the cluster data is clearly still there. I tried booting a node with the recovery option and it just dropped me into the installer (I definitely don't want to do a clean node install and wipe out my cluster data). Booting with the debug option just got me back to the same unhelpful NotReady screen with the wrong node name.

@matj-sag
Author

OK, I've found the other bind mounts under /usr/local/.state/ and they seem to contain my data still.

I think whatever magic takes the harvester config from /oem/harvester.config and applies it to the /etc overlay is just not doing its job.
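(For anyone else digging around, a quick sketch to list those bind mounts:

findmnt -rn -o TARGET,SOURCE | grep '\.state'

which shows where the persistent partition's .state directories are bind-mounted.)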

@w13915984028
Member

w13915984028 commented Nov 18, 2024

Did you try pressing F12 or CTRL+ALT+F2 on the console screen of the nodes and then logging in to the node shell? Did the password you set, or the default password, work?

If you can log in, check ip link and ip addr on those nodes; for the nodes without an IP, are the links up?

When possible, please post some screen output for us to check; at the moment we can only guess.

@matj-sag
Author

Did you try pressing F12 or CTRL+ALT+F2 on the console screen of the nodes and then logging in to the node shell? Did the password you set, or the default password, work?

If you can log in, check ip link and ip addr on those nodes; for the nodes without an IP, are the links up?

When possible, please post some screen output for us to check; at the moment we can only guess.

The node password did not work. I thought I had tried rancher, but now I can get in with rancher as the password. Looking at that now:

  • /etc/HOSTNAME is rancher (wrong), and the password for rancher in /etc/shadow doesn't match the one in harvester.config (as expected, since it let me log in with rancher)
  • /etc/sysconfig/network/ifcfg-mgmt-br doesn't exist (on my working node it contains the IPADDR)
  • /oem/harvester.config is correct
  • as a result of the IP and hostname issues, no interface is up and there is no address at all
  • the interface exists and has link (but no IP), with the same name and MAC address as listed in harvester.config

As I surmised, whatever is meant to copy the settings from harvester.config on boot has not done any of that.
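For reference, roughly the commands behind those checks (a sketch):

cat /etc/HOSTNAME
ls -l /etc/sysconfig/network/
sudo grep '^rancher:' /etc/shadow    # compare against the hash in harvester.config
sudo cat /oem/harvester.config
ip link show
ip addr show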

@matj-sag
Author

Some screenshots. The broken nodes all boot like this:
[screenshot]
The good node looks like this:
[screenshot]

@matj-sag
Author

On the broken node, when logging in as rancher, you can see from /oem/harvester.config:
[screenshot]
[screenshot]

@w13915984028
Member

Check /var/log/console.log and dmesg on each node to see whether there are early-stage errors/warnings.
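For example (a generic sketch):

dmesg | grep -iE 'error|fail|warn' | head -n 20
grep -iE 'error|fail' /var/log/console.log | head -n 20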

@matj-sag
Author

matj-sag commented Nov 18, 2024

Nothing in dmesg (already checked that one). console.log just looks like it hasn't been tailored correctly:
[screenshot]
(this is the result of grep 2024-11-14, which is when they were powered back on; those are the first lines matching that, with nothing earlier)

@w13915984028
Member

The first 3 lines of console.log have already told us something. You may check the nodes: there should be one node with firstHost:true; it is the first-installed node of the cluster.

... msg="state: {installed:false firstHost:true managementURL:}"
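For example, on each node (a sketch):

grep 'firstHost' /var/log/console.log | head -n 3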

@matj-sag
Author

If I do locate the firstHost node, what does that give me? It's in the same state.

@w13915984028
Member

w13915984028 commented Nov 18, 2024

Let's check if /oem/90_custom.yaml is still valid on each node:

run:

lsblk
ls /oem -alth
grep "mgmt-b" /oem/90_custom.yaml -2

a normally running cluster has output like this:


            - path: /etc/sysconfig/network/ifcfg-mgmt-bo
              permissions: 384
              owner: 0
...
            - path: /etc/sysconfig/network/ifcfg-mgmt-br
              permissions: 384
              owner: 0
...

yq  /oem/90_custom.yaml 
it should not have any errors

yq ".stages.initramfs[0].hostname" /oem/90_custom.yaml

yq ".stages.initramfs[0].files[-1]" /oem/90_custom.yaml 
path: /etc/sysconfig/network/ifcfg-mgmt-br
permissions: 384
owner: 0
group: 0
content: |+
  STARTMODE='onboot'
  BOOTPROTO='dhcp'
  BRIDGE='yes'
  BRIDGE_STP='off'
  BRIDGE_FORWARDDELAY='0'
  BRIDGE_PORTS='mgmt-bo'
  PRE_UP_SCRIPT="wicked:setup_bridge.sh"
  POST_UP_SCRIPT="wicked:setup_bridge.sh"




  DHCLIENT_SET_DEFAULT_ROUTE='yes'


encoding: ""
ownerstring: ""

@matj-sag
Author

Yes, 90_custom.yaml is correct on the node; yq reports no errors and returns the appropriate values for those queries.

@matj-sag
Author

What actually is the process that should have applied those?

@w13915984028
Member

In short, /oem/90_custom.yaml guides the OS to start up with that customized configuration.
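For instance, with the yq v4 syntax used above, a quick sketch to see which boot stages a file defines:

yq '.stages | keys' /oem/90_custom.yaml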

@w13915984028
Member

Take the first 2 lines of the dmesg log from your nodes:

dmesg | head -2
[ 0.000000] Linux version 5.14.21-150500.55.83-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.43.1.20240828-150100.7.49) #1 SMP PREEMPT_DYNAMIC Wed Oct 2 08:09:07 UTC 2024 (0d53847)
[ 0.000000] Command line: BOOT_IMAGE=(loop0)/boot/vmlinuz console=tty1 root=LABEL=COS_STATE cos-img/filename=/cOS/active.img panic=0 net.ifnames=1 rd.cos.oemlabel=COS_OEM rd.cos.mount=LABEL=COS_OEM:/oem rd.cos.mount=LABEL=COS_PERSISTENT:/usr/local rd.cos.oemtimeout=120 audit=1 audit_backlog_limit=8192 intel_iommu=on amd_iommu=on iommu=pt multipath=off rd.emergency=reboot rd.shell=0 panic=5 systemd.crash_reboot systemd.crash_shell=0

@matj-sag
Author

OK, I think the issue is that a different /oem file (one we added ourselves) isn't valid YAML according to yq. I'm investigating it now.

@matj-sag
Author

We followed the instructions on https://harvesterhci.io/kb/install_netapp_trident_csi/ and the /oem/99_multipathd.yaml file it asked me to create isn't valid. I've just tried it again, copying and pasting from the page, and yq still fails. I'm going to remove the file for now to get the nodes booting.

@w13915984028
Member

Remove any invalid YAML files under the /oem path.
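For example, a quick sketch to find them (assuming yq v4 as above):

for f in /oem/*.yaml; do
  yq '.' "$f" > /dev/null || echo "invalid YAML: $f"
done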

@matj-sag
Author

Yes, I've fixed that now, everything has booted and all the VMs are running again. Thanks a lot for your help.

Having figured this out, I have three requests for improvement to hopefully avoid this situation again:

  • Fix the KB article about the Trident CSI so it doesn't contain an invalid file that will stop the cluster booting
  • Provide REALLY OBVIOUS ERROR MESSAGES SOMEWHERE when the oem files are invalid, to explain what's going on so that people can fix it
  • Potentially test and ignore invalid files so that the valid ones can still apply and the cluster can come up

@w13915984028
Member

Good to hear the cluster is back.

Thanks for reporting this issue; I agree that we need to add some enhancements to avoid such an awkward state.

@w13915984028 w13915984028 added the severity/2, reproduce/always, and area/installer labels Nov 18, 2024
@matj-sag
Author

To follow up on 'fix the CSI article': several of the nodes didn't come up with multipathd running, which is what the invalid file in /oem was meant to enable, so it would be good to know how to do that correctly.
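For what it's worth, a minimal sketch of what such a drop-in could look like, modelled on the 90_custom.yaml structure shown above; the stage name and fields here are assumptions, not the KB's verified content, so re-check against the corrected article before using it:

cat > /oem/99_multipathd.yaml <<'EOF'
name: "enable multipathd"
stages:
  initramfs:            # assumed stage; the KB may use a different one
    - name: "multipathd"
      systemctl:
        enable:
          - multipathd
EOF
yq '.' /oem/99_multipathd.yaml   # confirm it parses before rebooting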
