Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated Version for Pre-Upgrade Check. #15

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

ParadoxGuitarist
Copy link

@ParadoxGuitarist ParadoxGuitarist commented Sep 18, 2024

There was some versioning for the update check.sh that seemed to be lagging behind, but after talking with support it seems like it's good for all the other versions. I added a new folder with a 1.x since it seems to be expected that it'll continue to work for 1.4 as well. A few other things this PR adds:

  • A check to see if the host running the script matches the name of one of the Control-Plane Nodes (as req in the README.md)
  • A check to see if the local-kubeconfig secret in the fleet-local namespace is missing a label that causes a known issue in the 1.2.2 > 1.3.1 upgrade.
  • A check to see if internal certs have expired that can cause the upgrade to get stuck or fail (known issue)
  • Bigger formatting with spaces between tests to help with readability. (personally was having trouble groking the output)
  • Added runtime flags
  • Logging functions and a flag (-l /path/to/file.log) will generate a log file.
  • Fixed a few grammatical statements.
  • Moved some of the printed lines to verbose (-v flag) messages to help highlight/emphasize more important information (errors and failures). (Note logs, if enabled, always contain verbose messages)
  • Updated README.md with new example and additional information.
  • Gives a summary on which sections failed at the conclusion.

My feeling won't get hurt at all if you change/alter any parts of the PR. I'm also happy to make the changes myself if you identify any.

@ParadoxGuitarist
Copy link
Author

ParadoxGuitarist commented Sep 19, 2024

Since we were waiting I saw that there's a known issue that can prevented by adding a label to a secret: kubectl label secret local-kubeconfig -n fleet-local cluster.x-k8s.io/cluster-name=local

So I went ahead and added test for that label with the suggest fix from the docs. Looks like the upstream issue is this:
harvester/harvester#6041

@ParadoxGuitarist
Copy link
Author

Adding another check for this known issue: harvester/harvester#3863

@starbops starbops self-requested a review September 23, 2024 08:33
Copy link
Member

@starbops starbops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ParadoxGuitarist, Thank you for the awesome work! It looks pretty neat. I have only some suggestions. Please kindly take a look.

pre-check/v1.x/check.sh Outdated Show resolved Hide resolved
pre-check/v1.x/check.sh Show resolved Hide resolved
pre-check/v1.x/check.sh Show resolved Hide resolved
Copy link
Member

@starbops starbops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the fix does not take effect. This is a CP node and the cluster does not have the rancher-monitoring add-on enabled.

node-0:~ # check.sh
==============================

Starting Host check...
Host Test: Pass

==============================

Starting Certificates check...
Certificates Test: Pass

==============================

Starting Node Free Space check...
Error from server (NotFound): services "rancher-monitoring-prometheus" not found
Script wasn't able to get valid response from the API.
You may need to log into each of the nodes and run 'df -h /usr/local' to ensure there's more than 30 GB of free space available.
Node-Free-Space Test: Failed

==============================

Starting Helm Bundle status check...
Helm-Bundles Test: Pass

==============================

Starting Harvester Bundle status check...
Harvester-Bundles Test: Pass

==============================

Starting Node Status check...
Node-Status Test: Pass

==============================

Starting CAPI Cluster State check...
CAPI-Cluster-State Test: Pass

==============================

Starting CAPI Machine Count check...
CAPI-Machine-Count Test: Pass

==============================

Starting CAPI Machine State check...
CAPI-Machine-State Test: Pass

==============================

Starting Longhorn Volume Health Status check...
Longhorn-Volume-Health-Status Test: Pass

==============================

Starting Stale Longhorn Volumes check...
Stale-Longhorn-Volumes Test: Pass

==============================

Starting Pod Status check...
Pod-Status Test: Pass

==============================

Starting Kubeconfig Secret check...
Kubeconfig Secret Test: Pass

==============================

WARN: There are 1 failing checks: Node-Free-Space

@ParadoxGuitarist
Copy link
Author

ParadoxGuitarist commented Oct 23, 2024

It seems the fix does not take effect. This is a CP node and the cluster does not have the rancher-monitoring add-on enabled.

node-0:~ # check.sh
==============================

Starting Host check...
Host Test: Pass

==============================

Starting Certificates check...
Certificates Test: Pass

==============================

Starting Node Free Space check...
Error from server (NotFound): services "rancher-monitoring-prometheus" not found
Script wasn't able to get valid response from the API.
You may need to log into each of the nodes and run 'df -h /usr/local' to ensure there's more than 30 GB of free space available.
Node-Free-Space Test: Failed

==============================
 ... 
WARN: There are 1 failing checks: Node-Free-Space

It failed, (which is what I think we want) but it didn't give out the correct message. Seems like that string isn't empty? Can you pull the verbose logs and see what it's trying to look up with the IP? It should be in there from this line.

@starbops
Copy link
Member

Aha, there's still an output: null

Starting Node Free Space check...
Error from server (NotFound): services "rancher-monitoring-prometheus" not found
Trying to get results from null:9090
Script wasn't able to get valid response from the API.
You may need to log into each of the nodes and run 'df -h /usr/local' to ensure there's more than 30 GB of free space available.
Note: This check requires that the script is running from a control-plane node and that rancher-monitoring Addon is enabled.
Node-Free-Space Test: Failed

Copy link
Member

@starbops starbops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, really appreciate the contribution @ParadoxGuitarist! Normally we wait for two approves before merging the PR. cc @bk201

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants