[BUG] On v1.2.1 -> v1.2-head -> v1.3.0-rc4 Harvester Airgapped, RKE1 (v1.26.X -> v1.27.10) Node Fails to Update To Harvester Cloud Provider: harvester-cloud-provider:102.0.2+up0.2.3 on #5348
Comments
I can reproduce this even with Rancher v2.7.10 with RKE1 v1.26.14 + harvester-cloud-provider 102.0.1+up0.1.14 on a Harvester v1.3.0-rc4 cluster. After initiating the upgrade of harvester-cloud-provider (targeting 102.0.2+up0.2.3), the Helm upgrade job cannot finish. Checking the newly spawned pod (with container image v0.2.0), it seems that the new pod cannot be scheduled to any of the nodes:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-03-13T04:01:09Z"
  generateName: harvester-cloud-provider-65556b59c9-
  labels:
    app.kubernetes.io/instance: harvester-cloud-provider
    app.kubernetes.io/name: harvester-cloud-provider
    pod-template-hash: 65556b59c9
  name: harvester-cloud-provider-65556b59c9-228t5
  namespace: kube-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: harvester-cloud-provider-65556b59c9
    uid: 8e2f0fa9-3284-41ca-af36-bc6e1fbcef0e
  resourceVersion: "5134"
  uid: a645176c-14bf-4b85-b64c-ca0f961b94f9
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - harvester-cloud-provider
        topologyKey: kubernetes.io/hostname
  containers:
  - args:
    - --cloud-config=/etc/kubernetes/cloud-config
    - --cluster-name=frisbee
    command:
    - harvester-cloud-provider
    image: rancher/harvester-cloud-provider:v0.2.0
    imagePullPolicy: IfNotPresent
    name: harvester-cloud-provider
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/kubernetes/cloud-config
      name: cloud-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-7hmj2
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  nodeSelector:
    kubernetes.io/os: linux
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: harvester-cloud-provider
  serviceAccountName: harvester-cloud-provider
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node.cloudprovider.kubernetes.io/uninitialized
    operator: Equal
    value: "true"
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
    operator: Equal
  - effect: NoExecute
    key: node-role.kubernetes.io/etcd
    operator: Equal
  - effect: NoSchedule
    key: cattle.io/os
    operator: Equal
    value: linux
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - hostPath:
      path: /etc/kubernetes/cloud-config
      type: File
    name: cloud-config
  - name: kube-api-access-7hmj2
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-03-13T04:01:09Z"
    message: '0/1 nodes are available: 1 node(s) didn''t match pod anti-affinity rules.
      preemption: 0/1 nodes are available: 1 No preemption victims found for incoming
      pod..'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: BestEffort
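The status above already records the scheduling failure; on a live cluster, a hedged way to surface the same information from events (the pod name is taken from the manifest above and will differ per cluster) might be:

$ kubectl -n kube-system describe pod harvester-cloud-provider-65556b59c9-228t5
# The Events section should show a FailedScheduling entry with the same
# "didn't match pod anti-affinity rules" message as in status.conditions above.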
This is because it's essentially a single-node RKE1 cluster: the scheduler cannot violate the pod anti-affinity rule while the old pod is still running.

$ kubectl -n kube-system get po -l app.kubernetes.io/name=harvester-cloud-provider
NAME                                        READY   STATUS    RESTARTS   AGE
harvester-cloud-provider-65556b59c9-228t5   0/1     Pending   0          13m
harvester-cloud-provider-856bcf647c-lqzln   1/1     Running   0          26m

So after 10 minutes, the Helm upgrade job failed. Wondering if @irishgordo observed the same situation at that moment? The weird parts in the original issue:
I cannot reproduce it with the following steps:
The upgrade of the harvester-cloud-provider chart was indeed unsuccessful, and the Helm upgrade job failed with the same error message as the original one. However, I didn't observe the same phenomenon: the harvester-cloud-provider's image tag was changed to v0.2.0. It's almost the same as my findings in the comment above. The new harvester-cloud-provider pod cannot be scheduled to the node because of the pod anti-affinity rule and remains Pending. FWIW, removing the old harvester-cloud-provider pod makes the deployment healthy again. Then we need to apply the workaround mentioned here to make sure the load-balancer functionality works as expected (this is fixed in harvester-cloud-provider chart 0.2.4).
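A hedged sketch of that workaround (the pod name below is the old ReplicaSet's pod from my environment and will differ per cluster):

$ # Delete the still-running old pod so the pending v0.2.0 pod can be scheduled
$ kubectl -n kube-system delete pod harvester-cloud-provider-856bcf647c-lqzln
$ # The deployment should then converge on the new ReplicaSet
$ kubectl -n kube-system rollout status deploy/harvester-cloud-provider
$ kubectl -n kube-system get po -l app.kubernetes.io/name=harvester-cloud-provider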
@starbops - thanks so much for the investigation 😄 But yes, the workaround is very efficient. In general the steps are:
@starbops should I leave this open to track, or should I go ahead and close this out?
@irishgordo Thanks for clarifying! Since this will only happen on single-node RKE1 guest clusters, and we've figured out the workaround, we could lower the severity and see if adding a descriptive note in the documentation for addressing such a situation with a workaround is enough. cc @bk201
@starbops - that sounds good 😄 👍 - I've lowered the severity.
We can add the deployment strategy Recreate.

I found that adjusting the default rolling update strategy is better than changing it to the Recreate type.
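A hedged sketch of what such an adjustment could look like on the Deployment (the exact values shipped in the chart fix may differ): allowing the single replica to become unavailable during the rollout lets the old pod be removed before the new one is scheduled, which avoids the anti-affinity deadlock on a single-node cluster.

spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # the old pod may be terminated before the replacement exists
      maxSurge: 0         # never run the old and new pods at the same time
# Alternatively, strategy.type: Recreate also terminates the old pod first,
# at the cost of a short window with no cloud-provider pod running.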
Pre Ready-For-Testing Checklist
Automation e2e test issue: harvester/tests#1326
Test PASS with Rancher Prime.

Note that this issue is fixed in

Environment
Prerequisite
Test Steps

Footnotes
thanks for this check @albinsun 👍
Describe the bug
On an air-gapped, upgraded setup, the Harvester Cloud Provider fails to update on an RKE1 node at v1.27.10.
Encountering:
The chart stays stuck using:
airgap-docker-registry.192.168.120.236.sslip.io:5000/rancher/harvester-cloud-provider:v0.1.5
But with:
The image needed for the new chart is visible.
Additionally:
rancher/mirrored-kube-vip-kube-vip-iptables is present in the Docker registry
And the .tgz can be downloaded correctly from:
https://rancher-extra.192.168.120.196.sslip.io/k8s/clusters/c-4nfvl/v1/catalog.cattle.io.clusterrepos/rancher-charts?chartName=harvester-cloud-provider&link=chart&version=102.0.2%2Bup0.2.3
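A hedged sketch of how these artifacts might be verified from the guest-cluster side (the registry host is taken from the image reference above; the Docker Registry v2 tag-list endpoint is standard, while the deployment name and namespace are assumed to match the chart defaults):

$ # Check which image the deployment is actually running
$ kubectl -n kube-system get deploy harvester-cloud-provider \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
$ # Confirm the required tags exist in the air-gapped registry
$ curl -sk https://airgap-docker-registry.192.168.120.236.sslip.io:5000/v2/rancher/harvester-cloud-provider/tags/list
$ curl -sk https://airgap-docker-registry.192.168.120.236.sslip.io:5000/v2/rancher/mirrored-kube-vip-kube-vip-iptables/tags/list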
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The Harvester Cloud Provider chart should update successfully on an air-gapped, upgraded v1.3.0-rc4 Harvester cluster with Rancher v2.7.11 and a v1.27.X RKE1-based node running on Harvester.
Environment
Workaround
Additional context