
[BUG] On v1.2.1 -> v1.2-head -> v1.3.0-rc4 Air-gapped Harvester, RKE1 (v1.26.X -> v1.27.10) Node Fails to Update to Harvester Cloud Provider: harvester-cloud-provider:102.0.2+up0.2.3 #5348

Closed
irishgordo opened this issue Mar 12, 2024 · 11 comments
Labels
area/cloud-provider Harvester cloud provider for guest cluster area/rancher Rancher related issues kind/bug Issues that are defects reported by users or that we know have reached a real release regression reproduce/always Reproducible 100% of the time require/doc Improvements or additions to documentation severity/3 Function working but has a major issue w/ workaround

Comments

@irishgordo
Contributor

Describe the bug
On an air-gapped, upgraded setup, the Harvester Cloud Provider fails to update on an RKE1 node at v1.27.10.
Encountering:

0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..

The chart stays stuck using:
airgap-docker-registry.192.168.120.236.sslip.io:5000/rancher/harvester-cloud-provider:v0.1.5

But with:

╭─mike at suse-workstation-team-harvester in ~
╰─○ curl -k https://airgap-docker-registry.192.168.120.236.sslip.io:5000/v2/rancher/harvester-cloud-provider/tags/list | jq  
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    80  100    80    0     0    998      0 --:--:-- --:--:-- --:--:--  1000
{
  "name": "rancher/harvester-cloud-provider",
  "tags": [
    "v0.2.0",
    "v0.1.4",
    "v0.1.5"
  ]
}

The image needed for the new chart is visible in the registry.
Additionally:

╭─mike at suse-workstation-team-harvester in ~
╰─○ curl -k https://airgap-docker-registry.192.168.120.236.sslip.io:5000/v2/rancher/mirrored-kube-vip-kube-vip-iptables/tags/list | jq 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    73  100    73    0     0    933      0 --:--:-- --:--:-- --:--:--   935
{
  "name": "rancher/mirrored-kube-vip-kube-vip-iptables",
  "tags": [
    "v0.6.0"
  ]
}

rancher/mirrored-kube-vip-kube-vip-iptables is present in the Docker registry.

And the .tgz can be downloaded correctly from:
https://rancher-extra.192.168.120.196.sslip.io/k8s/clusters/c-4nfvl/v1/catalog.cattle.io.clusterrepos/rancher-charts?chartName=harvester-cloud-provider&link=chart&version=102.0.2%2Bup0.2.3
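
For reference, a minimal sketch of pulling the chart archive with curl. The Bearer token header is an assumption (Rancher's API normally requires authentication) and $RANCHER_TOKEN is a placeholder:

curl -k -H "Authorization: Bearer $RANCHER_TOKEN" \
  -o harvester-cloud-provider-102.0.2+up0.2.3.tgz \
  "https://rancher-extra.192.168.120.196.sslip.io/k8s/clusters/c-4nfvl/v1/catalog.cattle.io.clusterrepos/rancher-charts?chartName=harvester-cloud-provider&link=chart&version=102.0.2%2Bup0.2.3"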

To Reproduce
Steps to reproduce the behavior:

  1. Have a 3-node air-gapped cluster with all appropriate integrations air-gapped, then upgrade from v1.2.1 to v1.2-head to v1.3.0-rc4
  2. Upgrade your RKE1 cluster from v1.26.X to v1.27.X
  3. Try to install the new harvester-cloud-provider chart

Expected behavior
The Harvester Cloud Provider chart can be updated on an air-gapped, upgraded v1.3.0-rc4 Harvester with Rancher v2.7.11 and a v1.27.X RKE1-based node running on Harvester.

Environment

  • Harvester ISO version: v1.3.0-rc4 (prior v1.2-head, prior v1.2.1)
  • Underlying Infrastructure: bare-metal HP DL160

Workaround

  • Rolling back to the previous version is successful

Additional context

helm upgrade --cleanup-on-fail=true --history-max=5 --install=true --namespace=kube-system --timeout=5m0s --values=/home/shell/helm/values-harvester-cloud-provider-102.0.2-up0.2.3.yaml --version=102.0.2+up0.2.3 --wait=true harvester-cloud-provider /home/shell/helm/harvester-cloud-provider-102.0.2-up0.2.3.tgz
checking 8 resources for changes
Looks like there are no changes for ServiceAccount "kube-vip"
Looks like there are no changes for ServiceAccount "harvester-cloud-provider"
Looks like there are no changes for ClusterRole "kube-vip"
Looks like there are no changes for ClusterRole "harvester-cloud-provider"
Looks like there are no changes for ClusterRoleBinding "kube-vip"
Looks like there are no changes for ClusterRoleBinding "harvester-cloud-provider"
Patch DaemonSet "kube-vip" in namespace kube-system
Patch Deployment "harvester-cloud-provider" in namespace kube-system
beginning wait for 8 resources with timeout of 5m0s
Deployment is not ready: kube-system/harvester-cloud-provider. 0 out of 1 expected pods are ready
Deployment is not ready: kube-system/harvester-cloud-provider. 0 out of 1 expected pods are ready
Deployment is not ready: kube-system/harvester-cloud-provider. 0 out of 1 expected pods are ready
Deployment is not ready: kube-system/harvester-cloud-provider. 0 out of 1 expected pods are ready
Deployment is not ready: kube-system/harvester-cloud-provider. 0 out of 1 expected pods are ready
********
# continues till timeout
********
Error: UPGRADE FAILED: context deadline exceeded
@irishgordo irishgordo added kind/bug Issues that are defects reported by users or that we know have reached a real release severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) area/rancher Rancher related issues reproduce/always Reproducible 100% of the time area/cloud-provider Harvester cloud provider for guest cluster regression labels Mar 12, 2024
@starbops
Member

starbops commented Mar 13, 2024

I can reproduce this even with Rancher v2.7.10 with RKE1 v1.26.14 + harvester-cloud-provider 102.0.1+up0.1.14 on a Harvester v1.3.0-rc4 cluster. After initiating the upgrade of harvester-cloud-provider (targeting 102.0.2+up0.2.3), the Helm upgrade job cannot finish:

(screenshot)

Checking the newly spawned pod (with container image v0.2.0), it seems that the new pod cannot be scheduled to any of the nodes.

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-03-13T04:01:09Z"
  generateName: harvester-cloud-provider-65556b59c9-
  labels:
    app.kubernetes.io/instance: harvester-cloud-provider
    app.kubernetes.io/name: harvester-cloud-provider
    pod-template-hash: 65556b59c9
  name: harvester-cloud-provider-65556b59c9-228t5
  namespace: kube-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: harvester-cloud-provider-65556b59c9
    uid: 8e2f0fa9-3284-41ca-af36-bc6e1fbcef0e
  resourceVersion: "5134"
  uid: a645176c-14bf-4b85-b64c-ca0f961b94f9
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - harvester-cloud-provider
        topologyKey: kubernetes.io/hostname
  containers:
  - args:
    - --cloud-config=/etc/kubernetes/cloud-config
    - --cluster-name=frisbee
    command:
    - harvester-cloud-provider
    image: rancher/harvester-cloud-provider:v0.2.0
    imagePullPolicy: IfNotPresent
    name: harvester-cloud-provider
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/kubernetes/cloud-config
      name: cloud-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-7hmj2
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  nodeSelector:
    kubernetes.io/os: linux
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: harvester-cloud-provider
  serviceAccountName: harvester-cloud-provider
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node.cloudprovider.kubernetes.io/uninitialized
    operator: Equal
    value: "true"
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
    operator: Equal
  - effect: NoExecute
    key: node-role.kubernetes.io/etcd
    operator: Equal
  - effect: NoSchedule
    key: cattle.io/os
    operator: Equal
    value: linux
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - hostPath:
      path: /etc/kubernetes/cloud-config
      type: File
    name: cloud-config
  - name: kube-api-access-7hmj2
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-03-13T04:01:09Z"
    message: '0/1 nodes are available: 1 node(s) didn''t match pod anti-affinity rules.
      preemption: 0/1 nodes are available: 1 No preemption victims found for incoming
      pod..'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: BestEffort

This is because it's essentially a single-node RKE1 cluster: with the default RollingUpdate strategy the new pod is created before the old one is removed, and the scheduler cannot violate the required pod anti-affinity rule while the old pod is still running.

$ kubectl -n kube-system get po -l app.kubernetes.io/name=harvester-cloud-provider     
NAME                                        READY   STATUS    RESTARTS   AGE
harvester-cloud-provider-65556b59c9-228t5   0/1     Pending   0          13m
harvester-cloud-provider-856bcf647c-lqzln   1/1     Running   0          26m

So after 10 minutes, the Helm upgrade job failed:

(screenshot)

Wondering if @irishgordo observed the same situation at that moment?

The weird parts in the original issue:

  • The harvester-cloud-provider deployment is semi-updated
    • The labels are new, indicating it was updated
    • However, the container image is the old one
  • The Helm chart history shows Upgrade "harvester-cloud-provider" failed: context deadline exceeded, instead of Upgrade "harvester-cloud-provider" failed: timed out waiting for the condition (a helm history sketch for checking this follows the list)
  • There's no pending pod (waiting to be scheduled on a node)
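
For completeness, a quick way to see which failure message Helm recorded is to check the release history; a minimal sketch, using the release name and namespace from the upgrade command above:

helm -n kube-system history harvester-cloud-provider --max 5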

@starbops
Member

I cannot reproduce it with the following steps:

  1. Provision a Rancher instance in v2.7.11
  2. Provision a 2-node Harvester cluster in v1.3.0-rc4
  3. Create an RKE1 guest cluster (single node) in v1.26.14 with Harvester cloud provider (out-of-tree)
  4. Install the harvester-cloud-provider chart in 102.0.1+up0.1.14 from the marketplace on the RKE1 guest cluster
  5. Upgrade the RKE1 guest cluster to v1.27.11
  6. Upgrade harvester-cloud-provider to 102.0.2+up0.2.3 from the marketplace

The upgrade of the harvester-cloud-provider chart was indeed unsuccessful. The Helm upgrade job failed with the same error message as the original one:

(screenshot)

However, I didn't observe the same phenomenon: the harvester-cloud-provider image tag was changed to v0.2.0. It's almost the same as my findings in the comment above. The new harvester-cloud-provider pod cannot be scheduled to the node because of the pod anti-affinity rule and remains Pending.

FWIW, removing the old harvester-cloud-provider pod could make the deployment healthy. Then we need to apply the workaround mentioned here to make sure the load balancer functionality works as expected (this is fixed in harvester-cloud-provider chart 0.2.4).
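
A minimal sketch of that manual step, assuming you pick the old Running pod from a listing like the one above (the pod name below is a placeholder):

kubectl -n kube-system get po -l app.kubernetes.io/name=harvester-cloud-provider
kubectl -n kube-system delete pod <old-running-harvester-cloud-provider-pod>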

@irishgordo
Contributor Author

@starbops - thanks so much for the investigation 😄
You're correct: I did observe that same situation on the first trial, but then modified the deadline to be 5 minutes instead of 10.

But yes, the workaround is very efficient.

In general the steps are:

  1. Upgrade Harvester Cloud Provider
  2. While Upgrade is Running, remove the old Cloud Provider Pod
  3. The new one drops in no problem
  4. Then modify the kube-vip daemonset on the RKE1 cluster following your workaround (one way to apply this is sketched after the list), essentially:
# before
kube-vip:
  nodeSelector:
    node-role.kubernetes.io/control-plane: "true"

# after
kube-vip:
  nodeSelector:
    node-role.kubernetes.io/control-plane: null
    node-role.kubernetes.io/controlplane: "true"
  5. Observe kube-vip restart
  6. Then you can build corresponding LBs & Deployments like nginx:latest (air-gapped in this case) without issue; the LB gets built on the Harvester side and the Rancher side at v2.7.11 is good 😄
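
One way to apply that nodeSelector change directly to the DaemonSet (rather than through chart values) is a merge patch. This is a sketch only, assuming the DaemonSet is named kube-vip in kube-system as in the Helm output above; note a future chart upgrade would overwrite it:

kubectl -n kube-system patch daemonset kube-vip --type merge -p \
  '{"spec":{"template":{"spec":{"nodeSelector":{"node-role.kubernetes.io/control-plane":null,"node-role.kubernetes.io/controlplane":"true"}}}}}'

With a JSON merge patch, the null value removes the control-plane key while the controlplane key is added, mirroring the before/after values diff above.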

(screenshots: LB and deployments working after the workaround)

@starbops should I leave this open to track, or I can go ahead and close this out too?

@starbops
Member

@irishgordo Thanks for clarifying! Since this will only happen to single-node RKE1 guest clusters, and we've figured out the workaround, we could lower the severity and see if adding a descriptive note to the documentation addressing such a situation with a workaround is enough. cc @bk201

@irishgordo irishgordo added severity/3 Function working but has a major issue w/ workaround and removed severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) labels Mar 14, 2024
@irishgordo
Contributor Author

@starbops - that sounds good 😄 👍 - I've lowered the severity.
The workaround does work really well and is relatively easy to implement, so I don't think this should be severe 👍

@starbops starbops added the require/doc Improvements or additions to documentation label Mar 15, 2024
@starbops starbops self-assigned this Mar 15, 2024
@starbops
Member

starbops commented Apr 15, 2024

We can add the deployment strategy Recreate to the harvester-cloud-provider Deployment template, as learned from harvester/vm-import-controller#24. But on an HA-enabled cluster this will cause a transient service outage, since the new pods are only created after all the existing cloud-provider pods have been removed.
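
For reference, a minimal sketch of what that would look like on a standard apps/v1 Deployment (illustrative, not the final chart change):

spec:
  strategy:
    type: Recreate

With Recreate, all old pods are terminated before any new pod is created, which sidesteps the anti-affinity deadlock but also causes the outage mentioned above.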

@starbops
Member

starbops commented Apr 15, 2024

I found that adjusting the default rolling update strategy is better than changing it to the Recreate type entirely. Setting .spec.strategy.rollingUpdate.maxUnavailable to 1 instead of the default value of 25% makes the configuration work in every scenario:

  • Single-node cluster: service downtime is inevitable no matter whether the Recreate type or the RollingUpdate type is configured
  • HA-enabled cluster (with more than one control-plane node): with maxUnavailable of RollingUpdate set to 1, at most one pod will be down at any point in time. With the Recreate type, however, all pods go down at once.
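
A minimal sketch of the adjusted strategy on a standard apps/v1 Deployment (shown for illustration):

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1

With maxUnavailable: 1 the controller is allowed to take the old pod down even while the surged replacement is still Pending, so the anti-affinity deadlock on a single node resolves itself, and an HA cluster still loses at most one pod at a time.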

@harvesterhci-io-github-bot

harvesterhci-io-github-bot commented Jun 20, 2024

Pre Ready-For-Testing Checklist

  • If NOT labeled not-require/test-plan: Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only the test case skeleton (without implementation) exists, have you created an implementation issue?
    • The automation skeleton PR is at:
    • The automation test case PR is at:
  • If the fix introduces code for backward compatibility: Has a separate issue been filed with the label release/obsolete-compatibility?
    • The compatibility issue is filed at:

@harvesterhci-io-github-bot

Automation e2e test issue: harvester/tests#1326

@albinsun

Test PASS with Rancher Prime v2.8.6 and harvester-cloud-provider-v0.2.5+.

Note

Note that this issue is fixed in harvester-cloud-provider:v0.2.5, which is included in rancher-v2.8+ [1], so we use Rancher Prime v2.8.6 instead of the v2.7.11 reported in the ticket.

Environment

  • harvester-v1.4.0-rc3
    • Profile: Single node AMD64 QEMU/KVM
    • ui-source: Auto
  • rancher-v2.8.6
    • Profile: Single node AMD64 QEMU/KVM
    • K3s Channel: v1.28

Prerequisite

  1. Create image ubuntu-22.04-jammy-server-cloudimg-amd64.img

  2. Create VM network mgmt-vlan1

  3. Import Harvester to Rancher


Test Steps

  1. ✔️ Create RKE1 v1.26 cluster
  2. ✔️ Install harvester-cloud-provider:102.0.1+up0.1.14
  3. ✔️ Deploy mynginx1
  4. ✔️ Upgrade RKE1 cluster to v1.27
  5. ✔️ mynginx1 still works, deploy mynginx2
  6. ✔️ Upgrade harvester-cloud-provider chart to v0.2.5+ (via Marketplace, then Upgrade)
  7. ✔️ All mynginxs still work, deploy mynginx3

Footnotes

  1. https://github.com/rancher/charts/tree/release-v2.8/charts/harvester-cloud-provider

@irishgordo
Contributor Author

Thanks for this check, @albinsun 👍
