
[GCE] Scale-up from MIG size 0 not working with resource requests #7607

Open
whisperity opened this issue Dec 13, 2024 · 1 comment

Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug.
whisperity commented Dec 13, 2024

Which component are you using?: /area cluster-autoscaler

What version of the component are you using?:

Component version: 1.31.0

What k8s version are you using (kubectl version)?: 1.31

kubectl version Output
$ kubectl version
Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.31.3

What environment is this in?: Google Compute Engine (virtual machines) with Instance Templates and Instance Groups. VMs are stock Debian 12.0 with kubelet installed from official sources. NOT using https://cloud.google.com/kubernetes-engine?hl=en!

What did you expect to happen?: When there are pending pods whose resource requests would be satisfied by a machine created from the configured machine group, machines should be instantiated, nodes should join, and scaling should work.

What happened instead?: No scheduling; the pods kept waiting.

I1213 15:23:16.418500       1 orchestrator.go:594] Pod ci-runners-normal/amd64-XXXX-task-YYYY can't be scheduled on https://www.googleapis.com/compute/v1/projects/XXX/zones/YYY/instanceGroups/ci-e2-standard-4, predicate checking error: Insufficient cpu, Insufficient memory; predicateName=NodeResourcesFit; reasons: Insufficient cpu, Insufficient memory; debugInfo=

How to reproduce it (as minimally and precisely as possible):

  1. Create a weak master VM (e2-small) to run the Kubernetes control plane, and go through the steps of installing kubeadm, kubelet, and flannel to get the cluster ready.
  2. Have a Pod spec that requests something like 4 CPUs and 16000Mi of RAM.
  3. Create an e2-standard-4 (4 vCPU, 16 GB RAM) instance template and a corresponding instance group.
  4. Set up the autoscaler according to the docs, providing the instance group name and the scaling range.
  5. Try to deploy the pod. Creating an e2-standard-4 instance would satisfy the requirements to deploy the pod to a sufficiently powerful machine, but nothing happens.
  6. Start a machine manually and have it kubeadm join; scheduling then works.
  7. Create more replicas of the previous pod. New VMs now keep starting, because the autoscaler now knows that the aforementioned group can spawn VMs that satisfy the requirements.
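For reference, the pod in step 2 could look roughly like the following minimal manifest (pod/container names and the image are placeholders, not taken from the original report):

```yaml
# Hypothetical repro pod; only the resource requests matter for this issue.
apiVersion: v1
kind: Pod
metadata:
  name: ci-runner-repro
  namespace: ci-runners-normal
spec:
  containers:
    - name: worker
      image: debian:12
      command: ["sleep", "infinity"]
      resources:
        requests:
          cpu: "4"
          memory: "16000Mi"
```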

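Step 4, per the cluster-autoscaler GCE docs, amounts to pointing the autoscaler at the MIG with a 0 minimum. A sketch of the relevant Deployment fragment, assuming a 0:10 scaling range (PROJECT, ZONE, and the image tag are placeholders):

```yaml
# Fragment of a cluster-autoscaler Deployment pod spec; values are placeholders.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.31.0
    command:
      - ./cluster-autoscaler
      - --cloud-provider=gce
      - --nodes=0:10:https://www.googleapis.com/compute/v1/projects/PROJECT/zones/ZONE/instanceGroups/ci-e2-standard-4
```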
Anything else we need to know?:

This issue is part of the "usual" inability to scale up from 0 workers. My previous attempt used nodeSelector and hard-coded labels (which are only established after the VMs initialise and perform kubeadm join) to pin pods to appropriate nodes, but that did not work either.
If there is a node alive in the group, the autoscaler (until the cluster is reset) finds the necessary connection and knows which group to instantiate.
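That earlier nodeSelector attempt would have looked something like the fragment below (the label key/value is a placeholder for whatever the VMs actually set). Since the label only exists on live nodes, a scale-from-0 group gives the autoscaler nothing to match against:

```yaml
# Hypothetical fragment of the earlier pod spec; the label is applied by the
# VM after boot + kubeadm join, so it is invisible when the group is at size 0.
spec:
  nodeSelector:
    ci.example.org/machine-class: e2-standard-4
```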

Kubernetes knows about the capabilities of the node
$ kubectl describe node ci-e2-standard-4-XXXX
[…]
Capacity:
  cpu:                4
  ephemeral-storage:  16273284Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16386268Ki
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  14997458510
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16283868Ki
  pods:               110
[…]
The MachineType API sufficiently describes the performance capabilities
$ gcloud compute instance-templates describe ci-e2-standard-4
kind: compute#instanceTemplate
name: ci-worker-2c4t-r16g-st32g
properties:
[…]
  machineType: e2-standard-4

$ gcloud compute machine-types describe e2-standard-4
creationTimestamp: '1969-12-31T16:00:00.000-08:00'
description: Efficient Instance, 4 vCPUs, 16 GB RAM
guestCpus: 4
id: '335004'
imageSpaceGb: 0
isSharedCpu: false
kind: compute#machineType
maximumPersistentDisks: 128
maximumPersistentDisksSizeGb: '263168'
memoryMb: 16384
name: e2-standard-4
selfLink: https://www.googleapis.com/compute/v1/projects/XXX/zones/YYY/machineTypes/e2-standard-4
zone: YYY

@whisperity whisperity added the kind/bug Categorizes issue or PR as related to a bug. label Dec 13, 2024
@whisperity (Author)
/area cluster-autoscaler
