
[GCE] Scale-up from MIG size 0 not working with resource requests #7607

Open
whisperity opened this issue Dec 13, 2024 · 1 comment

Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug.
whisperity commented Dec 13, 2024

Which component are you using?: /area cluster-autoscaler

What version of the component are you using?:

Component version: 1.31.0

What k8s version are you using (kubectl version)?: 1.31

kubectl version Output
$ kubectl version
Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.31.3

What environment is this in?: Google Compute Engine (virtual machines) with Instance Templates and Instance Groups. VMs are stock Debian 12.0 with kubelet installed from official sources. NOT using https://cloud.google.com/kubernetes-engine?hl=en!

What did you expect to happen?: When there are pending pods whose resource requests would be satisfied by a machine created from the configured machine group, machines should be instantiated, nodes should join, and scaling should work.

What happened instead?: No scheduling; the pods kept waiting.

I1213 15:23:16.418500       1 orchestrator.go:594] Pod ci-runners-normal/amd64-XXXX-task-YYYY can't be scheduled on https://www.googleapis.com/compute/v1/projects/XXX/zones/YYY/instanceGroups/ci-e2-standard-4, predicate checking error: Insufficient cpu, Insufficient memory; predicateName=NodeResourcesFit; reasons: Insufficient cpu, Insufficient memory; debugInfo=

How to reproduce it (as minimally and precisely as possible):

  1. Create a weak master VM (e2-small) to run the Kubernetes control plane, and go through the steps of installing kubeadm, kubelet, and flannel to get the cluster ready.
  2. Have a Pod spec that requests something like 4 CPUs and 16000Mi of RAM.
  3. Create an e2-standard-4 (4 vCPU, 16 GB RAM) instance template and a corresponding instance group.
  4. Set up the autoscaler according to the docs, providing the instance group name and the scaling range.
  5. Try to deploy the pod. Creating an e2-standard-4 instance would satisfy the requirements to deploy the pod to a sufficiently powerful machine, but nothing happens.
  6. Start a machine manually and have it kubeadm join; scheduling then works.
  7. Create more replicas of the previous pod. New VMs now keep starting, because the autoscaler now knows that the aforementioned group can spawn VMs that satisfy the requirements.
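For reference, the pod in step 2 could look roughly like the following minimal manifest (pod/container names and the image are placeholders, not taken from the original report):

```yaml
# Hypothetical repro pod; only the resource requests matter for this issue.
apiVersion: v1
kind: Pod
metadata:
  name: ci-runner-repro
  namespace: ci-runners-normal
spec:
  containers:
    - name: worker
      image: debian:12
      command: ["sleep", "infinity"]
      resources:
        requests:
          cpu: "4"
          memory: "16000Mi"
```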

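Step 4, per the cluster-autoscaler GCE docs, amounts to pointing the autoscaler at the MIG with a 0 minimum. A sketch of the relevant Deployment fragment, assuming a 0:10 scaling range (PROJECT, ZONE, and the image tag are placeholders):

```yaml
# Fragment of a cluster-autoscaler Deployment pod spec; values are placeholders.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.31.0
    command:
      - ./cluster-autoscaler
      - --cloud-provider=gce
      - --nodes=0:10:https://www.googleapis.com/compute/v1/projects/PROJECT/zones/ZONE/instanceGroups/ci-e2-standard-4
```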
Anything else we need to know?:

This issue is part of the "usual" inability to scale up from 0 workers. My previous attempt used nodeSelector and hard-coded labels (which are only established after the VMs initialise and perform kubeadm join) to pin pods to appropriate nodes, but that did not work either.
If there is a node alive in the group, the autoscaler (until the cluster is reset) finds the necessary connection and knows which group to instantiate.
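That earlier nodeSelector attempt would have looked something like the fragment below (the label key/value is a placeholder for whatever the VMs actually set). Since the label only exists on live nodes, a scale-from-0 group gives the autoscaler nothing to match against:

```yaml
# Hypothetical fragment of the earlier pod spec; the label is applied by the
# VM after boot + kubeadm join, so it is invisible when the group is at size 0.
spec:
  nodeSelector:
    ci.example.org/machine-class: e2-standard-4
```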

Kubernetes knows about the capabilities of the node
$ kubectl describe node ci-e2-standard-4-XXXX
[…]
Capacity:
  cpu:                4
  ephemeral-storage:  16273284Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16386268Ki
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  14997458510
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16283868Ki
  pods:               110
[…]
The MachineType API sufficiently describes the performance capabilities
$ gcloud compute instance-templates describe ci-e2-standard-4
kind: compute#instanceTemplate
name: ci-worker-2c4t-r16g-st32g
properties:
[…]
  machineType: e2-standard-4

$ gcloud compute machine-types describe e2-standard-4
creationTimestamp: '1969-12-31T16:00:00.000-08:00'
description: Efficient Instance, 4 vCPUs, 16 GB RAM
guestCpus: 4
id: '335004'
imageSpaceGb: 0
isSharedCpu: false
kind: compute#machineType
maximumPersistentDisks: 128
maximumPersistentDisksSizeGb: '263168'
memoryMb: 16384
name: e2-standard-4
selfLink: https://www.googleapis.com/compute/v1/projects/XXX/zones/YYY/machineTypes/e2-standard-4
zone: YYY

@whisperity whisperity added the kind/bug Categorizes issue or PR as related to a bug. label Dec 13, 2024
@whisperity (Author)
/area cluster-autoscaler
