CKS Enhancements #9102

Merged: 110 commits merged into apache:main from cks-enhancements-upstream on Jun 19, 2025

Conversation

@nvazquez (Contributor) commented May 21, 2024

Description

Design Document: https://cwiki.apache.org/confluence/display/CLOUDSTACK/CKS+Enhancements

Documentation PR: apache/cloudstack-documentation#458

This PR extends the CloudStack Kubernetes Service (CKS) functionality to meet the following requirements:

  • Ability to specify different compute or service offerings for different types of CKS cluster nodes – worker, master or etcd: The createKubernetesCluster API and the corresponding UI must provide an option to select a different offering for each node type (an illustrative command is sketched after this list). CKS compute offerings will be marked as CKS compatible.
  • Ability to use CKS-ready custom templates for CKS cluster nodes: CKS will allow users to specify their own templates for the different CKS node types (control and worker) at the time of cluster creation. Those templates will be marked as CKS compatible.
  • Ability to use generic (non-CKS-ready) custom templates for CKS cluster nodes: CKS will allow users to specify their own generic templates for the different CKS node types (control and worker) at the time of cluster creation. Those templates will be marked as CKS compatible, and the user is responsible for installing all the necessary packages in the template.
  • Ability to add and remove a pre-created instance as a worker node to an existing CKS cluster: An instance (either virtual or physical) which has been built and prepared for CKS can be added to the desired CKS cluster. The instance must have all the CKS worker node packages installed.
  • Ability to separate etcd from master nodes of the CKS cluster: End users should be provided with an option to separate the etcd cluster at the time of CKS cluster creation. The user can enable this option in the UI or in the createKubernetesCluster API and specify the size of the etcd cluster. Based on these inputs, CloudStack should be able to provision the etcd nodes for the CKS cluster.
  • Ability to mark CKS cluster nodes for manual-only upgrade: An end user should be able to mark the desired compute offering (or the CKS template) for manual upgrades only. CKS cluster nodes marked for manual upgrade should be left untouched during a Kubernetes version upgrade executed via the upgradeKubernetesCluster API.
  • Ability to dedicate specific hosts/clusters to a specific domain for CKS cluster deployment: The existing dedicateHost/dedicateCluster APIs can be used to dedicate hosts/clusters for CKS cluster deployments. During deployment, CKS cluster node VMs are placed on the dedicated hosts/clusters by default.
  • Methodology for AS number management: Operators should be able to assign a range of AS numbers to an ACS Zone. ACS must have a method to assign an AS number to each Isolated network (or VPC tier), which can be retrieved via the UI and API. (Introduced in PR #9470: New feature: Dynamic and Static Routing)
  • Methodology to use diverse CNI plugins (Calico, Cilium, etc.): End users should be able to deploy CKS clusters with the Calico CNI. An option to specify which CNI plugin to use for a CKS cluster must be provided in the createKubernetesCluster API. The CNI configuration and setup can be registered as managed userdata, and any configurable parameters (e.g. AS number, BGP peer AS number and IP address) can be defined as variables in the userdata and set during the creation of the CKS cluster. This provides a flexible way for users to use the CNI plugin of their choice.
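
To illustrate how these options could fit together at creation time, here is a rough CloudMonkey sketch (all UUIDs are placeholders). The nodeofferings mapping syntax mirrors the scaleKubernetesCluster calls quoted later in this thread; its availability on create, together with the etcdnodes and cniconfigurationid parameter names, is an assumption made for illustration only, so check the documentation PR (apache/cloudstack-documentation#458) for the exact parameters:

(localcloud) 🐱 > create kubernetescluster name=demo-cks zoneid=<zone-uuid> kubernetesversionid=<k8s-version-uuid> controlnodes=2 size=3 serviceofferingid=<default-offering-uuid> nodeofferings[0].node="control" nodeofferings[0].offering="<control-offering-uuid>" nodeofferings[1].node="worker" nodeofferings[1].offering="<worker-offering-uuid>" nodeofferings[2].node="etcd" nodeofferings[2].offering="<etcd-offering-uuid>" etcdnodes=3 cniconfigurationid=<registered-cni-userdata-uuid>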

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

In testing on a vCenter 7.0 environment with NSX SDN

How did you try to break this feature and the system with this change?

nvazquez and others added 8 commits May 21, 2024 12:40
* Ability to specify different compute or service offerings for different types of CKS cluster nodes – worker, master or etcd

* Ability to use CKS ready custom templates for CKS cluster nodes

---------

Co-authored-by: Pearl Dsilva <[email protected]>
Add and Remove external nodes to and from a kubernetes cluster

---------

Co-authored-by: nvazquez <[email protected]>
* CKS: Fix ISO attach logic

* address comment
@nvazquez (Contributor Author)

@blueorangutan package

@blueorangutan

@nvazquez a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✖️ debian ✔️ suse15. SL-JID 9649


codecov bot commented May 22, 2024

Codecov Report

Attention: Patch coverage is 10.58824% with 2052 lines in your changes missing coverage. Please review.

Project coverage is 16.57%. Comparing base (2d669db) to head (b09b75a).
Report is 13 commits behind head on main.

Files with missing lines Patch % Lines
...bernetes/cluster/KubernetesClusterManagerImpl.java 11.95% 320 Missing and 4 partials ⚠️
...r/actionworkers/KubernetesClusterActionWorker.java 1.14% 260 Missing ⚠️
...er/actionworkers/KubernetesClusterStartWorker.java 0.00% 247 Missing ⚠️
...ster/actionworkers/KubernetesClusterAddWorker.java 0.00% 211 Missing ⚠️
...KubernetesClusterResourceModifierActionWorker.java 0.00% 119 Missing ⚠️
...er/actionworkers/KubernetesClusterScaleWorker.java 25.67% 104 Missing and 6 partials ⚠️
...r/actionworkers/KubernetesClusterRemoveWorker.java 0.00% 106 Missing ⚠️
...ava/com/cloud/upgrade/dao/Upgrade42010to42100.java 26.25% 57 Missing and 2 partials ⚠️
.../cloud/kubernetes/cluster/KubernetesClusterVO.java 0.00% 56 Missing ⚠️
...dstack/api/response/KubernetesClusterResponse.java 0.00% 53 Missing ⚠️
... and 51 more
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #9102      +/-   ##
============================================
- Coverage     16.60%   16.57%   -0.04%     
- Complexity    13924    13968      +44     
============================================
  Files          5730     5743      +13     
  Lines        508082   510468    +2386     
  Branches      61770    62073     +303     
============================================
+ Hits          84388    84615     +227     
- Misses       414259   416390    +2131     
- Partials       9435     9463      +28     
Flag Coverage Δ
uitests 3.90% <ø> (-0.04%) ⬇️
unittests 17.47% <10.58%> (-0.03%) ⬇️


@nvazquez force-pushed the cks-enhancements-upstream branch from 5710f92 to 469c08d on May 22, 2024 00:59
@nvazquez (Contributor Author)

@blueorangutan package

@blueorangutan

@nvazquez a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✖️ debian ✔️ suse15. SL-JID 9650

@Pearl1594 (Contributor) left a comment

lgtm. Tested it - works as expected.

@Pearl1594 closed this Jun 13, 2025
@Pearl1594 reopened this Jun 13, 2025
@sureshanaparti (Contributor)

Hi @bernardodemarco, the PR looks ready and has been extensively tested, also by @kiranchavala. Would it be possible to get your final review as well?

ping @bernardodemarco - have your concerns been addressed? Can you review this again?

@bernardodemarco (Collaborator)

ping @bernardodemarco - have your concerns been addressed? Can you review this again?

Sorry for the delay, I'll try to test it today

@bernardodemarco (Collaborator) left a comment

Here are some more tests that I performed:

  1. Created a Kubernetes Cluster only specifying the serviceofferingid parameter
  2. Changed the offerings of the control and worker nodes
(localcloud) 🐛 > scale kubernetescluster id=9ead6350-d7bc-4802-a22b-9851c912397c  nodeofferings[0].node="control" nodeofferings[0].offering="da00ecb4-b43e-4c2d-a4de-d7fe10b5e210" nodeofferings[1].node="worker" nodeofferings[1].offering="65fffad1-569e-496d-8b40-50c41a742689"
  3. Verified that the offerings changed correctly
  4. Tried to change the offerings of each node type by only specifying the serviceofferingid parameter. An NPE was thrown:
(localcloud) 🦈 > scale kubernetescluster id=9ead6350-d7bc-4802-a22b-9851c912397c serviceofferingid=9f07cbe1-d312-4ec6-bdfe-948479cc5977 
{
  "account": "admin",
  "accountid": "418a5137-a510-11ef-8a39-9a34acb639ea",
  "cmd": "org.apache.cloudstack.api.command.user.kubernetes.cluster.ScaleKubernetesClusterCmd",
  "completed": "2025-06-16T17:43:19+0000",
  "created": "2025-06-16T17:43:19+0000",
  "domainid": "28ef19d3-a50e-11ef-8a39-9a34acb639ea",
  "domainpath": "ROOT",
  "jobid": "589b555a-a39e-4021-9c77-42b948c13180",
  "jobprocstatus": 0,
  "jobresult": {
    "errorcode": 530,
    "errortext": "Cannot invoke \"com.cloud.offering.ServiceOffering.getCpu()\" because \"offering\" is null"
  },
  "jobresultcode": 530,
  "jobresulttype": "object",
  "jobstatus": 2,
  "userid": "418df8e2-a510-11ef-8a39-9a34acb639ea"
}
2025-06-16 17:43:19,297 ERROR [c.c.a.ApiAsyncJobDispatcher] (API-Job-Executor-13:[ctx-789343be, job-99]) (logid:589b555a) Unexpected exception while executing org.apache.cloudstack.api.command.user.kubernetes.cluster.ScaleKubernetesClusterCmd java.lang.NullPointerException: Cannot invoke "com.cloud.offering.ServiceOffering.getCpu()" because "offering" is null
	at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterScaleWorker.calculateNodesCapacity(KubernetesClusterScaleWorker.java:261)
	at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterScaleWorker.calculateNewClusterCountAndCapacity(KubernetesClusterScaleWorker.java:227)
	at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterScaleWorker.updateKubernetesClusterEntryForNodeType(KubernetesClusterScaleWorker.java:202)
	at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterScaleWorker.scaleKubernetesClusterOffering(KubernetesClusterScaleWorker.java:384)
	at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterScaleWorker.scaleCluster(KubernetesClusterScaleWorker.java:574)
	at com.cloud.kubernetes.cluster.KubernetesClusterManagerImpl.scaleKubernetesCluster(KubernetesClusterManagerImpl.java:2102)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
--
	at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
	at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
	at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
	at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:637)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)

@nvazquez (Contributor Author)

Thanks @bernardodemarco - I will work on amending this use case.

@nvazquez (Contributor Author)

Many thanks @bernardodemarco - I have fixed the NPE issue

@blueorangutan package

@blueorangutan

@nvazquez a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13802

@sureshanaparti (Contributor)

@bernardodemarco can you verify the NPE issue with the latest changes?

@bernardodemarco (Collaborator) left a comment

Verified that the NPE is fixed and that the scaling in the previous test case works as expected. After the operation succeeded, this is the current state of the k8s cluster:

(segregated-lab) 🐱 > list kubernetesclusters filter=serviceofferingname,,workerofferingname,controlofferingname,
{
  "count": 1,
  "kubernetescluster": [
    {
      "controlofferingname": "control-plane",
      "serviceofferingname": "med-offering",
      "workerofferingname": "worker-plane"
    }
  ]
}
(segregated-lab) 🐱 > list virtualmachines filter=name,serviceofferingname,
{
  "count": 2,
  "virtualmachine": [
    {
      "name": "k8ss-control-1977de3df05",
      "serviceofferingname": "med-offering"
    },
    {
      "name": "k8ss-node-1977de40dc8",
      "serviceofferingname": "med-offering"
    }
  ]
}

IMO, it is a bit confusing that users are informed that the worker and control offerings are worker-plane and control-plane, respectively, while the actual offering of the nodes is med-offering. The worker_node_service_offering_id and control_node_service_offering_id values become stale in this case.

I believe that it would be interesting for us to handle these scenarios. The cluster's metadata should be consistent with the actual computing resources that are allocated for it.

Comment on lines +62 to +67
ALTER TABLE `cloud`.`kubernetes_cluster` ADD CONSTRAINT `fk_cluster__control_node_service_offering_id` FOREIGN KEY `fk_cluster__control_node_service_offering_id`(`control_node_service_offering_id`) REFERENCES `service_offering`(`id`) ON DELETE CASCADE;
ALTER TABLE `cloud`.`kubernetes_cluster` ADD CONSTRAINT `fk_cluster__worker_node_service_offering_id` FOREIGN KEY `fk_cluster__worker_node_service_offering_id`(`worker_node_service_offering_id`) REFERENCES `service_offering`(`id`) ON DELETE CASCADE;
ALTER TABLE `cloud`.`kubernetes_cluster` ADD CONSTRAINT `fk_cluster__etcd_node_service_offering_id` FOREIGN KEY `fk_cluster__etcd_node_service_offering_id`(`etcd_node_service_offering_id`) REFERENCES `service_offering`(`id`) ON DELETE CASCADE;
ALTER TABLE `cloud`.`kubernetes_cluster` ADD CONSTRAINT `fk_cluster__control_node_template_id` FOREIGN KEY `fk_cluster__control_node_template_id`(`control_node_template_id`) REFERENCES `vm_template`(`id`) ON DELETE CASCADE;
ALTER TABLE `cloud`.`kubernetes_cluster` ADD CONSTRAINT `fk_cluster__worker_node_template_id` FOREIGN KEY `fk_cluster__worker_node_template_id`(`worker_node_template_id`) REFERENCES `vm_template`(`id`) ON DELETE CASCADE;
ALTER TABLE `cloud`.`kubernetes_cluster` ADD CONSTRAINT `fk_cluster__etcd_node_template_id` FOREIGN KEY `fk_cluster__etcd_node_template_id`(`etcd_node_template_id`) REFERENCES `vm_template`(`id`) ON DELETE CASCADE;
Collaborator

Should we have the ON DELETE CASCADE here?

When deleting a service offering, for instance, the kubernetes cluster will be deleted from the DB, right?

Member

ON DELETE CASCADE has no effect, as ACS never removes the records for templates or service offerings, it just marks them as removed.

Collaborator

Yes, ACS currently soft-deletes such records (it only marks them as removed in the DB). However, removing the ON DELETE CASCADE would prevent an accidental DELETE query executed by the operator from cascading into the kubernetes_cluster table.

@nvazquez (Contributor Author)

Thanks @bernardodemarco - in the UI we prevent that case, as the user is asked to set an offering per node type instead of a global one. Do you think we can add the same check to the API?

[Screenshot 2025-06-17 at 10:23:33]

@sureshanaparti moved this from Done to In Progress in Apache CloudStack 4.21.0 on Jun 17, 2025
@bernardodemarco (Collaborator)

in the UI we prevent that case, as the user is asked to set an offering per node type instead of a global one

@nvazquez, yes

do you think we can add the same check for the API?

The problem with this approach is that it would break backwards compatibility, right? Prior to the changes, the serviceofferingid was used to change the service offering of all nodes of a k8s cluster. With the current patch, it can still be used to achieve such action and, thus, the API behavior seems to be ok. The only drawback is that the cluster's metadata is not fully consistent.

@nvazquez (Contributor Author)

@bernardodemarco I think it doesn't break backwards compatibility: previously created CKS clusters, and new clusters not using the advanced options, remain the same, and scaling still works by providing the global service offering. However, the use case you present is valid. I suggested adding the check to match the UI, in which case users can set the same offering per node type to achieve the same operation, but I will explore a solution.

Update node offering entry if there was an existing offering but a global service offering has been provided on scale
@nvazquez (Contributor Author)

@bernardodemarco I've addressed the use case:

  • Having a different offering per node type:
(localcloud) 🐱 > list virtualmachines filter=id,name,serviceofferingname,
{
  "count": 2,
  "virtualmachine": [
    {
      "id": "87fbaefe-212f-4cdd-a006-50f3461a60bc",
      "name": "cks-control-1977e7f1130",
      "serviceofferingname": "Control-Offering"
    },
    {
      "id": "b5fc5c71-5aa3-4268-bb48-befe8789b090",
      "name": "cks-node-1977e7f5641",
      "serviceofferingname": "Worker-Offering"
    }
  ]
}
(localcloud) 🐱 > list kubernetesclusters id=54de3e3c-a965-41b3-961f-894c5bc66e15 filter=serviceofferingname,controlofferingname,workerofferingname,
{
  "count": 1,
  "kubernetescluster": [
    {
      "controlofferingname": "Control-Offering",
      "serviceofferingname": "CKS - 2GB",
      "workerofferingname": "Worker-Offering"
    }
  ]
}
  • Update global offering for the cluster:
scale kubernetescluster id=54de3e3c-a965-41b3-961f-894c5bc66e15 serviceofferingid=386a2c4a-5d16-4a4e-b27e-ecc84538cbfe 
  • Verification:
(localcloud) 🐱 > list kubernetesclusters id=54de3e3c-a965-41b3-961f-894c5bc66e15 filter=serviceofferingname,controlofferingname,workerofferingname,
{
  "count": 1,
  "kubernetescluster": [
    {
      "controlofferingname": "New-Offering",
      "serviceofferingname": "New-Offering",
      "workerofferingname": "New-Offering"
    }
  ]
}
(localcloud) 🐱 > list virtualmachines filter=id,name,serviceofferingname,
{
  "count": 2,
  "virtualmachine": [
    {
      "id": "87fbaefe-212f-4cdd-a006-50f3461a60bc",
      "name": "cks-control-1977e7f1130",
      "serviceofferingname": "New-Offering"
    },
    {
      "id": "b5fc5c71-5aa3-4268-bb48-befe8789b090",
      "name": "cks-node-1977e7f5641",
      "serviceofferingname": "New-Offering"
    }
  ]
}

@nvazquez (Contributor Author)

@blueorangutan package

@blueorangutan

@nvazquez a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13822

@bernardodemarco (Collaborator) left a comment

@nvazquez, nice, verified that the cluster's metadata is consistent with the actual offerings applied to each node.


Here are other tests that I performed:

  1. Created a k8s cluster with a global offering (min-k8s-offering)
  2. Changed the offerings of the control and worker nodes by manipulating the nodeofferings parameter:
(segregated-lab) 🐱 > list virtualmachines filter=name,serviceofferingname,
{
  "count": 2,
  "virtualmachine": [
    {
      "name": "k8s-control-1978362ca43",
      "serviceofferingname": "control-plane"
    },
    {
      "name": "k8s-node-1978362fcd6",
      "serviceofferingname": "worker-plane"
    }
  ]
}
(segregated-lab) 🐱 > list kubernetesclusters filter=name,serviceofferingname,workerofferingname,controlofferingname,
{
  "count": 1,
  "kubernetescluster": [
    {
      "controlofferingname": "control-plane",
      "name": "k8s",
      "serviceofferingname": "min-k8s-offering",
      "workerofferingname": "worker-plane"
    }
  ]
}
  3. Changed the global offering of the cluster back to the min-k8s-offering:
(segregated-lab) 🐱 > scale kubernetescluster id=83b45929-6c50-4002-86b0-83db70a6f2d2 serviceofferingid=1494d756-55b2-4b0e-a062-6f8420e5becf 
# (...)
(segregated-lab) 🐱 > list virtualmachines filter=name,serviceofferingname,
{
  "count": 2,
  "virtualmachine": [
    {
      "name": "k8s-control-1978362ca43",
      "serviceofferingname": "control-plane"
    },
    {
      "name": "k8s-node-1978362fcd6",
      "serviceofferingname": "worker-plane"
    }
  ]
}
(segregated-lab) 🐱 > list kubernetesclusters filter=name,serviceofferingname,workerofferingname,controlofferingname,
{
  "count": 1,
  "kubernetescluster": [
    {
      "controlofferingname": "control-plane",
      "name": "k8s",
      "serviceofferingname": "min-k8s-offering",
      "workerofferingname": "worker-plane"
    }
  ]
}
  4. A successful response was returned, but the nodes were not actually scaled. It would be interesting to return an error message to the end user in this scenario.

With the same cluster, scaled its global offering to the med-offering. Verified that the operation was successfully executed:

(segregated-lab) 🐱 > list kubernetesclusters filter=name,serviceofferingname,workerofferingname,controlofferingname,
{
  "count": 1,
  "kubernetescluster": [
    {
      "controlofferingname": "med-offering",
      "name": "k8s",
      "serviceofferingname": "med-offering",
      "workerofferingname": "med-offering"
    }
  ]
}
(segregated-lab) 🐱 > list virtualmachines filter=name,serviceofferingname,
{
  "count": 2,
  "virtualmachine": [
    {
      "name": "k8s-control-1978362ca43",
      "serviceofferingname": "med-offering"
    },
    {
      "name": "k8s-node-1978362fcd6",
      "serviceofferingname": "med-offering"
    }
  ]
}

Successfully scaled only the control nodes:

(segregated-lab) 🐱 > list kubernetesclusters filter=name,serviceofferingname,workerofferingname,controlofferingname,
{
  "count": 1,
  "kubernetescluster": [
    {
      "controlofferingname": "max-offering",
      "name": "k8s",
      "serviceofferingname": "med-offering",
      "workerofferingname": "med-offering"
    }
  ]
}
(segregated-lab) 🐱 > list virtualmachines filter=name,serviceofferingname,
{
  "count": 2,
  "virtualmachine": [
    {
      "name": "k8s-control-1978362ca43",
      "serviceofferingname": "max-offering"
    },
    {
      "name": "k8s-node-1978362fcd6",
      "serviceofferingname": "med-offering"
    }
  ]
}

Successfully scaled only the worker nodes:

(segregated-lab) 🐱 > list virtualmachines filter=name,serviceofferingname,
{
  "count": 2,
  "virtualmachine": [
    {
      "name": "k8s-control-1978362ca43",
      "serviceofferingname": "max-offering"
    },
    {
      "name": "k8s-node-1978362fcd6",
      "serviceofferingname": "max-offering"
    }
  ]
}
(segregated-lab) 🐱 > list kubernetesclusters filter=name,serviceofferingname,workerofferingname,controlofferingname,
{
  "count": 1,
  "kubernetescluster": [
    {
      "controlofferingname": "max-offering",
      "name": "k8s",
      "serviceofferingname": "med-offering",
      "workerofferingname": "max-offering"
    }
  ]
}

Scaled the cluster specifying the global offering and specific offerings for the worker and control nodes. Verified that the global offering was ignored, as expected, and the nodes were scaled to the offerings specified in the nodeofferings parameter.
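
For reference, a call for this combined test would look roughly like the following (placeholder UUIDs; as described later in the thread, the nodeofferings entries take precedence over the global serviceofferingid):

(segregated-lab) 🐱 > scale kubernetescluster id=<cluster-uuid> serviceofferingid=<global-offering-uuid> nodeofferings[0].node="control" nodeofferings[0].offering="<control-offering-uuid>" nodeofferings[1].node="worker" nodeofferings[1].offering="<worker-offering-uuid>"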


Deployed a new cluster, with the min-k8s-offering global offering. Scaled the cluster updating the offering of the worker nodes and the cluster's global offering:

scale kubernetescluster id=d43f889e-341e-465d-983f-e5f1537dc991 serviceofferingid=345ec1d7-7c43-4345-9acf-affdebb6ef38 nodeofferings[0].node="worker" nodeofferings[0].offering="9a5a2e85-3d99-468e-9467-fa4aae7a37a6"

Verified that the worker offering was correctly scaled. I expected, however, that the offering of the control nodes would be scaled to the control-plane one (345ec1d7-7c43-4345-9acf-affdebb6ef38), but it was not changed:

(segregated-lab) 🐱 > list virtualmachines filter=name,serviceofferingname,
{
  "count": 2,
  "virtualmachine": [
    {
      "name": "k8sss-control-197837dbc1f",
      "serviceofferingname": "min-k8s-offering"
    },
    {
      "name": "k8sss-node-197837def68",
      "serviceofferingname": "worker-plane"
    }
  ]
}
(segregated-lab) 🐱 > list kubernetesclusters filter=name,serviceofferingname,workerofferingname,controlofferingname,
{
  "count": 1,
  "kubernetescluster": [
    {
      "name": "k8sss",
      "serviceofferingname": "min-k8s-offering",
      "workerofferingname": "worker-plane"
    }
  ]
}

Verified a scenario similar to the previous one: updated the global offering and the control offering at the same time. The control offering was scaled successfully, but the global offering was ignored:

(segregated-lab) 🐱 > scale kubernetescluster id=d43f889e-341e-465d-983f-e5f1537dc991 serviceofferingid=730f4c2b-7729-425f-81d7-fbb556d0eef3  nodeofferings[0].node="control" nodeofferings[0].offering="345ec1d7-7c43-4345-9acf-affdebb6ef38"
# (...)
(segregated-lab) 🐱 > list virtualmachines filter=name,serviceofferingname,
{
  "count": 2,
  "virtualmachine": [
    {
      "name": "k8sss-control-197837dbc1f",
      "serviceofferingname": "control-plane"
    },
    {
      "name": "k8sss-node-197837def68",
      "serviceofferingname": "worker-plane"
    }
  ]
}
(segregated-lab) 🐱 > list kubernetesclusters filter=name,serviceofferingname,workerofferingname,controlofferingname,
{
  "count": 1,
  "kubernetescluster": [
    {
      "controlofferingname": "control-plane",
      "name": "k8sss",
      "serviceofferingname": "min-k8s-offering",
      "workerofferingname": "worker-plane"
    }
  ]
}

@nvazquez (Contributor Author)

Many thanks @bernardodemarco for your testing. In my understanding, cases 3-4 are expected, as the global offering was actually the same as the current one; I can add a message on the API response for that case. For the last two cases, I think the behavior is also expected as per the parameter description:

(Optional) Node Type to Service Offering ID mapping. If provided, it overrides the serviceofferingid parameter

If you agree, we can proceed with this PR and discuss these corner cases in a separate issue/PR, as the user has a workaround: providing the offerings for each node type at the same time (see the sketch below).
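
A minimal sketch of that workaround, assuming placeholder UUIDs and reusing the nodeofferings syntax already shown earlier in the thread:

(localcloud) 🐱 > scale kubernetescluster id=<cluster-uuid> nodeofferings[0].node="control" nodeofferings[0].offering="<new-offering-uuid>" nodeofferings[1].node="worker" nodeofferings[1].offering="<new-offering-uuid>"

This scales both the control and worker nodes to the same offering without relying on the global serviceofferingid parameter.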

@bernardodemarco (Collaborator) left a comment

In my understanding, cases 3-4 are expected, as the global offering was actually the same as the current one; I can add a message on the API response for that case.

@nvazquez, yes, it would be nice.

For the last two cases, I think the behavior is also expected as per the parameter description:

Yes, I agree. I hadn’t paid attention to the parameter description, so that shouldn’t be a problem at all.


@nvazquez, @Pearl1594, @sureshanaparti, @weizhouapache, @DaanHoogland, @kiranchavala, as for the use case of specifying segregated compute offerings for each Kubernetes cluster plane, it looks good to me.

Regarding the other eight use cases addressed by this PR, unfortunately, I don’t have time to test them thoroughly enough to cover all possible workflows or to review more than 7,000 lines of code. Thus, I won’t be providing an opinion on those parts.

@sureshanaparti (Contributor)

Merging this, based on reviews and tests.

@sureshanaparti merged commit 6adfda2 into apache:main on Jun 19, 2025
24 of 26 checks passed
The github-project-automation bot moved this from In Progress to Done in Apache CloudStack 4.21.0 on Jun 19, 2025
@DaanHoogland deleted the cks-enhancements-upstream branch on June 19, 2025 06:44
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Jun 19, 2025
CKS Enhancements:

* Ability to specify different compute or service offerings for different types of CKS cluster nodes – worker, master or etcd

* Ability to use CKS ready custom templates for CKS cluster nodes

* Add and Remove external nodes to and from a kubernetes cluster

Co-authored-by: nvazquez <[email protected]>

* Update remove node timeout global setting

* CKS/NSX : Missing variables in worker nodes

* CKS: Fix ISO attach logic

* CKS: Fix ISO attach logic

* address comment

* Fix Port - Node mapping when cluster is scaled in the presence of external node(s)

* CKS: Externalize control and worker node setup wait time and installation attempts

* Fix logger

* Add missing headers and fix end of line on files

* CKS Mark Nodes for Manual Upgrade and Filter Nodes to add to CKS cluster from the same network

* Add support to deploy CKS cluster nodes on hosts dedicated to a domain

---------

Co-authored-by: Pearl Dsilva <[email protected]>

* Support unstacked ETCD

---------

Co-authored-by: nvazquez <[email protected]>

* Fix CKS cluster scaling and minor UI improvement

* Reuse k8s cluster public IP for etcd nodes and rename etcd nodes

* Fix DNS resolver issue

* Update UDP active monitor to ICMP

* Add hypervisor type to CKS cluster creation to fix CKS cluster creation when External hosts added

* Fix build

* Fix logger

* Modify hypervisor param description in the create CKS cluster API

* CKS delete fails when external nodes are present

* CKS delete fails when external nodes are present

* address comment

* Improve network rules cleanup on failure adding external nodes to CKS cluster

* UI: Fix etcd template was not honoured

* UI: Fix etcd template was not honoured

* Refactor

* CKS: Exclude etcd nodes when calculating port numbers

* Fix network cleanup in case of CKS cluster failure

* Externalize retries and interval for NSX segment deletion

* Fix CKS scaling when external node(s) present in the cluster

* CKS: Fix port numbers displayed against ETCD nodes

* Add node version details to every node of k8s cluster - as we now support manual upgrade

* Add node version details to every node of k8s cluster - as we now support manual upgrade

* update column name

* CKS: Exclude etcd nodes when calculating port numbers

* update param name

* update param

* UI: Fix CKS cluster creation templates listing for non admins

* CKS: Prevent etcd node start port number to coincide with k8s cluster start port numbers

* CKS: Set default kubernetes cluster node version to the kubernetes cluster version on upgrade

* CKS: Set default kubernetes cluster node version to the kubernetes cluster version on upgrade

* consolidate query

* Fix upgrade logic

---------

Co-authored-by: nvazquez <[email protected]>

* Fix CKS cluster version upgrade

* CKS: Fix etcd port numbers being skipped

* Fix CKS cluster with etcd nodes on VPC

* Move schema and upgrade for 4.20

* Fix logger

* Fix after rebasing

* Add support for using different CNI plugins with CKS

* Add support for using different CNI plugins with CKS

* remove unused import

* Add UI support and list cni config API

* necessary UI changes

* add license

* changes to support external cni

* UI changes

* Fix NPE on restarting VPC with additional public IPs

* fix merge conflict

* add asnumber to create k8s svc layer

* support cni framework to use as-numbers

* update code

* condition to ignore undefined jinja template variables

* CKS: Do not pass AS number when network ID is passed

* Fix deletion of Userdata / CNI Configuration in projects

* CKS: Add CNI configuration details to the response and UI

* Explicit events for registering cni configuration

* Add Delete cni configuration API

* Fix CKS deployment when using VPC tiers with custom ACLs

* Fix DNS list on VR

* CKS: Use Network offering of the network passed during CKS cluster creation to get the AS number

* CKS cluster with guest IP

* Fix: Use control node guest IP as join IP for external nodes addition

* Fix DNS resolver issue

* Improve etcd indexing - start from 1

* CKS: Add external node to a CKS cluster deployed with etcd node(s) successfully

* CKS: Add external node to a CKS cluster deployed with etcd node(s) successfully

* simplify logic

* Tweak setup-kube-system script for baremetal external nodes

* Consider cordoned nodes while getting ready nodes

* Fix CKS cluster scale calculations

* Set token TTL to 0 (no expire) for external etcd

* Fix missing quotes

* Fix build

* Revert PR 9133

* Add calico commands for ens35 interface

* Address review comments: plan CKS cluster deployment based on the node type

* Add qemu-guest-agent dependency for kvm based templates

* Add marvin test for CKS clusters with different offerings per node type

* Remove test tag

* Add marvin test and fix update template for cks and since annotations

* Fix marvin test for adding and removing external nodes

* Fix since version on API params

* Address review comments

* Fix unit test

* Address review comments

* UI: Make CKS public templates visible to non-admins on CKS cluster creation

* Fix linter

* Fix merge error

* Fix positional parameters on the create kubernetes ISO script and make the ETCD version optional

* fix etcd port displayed

* Further improvements to CKS  (#118)

* Multiple nics support on Ubuntu template

* Multiple nics support on Ubuntu template

* supports allocating IP to the nic when VM is added to another network - no delay

* Add option to select DNS or VR IP as resolver on VPC creation

* Add API param and UI to select option

* Add column on vpc and pass the value on the databags for CsDhcp.py to fix accordingly

* Externalize the CKS Configuration, so that end users can tweak the configuration before deploying the cluster

* Add new directory to c8 packaging for CKS config

* Remove k8s configuration from resources and make it configurable

* Revert "Remove k8s configuration from resources and make it configurable"

This reverts commit d5997033ebe4ba559e6478a64578b894f8e7d3db.

* copy conf to mgmt server and consume them from there

* Remove node from cluster

* Add missing /opt/bin directory required by external nodes

* Login to a specific Project view

* add indents

* Fix CKS HA clusters

* Fix build

---------

Co-authored-by: Nicolas Vazquez <[email protected]>

* Add missing headers

* Fix linter

* Address more review comments

* Fix unit test

* Fix scaling case for the same offering

* Revert "Login to a specific Project view"

This reverts commit 95e3756.

* Revert "Fix CKS HA clusters" (#120)

This reverts commit 8dac16a.

* Apply suggestions from code review about user data

Co-authored-by: Suresh Kumar Anaparti <[email protected]>

* Update api/src/main/java/org/apache/cloudstack/api/command/user/userdata/BaseRegisterUserDataCmd.java

Co-authored-by: Suresh Kumar Anaparti <[email protected]>

* Refactor column names and schema path

* Fix scaling for non existing previous offering per node type

* Update node offering entry if there was an existing offering but a global service offering has been provided on scale

---------

Co-authored-by: Pearl Dsilva <[email protected]>
Co-authored-by: Daan Hoogland <[email protected]>
Co-authored-by: Suresh Kumar Anaparti <[email protected]>
Projects
Status: Done
10 participants