Skip to content

Conversation

@sudharshanibm
Copy link
Contributor

@sudharshanibm sudharshanibm commented Aug 18, 2025

Add terraform for provisioning s390x build cluster on ibmcloud

  • k8s-s390x-infra-setup: Terraform resources to set up the necessary infrastructure in an IBM Cloud account, which will serve as prerequisites for the k8s-s390x-build-cluster automation.
  • k8s-s390x-build-cluster: Terraform resources to provision the s390x Build Cluster infrastructure in IBM Cloud.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 18, 2025
@k8s-ci-robot
Copy link
Contributor

Welcome @sudharshanibm!

It looks like this is your first PR to kubernetes/k8s.io 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/k8s.io has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

Hi @sudharshanibm. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. area/infra Infrastructure management, infrastructure design, code in infra/ labels Aug 18, 2025
@k8s-ci-robot k8s-ci-robot added the sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. label Aug 18, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sudharshanibm
Once this PR has been reviewed and has the lgtm label, please assign upodroid for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

create_duration = "180s" # Wait 3 minutes for full initialization
}

resource "null_resource" "bastion_setup" {
Copy link
Contributor

@bryantbiggs bryantbiggs Sep 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not very declarative - is this truly required or can the verification be run as a one-off by someone? do we expect things to not be setup correct via Terraform (i.e. - verification here fails) periodically or at all? (if so - that seems problematic and who or what is intended to fix that 😅 )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bastion instance’s initial setup and networking configuration are handled through a cloud-init script in user_data. Runtime verification of the bastion’s internal state is treated as an operational concern, and can be performed outside of Terraform via automation/manual checks.

vpc = data.ibm_is_vpc.vpc.id
}

data "ibm_is_security_group" "master_sg" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use an improved alternate terminology? main, primary, control-plane?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated all occurrences of master to control-plane

@@ -0,0 +1,129 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the use of k8s-* throughout the file names, resource names, etc. seems to be redundant (i.e. - its defined in a K8s repo/org so its assumed to be for K8s, no need to re-state that in the names)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @bryantbiggs , thanks for pointing that out
I followed the same naming convention since the existing s390x setup is already connected to ArgoCD with k8s-* prefixed names (ref: #8332 ).

See the License for the specific language governing permissions and
limitations under the License.
*/
resource "ibm_is_lb" "k8s_load_balancer" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the names should be viewed as a hierarchy level just below the resource type, but read together. so ibm_is_lb.k8s_load_balancer is essentially IBM load balancer - Kubernetes load balancer

what about something shorter and more descriptive public (ibm_is.lb.public)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated k8s_load_balancer to public

limitations under the License.
*/
resource "ibm_is_lb" "k8s_load_balancer" {
name = "k8s-s390x-lb"
Copy link
Contributor

@bryantbiggs bryantbiggs Sep 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, the name is very redundant. Its in a K8s repo, is this infrastructure deployed in an account that hosts resources for things outside of K8s? If yes = fine, keep the k8s, if not, drop it. Also, the resource is a load balancer so no need to add that to the name (my name is Bryant Biggs not Bryant Biggs human 😅 ). Lastly, I assume this is only for testing on s390x platforms so is that adding anything to the name? Maybe it is, maybe it isn't - I don't have enough context to say. But what about something more descriptive as to its intent and aligned with resources already defined in the repo

kops-infra-ci-name = "kops-infra-ci"

what about s390x-ci? or if you desire the full name k8s-s390x-ci?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this same thinking above applies throughout the change set

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated k8s-s390x-lb to k8s-s390x-ci

health_timeout = 2
health_type = "tcp"
health_monitor_url = "/"
health_monitor_port = 6443
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6443 appears in a number of places - good chance it should be a local variable to avoid mis-matches

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored to use a variable api_server_port (default 6443)

name = "k8s-s390x-subnet"
}

data "ibm_is_security_group" "bastion_sg" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
data "ibm_is_security_group" "bastion_sg" {
data "ibm_is_security_group" "bastion" {

Security group is already in the resource type - don't repeat in the resource name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already have resource ibm_is_instance bastion defined in bastion.tf, so reusing bastion here would cause a naming conflict

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no - thats a different resource from this data source so no conflict

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed bastion_sg --> bastion

}

resource "ibm_is_instance" "control_plane" {
count = var.control_plane.count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be a for_each so that we avoid amplifying disruptions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the control-plane to use for_each in nodes.tf

default = "eu-de-1"
}

variable "resource_group_name" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the variable, Thanks!

}
}

variable "bastion" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bastion now generated in locals

}
}

variable "control_plane" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

control-plane now generated in locals

}
}

variable "compute" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compute now generated in locals

@@ -0,0 +1,136 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

variables are mostly used in modules - I would argue, are these required? Should they be local variables instead (if they are used in multiple places but not intended to be configured dynamically outside of whats stored in git)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, these values are intended to be configurable by automation (Ansible playbooks) rather than hard-coded in the repo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ansible to deploy Terraform - why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for circling back on this. I realize my earlier explanation may have caused some confusion. What I meant is that while we do have downstream automation that consumes certain outputs for cluster setup, that doesn’t mean all of the Terraform configuration itself needs to be exposed as variables.

After revisiting your feedback, I’ve updated the code so that only true external inputs like API key, region, secrets manager ID, counts, profiles, etc., remain as variables. The structured definitions for bastion, control-plane, and compute nodes are now generated in locals, which keeps the interface cleaner and avoids over-exposing internal implementation details

}

resource "ibm_is_instance" "compute" {
count = var.compute.count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise, use for_each

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the compute to use for_each in nodes.tf

}
}

resource "null_resource" "node_setup" {
Copy link
Contributor

@bryantbiggs bryantbiggs Sep 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't appear to be doing anything - can it be removed?

@@ -0,0 +1,43 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outputs are intended to be consumed elsewhere - are all of these intended to be consumed somewhere else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, these outputs are consumed in our Ansible automation for installing Kubernetes. Specifically, the IPs and load balancer hostname are passed into the Ansible playbook ansible-playbook -v -i examples/k8s-build-cluster/hosts.yml install-k8s-ha.yaml -e @group_vars/bastion_configuration --extra-vars @group_vars/all to configure the bastion, control-plane, and worker nodes.


`ibmcloud_api_key` is the only required variable that you must set in order to proceed. You can set this key either by adding it to your `var.tfvars` file or by exporting it as an environment variable.

**Option 1:** Set in `var.tfvars` file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tfvars or variables.tf should not be required here in my opinion - just define the configuration plainly, its not a module

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You’re right, since this isn’t a module, ideally we wouldn’t require variables.tf or var.tfvars. however in this setup the TF run creates the Secrets Manager itself, so we need the IBM Cloud API key upfront to provision that resource.
For now, the key must be provided via var.tfvars or environment variable.

Terraform will display a plan of the actions it will take, and you'll be prompted to confirm the execution. Type `yes` to proceed.

**6 .Get Output Information**
<br> Once the infrastructure has been provisioned, use the terraform output command to list details about the provisioned resources.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for what purpose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these outputs are used by our ansible automation to install k8s. The IPs and load balancer hostname from Terraform are passed into the playbook (install-k8s-ha.yaml) to configure the bastion, control plane, and worker nodes

See the License for the specific language governing permissions and
limitations under the License.
*/
resource "ibm_resource_group" "build_resource_group" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

single resource modules are no bueno - just use the resource as is in the configuration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the resource_group module and now define ibm_resource_group directly in main.tf.
the other modules secrets_manager and vpc consume its ID from there

z_service_cred_secret_name = "k8s-s390x-sm-service-credentials-secret"
}

# Try to find existing Secrets Manager by name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no no no - declarative. Terraform should not be treated as "if this else that". its just "do this"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored the code to make it fully declarative
the secrets manager resource is always created and downstream secrets reference it directly instead of using any conditional logic

See the License for the specific language governing permissions and
limitations under the License.
*/
variable "resource_group_id" {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well thats not nice! we should have the proper definition - description/type/default

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated with description and type

See the License for the specific language governing permissions and
limitations under the License.
*/
terraform {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need the min required versions of Terraform and the provider

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The minimum required versions are defined in versions.tf

resource "ibm_is_security_group_rule" "master_inbound_from_internal_cidr" {
group = ibm_is_security_group.master_sg.id
direction = "inbound"
remote = "10.243.0.0/24" # adjust based on your CIDR
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

??? who should adjust based on what CIDR - this is hard coded

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced hardcoded CIDR "10.243.0.0/24" with dynamic references to the actual subnet CIDR block: ibm_is_subnet.subnet.ipv4_cidr_block

limitations under the License.
*/

output "secrets_manager_id" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a module so shouldn't required variables.tf or outputs.tf

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You’re right that this folder isn’t a reusable module in the TF sense. The reason I’ve added an outputs.tf here is because the k8s-s390x-infra-setup stage produces values secrets_manager_id that are explicitly required by the k8s-s390x-build-cluster stage.

as shown in the build-cluster README, we need to export this ID before running the cluster provisioning

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't copy paste across statefiles though - either refer to the state itself or use the AWS provider to refer to the value directly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that makes sense
The intent was to make the secrets_manager_id available for the build-cluster stage, but I see your point about not copy-pasting state outputs manually.
I can switch this to use a terraform_remote_state data source so that the build-cluster config reads the value directly from the infra-setup state instead of requiring a manual export

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @bryantbiggs,
I removed outputs.tf and am not using terraform_remote_state because:

  • the infra setup is one-time, while build-clusters are multiple, so each cluster creation shouldn’t depend on the infra state.
  • to avoid state coupling, there’s no need to copy outputs or access remote state, keeping configs independent.

Also, the README has been updated with clear steps to retrieve the Secrets Manager GUID and export it as a Terraform variable for cluster provisioning.

I’d be happy to hear your suggestions or feedback on this approach.

zone = "eu-de-1"
}

provider "ibm" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

min required versions of Terraform and providers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The minimum required versions are defined in versions.tf

region = local.region
zone = local.zone
}
provider "ibm" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 3 copies of the same provider if their configurations are all the same? typically aliased providers are required if you are deploying resources into multiple regions or multiple accounts but these are all using the same API key and region

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the duplicate provider blocks


locals {
bastion_nodes = {
for idx in range(1, var.bastion_count + 1) :
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is still susceptible to the issues created by using index based resource creation (i.e. like count). any disruption to the first bastion will cause all instances to be replaced

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for pointing this out. I’ve updated the code to replace the count/range style with a for_each over a map of bastion nodes

using named keys (primary, secondary, ...) now tracks bastions by stable identifiers rather than by index.
if one bastion is changed/removed only that resource is affected instead of forcing all to be recreated

right now I’ve defined a single primary bastion but the pattern allows adding more bastions safely just by extending the map

@sudharshanibm sudharshanibm force-pushed the k8s-s390x-infra branch 8 times, most recently from 75dbb4d to d98ca65 Compare September 23, 2025 10:02
@sudharshanibm sudharshanibm force-pushed the k8s-s390x-infra branch 2 times, most recently from 0eeff67 to 02f6717 Compare October 13, 2025 06:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/infra Infrastructure management, infrastructure design, code in infra/ cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants