From 41242e45d55aa1456e1fa9ea94ad18b00bceedb9 Mon Sep 17 00:00:00 2001 From: Alberto Morgante Medina Date: Thu, 24 Oct 2024 09:01:20 +0200 Subject: [PATCH] Add some performance configs and 3.1 fixes (#475) * Bump SUMA to 5.0.0 (#467) Signed-off-by: Atanas Dinov * Add Elemental extension chart (#468) Signed-off-by: Atanas Dinov * update with performance config and some 3.1 fixes --------- Signed-off-by: Atanas Dinov Co-authored-by: Atanas Dinov (cherry picked from commit ccf3dc69fcf5dfb806b9eee5a8db244ade3b0fca) --- .../product/atip-automated-provision.adoc | 395 ++++++++++-------- asciidoc/product/atip-features.adoc | 95 +++-- asciidoc/product/atip-requirements.adoc | 25 +- 3 files changed, 319 insertions(+), 196 deletions(-) diff --git a/asciidoc/product/atip-automated-provision.adoc b/asciidoc/product/atip-automated-provision.adoc index 81f3e343..f0c72cc4 100644 --- a/asciidoc/product/atip-automated-provision.adoc +++ b/asciidoc/product/atip-automated-provision.adoc @@ -48,6 +48,15 @@ The following sections describe the different directed network provisioning work * xref:airgap-deployment[] + +[IMPORTANT NOTE] +==== +The following sections show how to prepare the different scenarios for the directed network provisioning workflow using ATIP. +For examples of the different configurations options for deployment (incl. air-gapped environments, DHCP and DHCP-less networks, private container registries, etc.), see the https://github.com/suse-edge/atip/tree/release-3.1/telco-examples/edge-clusters[SUSE ATIP repository]. +==== + +[#single-node] + [#eib-edge-image-connected] === Prepare downstream cluster image for connected scenarios @@ -67,7 +76,11 @@ When running Edge Image Builder, a directory is mounted from the host, so it is * `downstream-cluster-config.yaml` is the image definition file, see <> for more details. * The base image when downloaded is `xz` compressed, which must be uncompressed with `unxz` and copied/moved under the `base-images` folder. 
* The `network` folder is optional, see <> for more details. -* The custom/scripts directory contains scripts to be run on first-boot; currently a `01-fix-growfs.sh` script is required to resize the OS root partition on deployment +* The `custom/scripts` directory contains scripts to be run on first-boot: + 1. `01-fix-growfs.sh` script is required to resize the OS root partition on deployment + 2. `02-performance.sh` script is optional and can be used to configure the system for performance tuning. + 3. `03-sriov.sh` script is optional and can be used to configure the system for SR-IOV. +* The `custom/files` directory contains the `performance-settings.sh` and `sriov-auto-filler.sh` files to be copied to the image during the image creation process. [,console] ---- @@ -78,7 +91,12 @@ When running Edge Image Builder, a directory is mounted from the host, so it is | └ configure-network.sh └── custom/ └ scripts/ - └ 01-fix-growfs.sh + | └ 01-fix-growfs.sh + | └ 02-performance.sh + | └ 03-sriov.sh + └ files/ + └ performance-settings.sh + └ sriov-auto-filler.sh ---- ===== Downstream cluster image definition file @@ -92,7 +110,7 @@ image: imageType: RAW arch: x86_64 baseImage: SL-Micro.x86_64-6.0-Base-RT-GM2.raw - outputImageName: eibimage-slemicro55rt-telco.raw + outputImageName: eibimage-slmicro60rt-telco.raw operatingSystem: kernelArgs: - ignition.platform.id=openstack @@ -100,6 +118,10 @@ operatingSystem: systemd: disable: - rebootmgr + - transactional-update.timer + - transactional-update-cleanup.timer + - fstrim + - time-sync.target users: - username: root encryptedPassword: ${ROOT_PASSWORD} @@ -143,6 +165,41 @@ growfs() { growfs / ---- +[#add-custom-script-performance] +===== Performance script + +The following optional script (`custom/scripts/02-performance.sh`) can be used to configure the system for performance tuning: + +[,shell] +---- +#!/bin/bash + +# create the folder to extract the artifacts there +mkdir -p /opt/performance-settings + +# copy the artifacts 
+cp performance-settings.sh /opt/performance-settings/ +---- + +The content of `custom/files/performance-settings.sh` is a script that can be used to configure the system for performance tuning and can be downloaded from the following https://github.com/suse-edge/atip/blob/release-3.1/telco-examples/edge-clusters/dhcp/eib/custom/files/performance-settings.sh[link]. + +[#add-custom-script-sriov] +===== SR-IOV script + +The following optional script (`custom/scripts/03-sriov.sh`) can be used to configure the system for SR-IOV: + +[,shell] +---- +#!/bin/bash + +# create the folder to extract the artifacts there +mkdir -p /opt/sriov +# copy the artifacts +cp sriov-auto-filler.sh /opt/sriov/sriov-auto-filler.sh +---- + +The content of `custom/files/sriov-auto-filler.sh` is a script that can be used to configure the system for SR-IOV and can be downloaded from the following https://github.com/suse-edge/atip/blob/release-3.1/telco-examples/edge-clusters/dhcp/eib/custom/files/sriov-auto-filler.sh[link]. + [NOTE] ==== Add your own custom scripts to be executed during the provisioning process using the same approach. 
@@ -162,7 +219,7 @@ image: imageType: RAW arch: x86_64 baseImage: SL-Micro.x86_64-6.0-Base-RT-GM2.raw - outputImageName: eibimage-slemicro55rt-telco.raw + outputImageName: eibimage-slmicro60rt-telco.raw operatingSystem: kernelArgs: - ignition.platform.id=openstack @@ -170,6 +227,10 @@ operatingSystem: systemd: disable: - rebootmgr + - transactional-update.timer + - transactional-update-cleanup.timer + - fstrim + - time-sync.target users: - username: root encryptedPassword: ${ROOT_PASSWORD} @@ -178,12 +239,12 @@ operatingSystem: packages: packageList: - jq - - dpdk22 - - dpdk22-tools + - dpdk + - dpdk-tools - libdpdk-23 - pf-bb-config additionalRepos: - - url: https://download.opensuse.org/repositories/isv:/SUSE:/Edge:/Telco/SLEMicro5.5/ + - url: https://download.opensuse.org/repositories/isv:/SUSE:/Edge:/Telco/SL-Micro_6.0_images/ sccRegistrationCode: ${SCC_REGISTRATION_CODE} ---- @@ -245,7 +306,7 @@ podman run --rm --privileged -it -v $PWD:/eib \ build --definition-file downstream-cluster-config.yaml ---- -This creates the output ISO image file named `eibimage-slemicro55rt-telco.raw`, based on the definition described above. +This creates the output ISO image file named `eibimage-slmicro60rt-telco.raw`, based on the definition described above. The output image must then be made available via a webserver, either the media-server container enabled via the <> or some other locally accessible server. In the examples below, we refer to this server as `imagecache.local:8080` @@ -270,8 +331,12 @@ When running Edge Image Builder, a directory is mounted from the host, so it is * `downstream-cluster-airgap-config.yaml` is the image definition file, see <> for more details. * The base image when downloaded is `xz` compressed, which must be uncompressed with `unxz` and copied/moved under the `base-images` folder. * The `network` folder is optional, see <> for more details. 
-* The `custom/scripts` directory contains scripts to be run on first-boot; currently a `01-fix-growfs.sh` script is required to resize the OS root partition on deployment. For air-gap scenarios, a script `02-airgap.sh` is required to copy the images to the right place during the image creation process. -* The `custom/files` directory contains the `rke2` and the `cni` images to be copied to the image during the image creation process. +* The `custom/scripts` directory contains scripts to be run on first-boot: + 1. `01-fix-growfs.sh` script is required to resize the OS root partition on deployment. + 2. `02-airgap.sh` script is required to copy the images to the right place during the image creation process for air-gapped environments. + 3. `03-performance.sh` script is optional and can be used to configure the system for performance tuning. + 4. `04-sriov.sh` script is optional and can be used to configure the system for SR-IOV. +* The `custom/files` directory contains the `rke2` and the `cni` images to be copied to the image during the image creation process. Also, the optional `performance-settings.sh` and `sriov-auto-filler.sh` files can be included. 
[,console] ---- @@ -289,9 +354,13 @@ When running Edge Image Builder, a directory is mounted from the host, so it is | └ rke2-images.linux-amd64.tar.zst | └ rke2.linux-amd64.tar.zst | └ sha256sum-amd64.txt + | └ performance-settings.sh + | └ sriov-auto-filler.sh └ scripts/ └ 01-fix-growfs.sh └ 02-airgap.sh + └ 03-performance.sh + └ 04-sriov.sh ---- ===== Downstream cluster image definition file @@ -337,6 +406,41 @@ cp install.sh /opt/ cp rke2-images*.tar.zst rke2.linux-amd64.tar.gz sha256sum-amd64.txt /opt/rke2-artifacts/ ---- +[#add-custom-script-performance2] +===== Performance script + +The following optional script (`custom/scripts/03-performance.sh`) can be used to configure the system for performance tuning: + +[,shell] +---- +#!/bin/bash + +# create the folder to extract the artifacts there +mkdir -p /opt/performance-settings + +# copy the artifacts +cp performance-settings.sh /opt/performance-settings/ +---- + +The content of `custom/files/performance-settings.sh` is a script that can be used to configure the system for performance tuning and can be downloaded from the following https://github.com/suse-edge/atip/blob/release-3.1/telco-examples/edge-clusters/dhcp/eib/custom/files/performance-settings.sh[link]. + +[#add-custom-script-sriov2] +===== SR-IOV script + +The following optional script (`custom/scripts/04-sriov.sh`) can be used to configure the system for SR-IOV: + +[,shell] +---- +#!/bin/bash + +# create the folder to extract the artifacts there +mkdir -p /opt/sriov +# copy the artifacts +cp sriov-auto-filler.sh /opt/sriov/sriov-auto-filler.sh +---- + +The content of `custom/files/sriov-auto-filler.sh` is a script that can be used to configure the system for SR-IOV and can be downloaded from the following https://github.com/suse-edge/atip/blob/release-3.1/telco-examples/edge-clusters/dhcp/eib/custom/files/sriov-auto-filler.sh[link]. 
+ ===== Custom files for air-gap scenarios The `custom/files` directory contains the `rke2` and the `cni` images to be copied to the image during the image creation process. @@ -385,7 +489,7 @@ edge/sriov-crd-chart:1.3.0 EOF ---- + -.. Generate a local tarball file using the following https://github.com/suse-edge/fleet-examples/blob/release-3.0/scripts/day2/edge-save-oci-artefacts.sh[script] and the list created above: +.. Generate a local tarball file using the following https://github.com/suse-edge/fleet-examples/blob/release-3.1/scripts/day2/edge-save-oci-artefacts.sh[script] and the list created above: + [,shell] ---- @@ -397,7 +501,7 @@ a edge-release-oci-tgz-20240705/sriov-network-operator-chart-1.3.0.tgz a edge-release-oci-tgz-20240705/sriov-crd-chart-1.3.0.tgz ---- + -.. Upload your tarball file to your private registry (e.g. `myregistry:5000`) using the following https://github.com/suse-edge/fleet-examples/blob/release-3.0/scripts/day2/edge-load-oci-artefacts.sh[script] to preload your registry with the helm chart OCI images downloaded in the previous step: +.. Upload your tarball file to your private registry (e.g. `myregistry:5000`) using the following https://github.com/suse-edge/fleet-examples/blob/release-3.1/scripts/day2/edge-load-oci-artefacts.sh[script] to preload your registry with the helm chart OCI images downloaded in the previous step: + [,shell] ---- @@ -407,37 +511,37 @@ $ ./edge-load-oci-artefacts.sh -ad edge-release-oci-tgz-20240705 -r myregistry:5 . Preload with the rest of the images required for SR-IOV: + -.. In this case, we must include the `sr-iov container images for telco workloads (e.g. as a reference, you could get them from https://github.com/suse-edge/charts/blob/release-3.0/charts/sriov-network-operator/1.3.0%2Bup0.1.0/values.yaml[helm-chart values]) +.. In this case, we must include the `sr-iov container images for telco workloads (e.g. 
as a reference, you could get them from https://github.com/suse-edge/charts/blob/release-3.1/charts/sriov-network-operator/1.3.0%2Bup0.1.0/values.yaml[helm-chart values]) + [,shell] ---- $ cat > edge-release-images.txt <> or some other locally accessible server. In the examples below, we refer to this server as `imagecache.local:8080`. @@ -505,7 +609,7 @@ data: apiVersion: metal3.io/v1alpha1 kind: BareMetalHost metadata: - name: flexran-demo + name: example-demo labels: cluster-role: control-plane spec: @@ -643,6 +747,27 @@ spec: ExecStartPost=/bin/sh -c "umount /mnt" [Install] WantedBy=multi-user.target + storage: + files: + # https://docs.rke2.io/networking/multus_sriov#using-multus-with-cilium + - path: /var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml + overwrite: true + contents: + inline: | + apiVersion: helm.cattle.io/v1 + kind: HelmChartConfig + metadata: + name: rke2-cilium + namespace: kube-system + spec: + valuesContent: |- + cni: + exclusive: false + mode: 0644 + user: + name: root + group: + name: root kubelet: extraArgs: - provider-id=metal3://BAREMETALHOST_UUID @@ -672,10 +797,10 @@ spec: matchLabels: cluster-role: control-plane image: - checksum: http://imagecache.local:8080/eibimage-slemicro55rt-telco.raw.sha256 + checksum: http://imagecache.local:8080/eibimage-slmicro60rt-telco.raw.sha256 checksumType: sha256 format: raw - url: http://imagecache.local:8080/eibimage-slemicro55rt-telco.raw + url: http://imagecache.local:8080/eibimage-slmicro60rt-telco.raw ---- The `Metal3DataTemplate` object specifies the `metaData` for the downstream cluster. 
@@ -906,6 +1031,25 @@ spec: WantedBy=multi-user.target storage: files: + # https://docs.rke2.io/networking/multus_sriov#using-multus-with-cilium + - path: /var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml + overwrite: true + contents: + inline: | + apiVersion: helm.cattle.io/v1 + kind: HelmChartConfig + metadata: + name: rke2-cilium + namespace: kube-system + spec: + valuesContent: |- + cni: + exclusive: false + mode: 0644 + user: + name: root + group: + name: root - path: /var/lib/rancher/rke2/server/manifests/endpoint-copier-operator.yaml overwrite: true contents: @@ -1014,10 +1158,10 @@ spec: matchLabels: cluster-role: control-plane image: - checksum: http://imagecache.local:8080/eibimage-slemicro55rt-telco.raw.sha256 + checksum: http://imagecache.local:8080/eibimage-slmicro60rt-telco.raw.sha256 checksumType: sha256 format: raw - url: http://imagecache.local:8080/eibimage-slemicro55rt-telco.raw + url: http://imagecache.local:8080/eibimage-slmicro60rt-telco.raw ---- The `Metal3DataTemplate` object specifies the `metaData` for the downstream cluster. @@ -1137,7 +1281,7 @@ data: apiVersion: metal3.io/v1alpha1 kind: BareMetalHost metadata: - name: flexran-demo + name: example-demo labels: cluster-role: control-plane spec: @@ -1231,26 +1375,26 @@ To make the process clear, the changes required on that block (`RKE2ControlPlane |=== | Parameter | Value | Description -| isolcpus| 1-30,33-62| Isolate the cores 1-30 and 33-62. +| isolcpus| domain,nohz,managed_irq,1-30,33-62| Isolate the cores 1-30 and 33-62. | skew_tick| 1 | Allows the kernel to skew the timer interrupts across the isolated CPUs. | nohz| on | Allows the kernel to run the timer tick on a single CPU when the system is idle. | nohz_full| 1-30,33-62 | kernel boot parameter is the current main interface to configure full dynticks along with CPU Isolation. | rcu_nocbs| 1-30,33-62 | Allows the kernel to run the RCU callbacks on a single CPU when the system is idle. 
-| kthread_cpus| 0,31,32,63 | Allows the kernel to run the kthreads on a single CPU when the system is idle. | irqaffinity| 0,31,32,63 | Allows the kernel to run the interrupts on a single CPU when the system is idle. -| processor.max_cstate| 1 | Prevents the CPU from dropping into a sleep state when idle. -| intel_idle.max_cstate| 0 | Disables the intel_idle driver and allows acpi_idle to be used. +| idle| poll | Minimizes the latency of exiting the idle state. | iommu | pt | Allows to use vfio for the dpdk interfaces. | intel_iommu | on | Enables the use of vfio for VFs. | hugepagesz | 1G | Allows to set the size of huge pages to 1 G. | hugepages | 40 | Number of huge pages defined before. | default_hugepagesz| 1G | Default value to enable huge pages. +| nowatchdog | | Disables the watchdog. +| nmi_watchdog | 0 | Disables the NMI watchdog. |=== * The following systemd services are used to enable the following: ** `rke2-preinstall.service` to replace automatically the `BAREMETALHOST_UUID` and `node-name` during the provisioning process using the Ironic information. - ** `cpu-performance.service` to enable the CPU performance tuning. The `$\{CPU_FREQUENCY\}` has to be replaced with the real values (for example, `2500000` to set the CPU frequency to `2.5GHz`). ** `cpu-partitioning.service` to enable the isolation cores of the `CPU` (for example, `1-30,33-62`). + ** `performance-settings.service` to enable the CPU performance tuning. ** `sriov-custom-auto-vfs.service` to install the `sriov` Helm chart, wait until custom resources are created and run the `/var/sriov-auto-filler.sh` to replace the values in the config map `sriov-custom-auto-config` and create the `sriovnetworknodepolicy` to be used by the workloads. * The `$\{RKE2_VERSION\}` is the version of `RKE2` to be used replacing this value (for example, `v1.28.13+rke2r1`). 
@@ -1271,7 +1415,7 @@ spec: name: single-node-cluster-controlplane replicas: 1 serverConfig: - cni: cilium + cni: calico cniMultusEnable: true preRKE2Commands: - modprobe vfio-pci enable_sriov=1 disable_idle_d3=1 @@ -1343,75 +1487,26 @@ spec: targetNamespace: sriov-network-operator version: 1.3.0 createNamespace: true - - path: /var/sriov-auto-filler.sh - overwrite: true - contents: - inline: | - #!/bin/bash - cat <<- EOF > /var/sriov-networkpolicy-template.yaml - apiVersion: sriovnetwork.openshift.io/v1 - kind: SriovNetworkNodePolicy - metadata: - name: atip-RESOURCENAME - namespace: sriov-network-operator - spec: - nodeSelector: - feature.node.kubernetes.io/network-sriov.capable: "true" - resourceName: RESOURCENAME - deviceType: DRIVER - numVfs: NUMVF - mtu: 1500 - nicSelector: - pfNames: ["PFNAMES"] - deviceID: "DEVICEID" - vendor: "VENDOR" - rootDevices: - - PCIADDRESS - EOF - - export KUBECONFIG=/etc/rancher/rke2/rke2.yaml; export KUBECTL=/var/lib/rancher/rke2/bin/kubectl - while [ $(${KUBECTL} --kubeconfig=${KUBECONFIG} get sriovnetworknodestates.sriovnetwork.openshift.io -n sriov-network-operator -ojson | jq -r '.items[].status.syncStatus') != "Succeeded" ]; do sleep 1; done - input=$(${KUBECTL} --kubeconfig=${KUBECONFIG} get cm sriov-custom-auto-config -n kube-system -ojson | jq -r '.data."config.json"') - jq -c '.[]' <<< $input | while read i; do - interface=$(echo $i | jq -r '.interface') - pfname=$(echo $i | jq -r '.pfname') - pciaddress=$(${KUBECTL} --kubeconfig=${KUBECONFIG} get sriovnetworknodestates.sriovnetwork.openshift.io -n sriov-network-operator -ojson | jq -r ".items[].status.interfaces[]|select(.name==\"$interface\")|.pciAddress") - vendor=$(${KUBECTL} --kubeconfig=${KUBECONFIG} get sriovnetworknodestates.sriovnetwork.openshift.io -n sriov-network-operator -ojson | jq -r ".items[].status.interfaces[]|select(.name==\"$interface\")|.vendor") - deviceid=$(${KUBECTL} --kubeconfig=${KUBECONFIG} get sriovnetworknodestates.sriovnetwork.openshift.io 
-n sriov-network-operator -ojson | jq -r ".items[].status.interfaces[]|select(.name==\"$interface\")|.deviceID") - resourceName=$(echo $i | jq -r '.resourceName') - driver=$(echo $i | jq -r '.driver') - sed -e "s/RESOURCENAME/$resourceName/g" \ - -e "s/DRIVER/$driver/g" \ - -e "s/PFNAMES/$pfname/g" \ - -e "s/VENDOR/$vendor/g" \ - -e "s/DEVICEID/$deviceid/g" \ - -e "s/PCIADDRESS/$pciaddress/g" \ - -e "s/NUMVF/$(echo $i | jq -r '.numVFsToCreate')/g" /var/sriov-networkpolicy-template.yaml > /var/lib/rancher/rke2/server/manifests/$resourceName.yaml - done - mode: 0755 - user: - name: root - group: - name: root kernel_arguments: should_exist: - intel_iommu=on - - intel_pstate=passive - - processor.max_cstate=1 - - intel_idle.max_cstate=0 - iommu=pt + - idle=poll - mce=off - hugepagesz=1G hugepages=40 - hugepagesz=2M hugepages=0 - default_hugepagesz=1G - - kthread_cpus=${NON-ISOLATED_CPU_CORES} - irqaffinity=${NON-ISOLATED_CPU_CORES} - - isolcpus=${ISOLATED_CPU_CORES} + - isolcpus=domain,nohz,managed_irq,${ISOLATED_CPU_CORES} - nohz_full=${ISOLATED_CPU_CORES} - rcu_nocbs=${ISOLATED_CPU_CORES} - rcu_nocb_poll - nosoftlockup + - nowatchdog - nohz=on + - nmi_watchdog=0 + - skew_tick=1 + - quiet systemd: units: - name: rke2-preinstall.service @@ -1431,35 +1526,32 @@ spec: ExecStartPost=/bin/sh -c "umount /mnt" [Install] WantedBy=multi-user.target - - name: cpu-performance.service + - name: cpu-partitioning.service enabled: true contents: | [Unit] - Description=CPU perfomance + Description=cpu-partitioning Wants=network-online.target After=network.target network-online.target [Service] + Type=oneshot User=root - Type=forking - TimeoutStartSec=900 - ExecStart=/bin/sh -c "cpupower frequency-set -g performance; cpupower frequency-set -u ${CPU_FREQUENCY}; cpupower frequency-set -d ${CPU_FREQUENCY}" - RemainAfterExit=yes - KillMode=process + ExecStart=/bin/sh -c "echo isolated_cores=${ISOLATED_CPU_CORES} > /etc/tuned/cpu-partitioning-variables.conf" + ExecStartPost=/bin/sh -c 
"tuned-adm profile cpu-partitioning" + ExecStartPost=/bin/sh -c "systemctl enable tuned.service" [Install] WantedBy=multi-user.target - - name: cpu-partitioning.service + - name: performance-settings.service enabled: true contents: | [Unit] - Description=cpu-partitioning + Description=performance-settings Wants=network-online.target - After=network.target network-online.target + After=network.target network-online.target cpu-partitioning.service [Service] Type=oneshot User=root - ExecStart=/bin/sh -c "echo isolated_cores=${ISOLATED_CPU_CORES} > /etc/tuned/cpu-partitioning-variables.conf" - ExecStartPost=/bin/sh -c "tuned-adm profile cpu-partitioning" - ExecStartPost=/bin/sh -c "systemctl enable tuned.service" + ExecStart=/bin/sh -c "/opt/performance-settings/performance-settings.sh" [Install] WantedBy=multi-user.target - name: sriov-custom-auto-vfs.service @@ -1475,7 +1567,7 @@ spec: TimeoutStartSec=900 ExecStart=/bin/sh -c "while ! /var/lib/rancher/rke2/bin/kubectl --kubeconfig=/etc/rancher/rke2/rke2.yaml wait --for condition=ready nodes --all ; do sleep 2 ; done" ExecStartPost=/bin/sh -c "while [ $(/var/lib/rancher/rke2/bin/kubectl --kubeconfig=/etc/rancher/rke2/rke2.yaml get sriovnetworknodestates.sriovnetwork.openshift.io --ignore-not-found --no-headers -A | wc -l) -eq 0 ]; do sleep 1; done" - ExecStartPost=/bin/sh -c "/var/sriov-auto-filler.sh" + ExecStartPost=/bin/sh -c "/opt/sriov/sriov-auto-filler.sh" RemainAfterExit=yes KillMode=process [Install] @@ -1565,7 +1657,8 @@ spec: namespace: default name: private-registry-cert serverConfig: - cni: cilium + cni: calico + cniMultusEnable: true agentConfig: format: ignition additionalUserData: @@ -1686,7 +1779,7 @@ spec: name: private-registry-cert insecureSkipVerify: false serverConfig: - cni: cilium + cni: calico cniMultusEnable: true preRKE2Commands: - modprobe vfio-pci enable_sriov=1 disable_idle_d3=1 @@ -1803,75 +1896,26 @@ spec: name: root group: name: root - - path: /var/sriov-auto-filler.sh - overwrite: true 
- contents: - inline: | - #!/bin/bash - cat <<- EOF > /var/sriov-networkpolicy-template.yaml - apiVersion: sriovnetwork.openshift.io/v1 - kind: SriovNetworkNodePolicy - metadata: - name: atip-RESOURCENAME - namespace: sriov-network-operator - spec: - nodeSelector: - feature.node.kubernetes.io/network-sriov.capable: "true" - resourceName: RESOURCENAME - deviceType: DRIVER - numVfs: NUMVF - mtu: 1500 - nicSelector: - pfNames: ["PFNAMES"] - deviceID: "DEVICEID" - vendor: "VENDOR" - rootDevices: - - PCIADDRESS - EOF - - export KUBECONFIG=/etc/rancher/rke2/rke2.yaml; export KUBECTL=/var/lib/rancher/rke2/bin/kubectl - while [ $(${KUBECTL} --kubeconfig=${KUBECONFIG} get sriovnetworknodestates.sriovnetwork.openshift.io -n sriov-network-operator -ojson | jq -r '.items[].status.syncStatus') != "Succeeded" ]; do sleep 1; done - input=$(${KUBECTL} --kubeconfig=${KUBECONFIG} get cm sriov-custom-auto-config -n sriov-network-operator -ojson | jq -r '.data."config.json"') - jq -c '.[]' <<< $input | while read i; do - interface=$(echo $i | jq -r '.interface') - pfname=$(echo $i | jq -r '.pfname') - pciaddress=$(${KUBECTL} --kubeconfig=${KUBECONFIG} get sriovnetworknodestates.sriovnetwork.openshift.io -n sriov-network-operator -ojson | jq -r ".items[].status.interfaces[]|select(.name==\"$interface\")|.pciAddress") - vendor=$(${KUBECTL} --kubeconfig=${KUBECONFIG} get sriovnetworknodestates.sriovnetwork.openshift.io -n sriov-network-operator -ojson | jq -r ".items[].status.interfaces[]|select(.name==\"$interface\")|.vendor") - deviceid=$(${KUBECTL} --kubeconfig=${KUBECONFIG} get sriovnetworknodestates.sriovnetwork.openshift.io -n sriov-network-operator -ojson | jq -r ".items[].status.interfaces[]|select(.name==\"$interface\")|.deviceID") - resourceName=$(echo $i | jq -r '.resourceName') - driver=$(echo $i | jq -r '.driver') - sed -e "s/RESOURCENAME/$resourceName/g" \ - -e "s/DRIVER/$driver/g" \ - -e "s/PFNAMES/$pfname/g" \ - -e "s/VENDOR/$vendor/g" \ - -e "s/DEVICEID/$deviceid/g" \ - 
-e "s/PCIADDRESS/$pciaddress/g" \ - -e "s/NUMVF/$(echo $i | jq -r '.numVFsToCreate')/g" /var/sriov-networkpolicy-template.yaml > /var/lib/rancher/rke2/server/manifests/$resourceName.yaml - done - mode: 0755 - user: - name: root - group: - name: root kernel_arguments: should_exist: - intel_iommu=on - - intel_pstate=passive - - processor.max_cstate=1 - - intel_idle.max_cstate=0 - iommu=pt + - idle=poll - mce=off - hugepagesz=1G hugepages=40 - hugepagesz=2M hugepages=0 - default_hugepagesz=1G - - kthread_cpus=${NON-ISOLATED_CPU_CORES} - irqaffinity=${NON-ISOLATED_CPU_CORES} - - isolcpus=${ISOLATED_CPU_CORES} + - isolcpus=domain,nohz,managed_irq,${ISOLATED_CPU_CORES} - nohz_full=${ISOLATED_CPU_CORES} - rcu_nocbs=${ISOLATED_CPU_CORES} - rcu_nocb_poll - nosoftlockup + - nowatchdog - nohz=on + - nmi_watchdog=0 + - skew_tick=1 + - quiet systemd: units: - name: rke2-preinstall.service @@ -1906,6 +1950,19 @@ spec: ExecStartPost=/bin/sh -c "systemctl enable tuned.service" [Install] WantedBy=multi-user.target + - name: performance-settings.service + enabled: true + contents: | + [Unit] + Description=performance-settings + Wants=network-online.target + After=network.target network-online.target cpu-partitioning.service + [Service] + Type=oneshot + User=root + ExecStart=/bin/sh -c "/opt/performance-settings/performance-settings.sh" + [Install] + WantedBy=multi-user.target - name: sriov-custom-auto-vfs.service enabled: true contents: | @@ -1919,7 +1976,7 @@ spec: TimeoutStartSec=900 ExecStart=/bin/sh -c "while ! 
/var/lib/rancher/rke2/bin/kubectl --kubeconfig=/etc/rancher/rke2/rke2.yaml wait --for condition=ready nodes --all ; do sleep 2 ; done" ExecStartPost=/bin/sh -c "while [ $(/var/lib/rancher/rke2/bin/kubectl --kubeconfig=/etc/rancher/rke2/rke2.yaml get sriovnetworknodestates.sriovnetwork.openshift.io --ignore-not-found --no-headers -A | wc -l) -eq 0 ]; do sleep 1; done" - ExecStartPost=/bin/sh -c "/var/sriov-auto-filler.sh" + ExecStartPost=/bin/sh -c "/opt/sriov/sriov-auto-filler.sh" RemainAfterExit=yes KillMode=process [Install] diff --git a/asciidoc/product/atip-features.adoc b/asciidoc/product/atip-features.adoc index c1cd235a..36a5d290 100644 --- a/asciidoc/product/atip-features.adoc +++ b/asciidoc/product/atip-features.adoc @@ -17,6 +17,7 @@ The directed network provisioning deployment method is used, as described in the The following topics are covered in this section: * <>: Kernel image to be used by the real-time kernel. +* <>: Kernel arguments to be used by the real-time kernel for maximum performance and low latency running telco workloads. * <>: Tuned configuration to be used by the real-time kernel. * <>: CNI configuration to be used by the Kubernetes cluster. * <>: SR-IOV configuration to be used by the Kubernetes workloads. @@ -60,6 +61,38 @@ In our case, if you have installed a real-time image like `SLE Micro RT`, kernel For more information about the real-time kernel, visit https://www.suse.com/products/realtime/[SUSE Real Time]. ==== +[#kernel-args] +=== Kernel arguments for low latency and high performance + +The kernel arguments are important to be configured to enable the real-time kernel to work properly giving the best performance and low latency to run telco workloads. There are some important concepts to keep in mind when configuring the kernel arguments for this use case: + +* Remove `kthread_cpus` when using SUSE real-time kernel. This parameter controls on which CPUs kernel threads are created. 
It also controls which CPUs are allowed for PID 1 and for loading kernel modules (the kmod user-space helper). This parameter is not +recognized and does not have any effect. + +* Add `domain,nohz,managed_irq` flags to `isolcpus` kernel argument. Without any flags, `isolcpus` is equivalent to specifying only the `domain` flag. This isolates the specified CPUs from scheduling, including kernel tasks. The `nohz` flag stops the scheduler tick on the specified CPUs (if only one task is runnable on a CPU), and the `managed_irq` flag avoids routing +managed external (device) interrupts at the specified CPUs. + +* Remove `intel_pstate=passive`. This option configures `intel_pstate` to work with generic cpufreq governors, but to make this work, it disables hardware-managed P-states (`HWP`) as a side effect. To reduce the hardware latency, this option is not recommended for real-time workloads. + +* Replace `intel_idle.max_cstate=0 processor.max_cstate=1` with `idle=poll`. To avoid C-State transitions, the `idle=poll` option is used to disable the C-State transitions and keep the CPU in the highest C-State. The `intel_idle.max_cstate=0` option disables `intel_idle`, so `acpi_idle` is used, and `acpi_idle.max_cstate=1` then sets max C-state for acpi_idle. +On x86_64 architectures, the first ACPI C-State is always `POLL`, but it uses a `poll_idle()` function, which may introduce some tiny latency by reading the clock periodically, and restarting the main loop in `do_idle()` after a timeout (this also involves clearing and setting the `TIF_POLL` task flag). +In contrast, `idle=poll` runs in a tight loop, busy-waiting for a task to be rescheduled. This minimizes the latency of exiting the idle state, but at the cost of keeping the CPU running at full speed in the idle thread. + +* Disable C1E in BIOS. This option is important to disable the C1E state in the BIOS to avoid the CPU from entering the C1E state when idle. 
The C1E state is a low-power state that can introduce latency when the CPU is idle. + +* Add `nowatchdog` to disable the soft-lockup watchdog which is implemented as a timer running in the timer hard-interrupt context. When it expires (i.e. a soft lockup is detected), it will print a warning (in the hard interrupt context), ruining any latency targets. Even if it never expires, it goes onto the timer list, slightly increasing the overhead of every timer interrupt. +This option also disables the NMI watchdog, so NMIs cannot interfere. + +* Add `nmi_watchdog=0`. This option disables only the NMI watchdog. + +This is an example of the kernel argument list including the aforementioned adjustments: + +[,shell] +---- +GRUB_CMDLINE_LINUX="skew_tick=1 BOOT_IMAGE=/boot/vmlinuz-6.4.0-9-rt root=UUID=77b713de-5cc7-4d4c-8fc6-f5eca0a43cf9 rd.timeout=60 rd.retry=45 console=ttyS1,115200 console=tty0 default_hugepagesz=1G hugepages=0 hugepages=40 hugepagesz=1G hugepagesz=2M ignition.platform.id=openstack intel_iommu=on iommu=pt irqaffinity=0,19,20,39 isolcpus=domain,nohz,managed_irq,1-18,21-38 mce=off nohz=on net.ifnames=0 nmi_watchdog=0 nohz_full=1-18,21-38 nosoftlockup nowatchdog quiet rcu_nocb_poll rcu_nocbs=1-18,21-38 rcupdate.rcu_cpu_stall_suppress=1 rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1 rcupdate.rcu_task_stall_timeout=0 rcutree.kthread_prio=99 security=selinux selinux=1" +---- + + [#cpu-tuned-configuration] === CPU tuned configuration @@ -71,7 +104,7 @@ To enable and configure this feature, the first thing is to create a profile for ---- $ echo "export tuned_params" >> /etc/grub.d/00_tuned -$ echo "isolated_cores=1-30,33-62" >> /etc/tuned/cpu-partitioning-variables.conf +$ echo "isolated_cores=1-18,21-38" >> /etc/tuned/cpu-partitioning-variables.conf $ tuned-adm profile cpu-partitioning Tuned (re)started, changes applied.
@@ -86,8 +119,8 @@ The following options are important to be customized with your current hardware | parameter | value | description | isolcpus -| 1-30,33-62 -| Isolate the cores 1-30 and 33-62 +| domain,nohz,managed_irq,1-18,21-38 +| Isolate the cores 1-18 and 21-38 | skew_tick | 1 @@ -98,28 +131,28 @@ The following options are important to be customized with your current hardware | This option allows the kernel to run the timer tick on a single CPU when the system is idle. | nohz_full -| 1-30,33-62 +| 1-18,21-38 | kernel boot parameter is the current main interface to configure full dynticks along with CPU Isolation. | rcu_nocbs -| 1-30,33-62 +| 1-18,21-38 | This option allows the kernel to run the RCU callbacks on a single CPU when the system is idle. -| kthread_cpus -| 0,31,32,63 -| This option allows the kernel to run the kthreads on a single CPU when the system is idle. - | irqaffinity -| 0,31,32,63 +| 0,19,20,39 | This option allows the kernel to run the interrupts on a single CPU when the system is idle. -| processor.max_cstate -| 1 -| This option prevents the CPU from dropping into a sleep state when idle +| idle +| poll +| This minimizes the latency of exiting the idle state, but at the cost of keeping the CPU running at full speed in the idle thread. -| intel_idle.max_cstate +| nmi_watchdog | 0 -| This option disables the intel_idle driver and allows acpi_idle to be used +| This option disables only the NMI watchdog. + +| nowatchdog +| +| This option disables the soft-lockup watchdog which is implemented as a timer running in the timer hard-interrupt context. |=== With the values shown above, we are isolating 60 cores, and we are using four cores for the OS. 
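The number of isolated and housekeeping cores follows directly from the CPU lists in the table. As a minimal sketch (the `count_cpus` helper is hypothetical and not part of the ATIP repository), a kernel-style CPU list can be counted as shown here. The earlier `1-30,33-62` list yields 60 isolated cores, whereas the `1-18,21-38` list yields 36, with four housekeeping cores (`0,19,20,39`) in both layouts.

```shell
#!/bin/bash
# Hypothetical helper: count the CPUs named by a kernel-style CPU list
# such as "1-18,21-38" (comma-separated single CPUs or inclusive ranges).
count_cpus() {
    local list="$1" total=0 part first last
    local IFS=','
    for part in $list; do
        first="${part%-*}"   # text before the dash (or the whole item)
        last="${part#*-}"    # text after the dash (or the whole item)
        total=$(( total + last - first + 1 ))
    done
    echo "$total"
}

# Example usage:
#   count_cpus 1-18,21-38   # prints 36 (isolated cores)
#   count_cpus 0,19,20,39   # prints 4  (housekeeping cores)
```

Running the helper against the values you plan to pass to `isolcpus`, `nohz_full` and `rcu_nocbs` is a quick way to confirm that the lists match the hardware before editing the GRUB configuration.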
@@ -130,7 +163,7 @@ Edit the `/etc/default/grub` file and add the parameters mentioned above: [,shell] ---- -GRUB_CMDLINE_LINUX="intel_iommu=on intel_pstate=passive processor.max_cstate=1 intel_idle.max_cstate=0 iommu=pt usbcore.autosuspend=-1 selinux=0 enforcing=0 nmi_watchdog=0 crashkernel=auto softlockup_panic=0 audit=0 mce=off hugepagesz=1G hugepages=40 hugepagesz=2M hugepages=0 default_hugepagesz=1G kthread_cpus=0,31,32,63 irqaffinity=0,31,32,63 isolcpus=1-30,33-62 skew_tick=1 nohz_full=1-30,33-62 rcu_nocbs=1-30,33-62 rcu_nocb_poll" +GRUB_CMDLINE_LINUX="skew_tick=1 BOOT_IMAGE=/boot/vmlinuz-6.4.0-9-rt root=UUID=77b713de-5cc7-4d4c-8fc6-f5eca0a43cf9 rd.timeout=60 rd.retry=45 console=ttyS1,115200 console=tty0 default_hugepagesz=1G hugepages=0 hugepages=40 hugepagesz=1G hugepagesz=2M ignition.platform.id=openstack intel_iommu=on iommu=pt irqaffinity=0,19,20,39 isolcpus=domain,nohz,managed_irq,1-18,21-38 mce=off nohz=on net.ifnames=0 nmi_watchdog=0 nohz_full=1-18,21-38 nosoftlockup nowatchdog quiet rcu_nocb_poll rcu_nocbs=1-18,21-38 rcupdate.rcu_cpu_stall_suppress=1 rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1 rcupdate.rcu_task_stall_timeout=0 rcutree.kthread_prio=99 security=selinux selinux=1" ---- Update the GRUB configuration: @@ -147,6 +180,16 @@ To validate that the parameters are applied after the reboot, the following comm $ cat /proc/cmdline ---- +An additional script can be used to tune the CPU configuration. It performs the following steps: + +* Set the CPU governor to `performance`. +* Disable timer migration to the isolated CPUs. +* Migrate the kdaemon threads to the housekeeping CPUs. +* Set the isolated CPUs latency to the lowest possible value. +* Delay the vmstat updates to 300 seconds.
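Some of the steps above map to standard kernel interfaces. The following is a minimal sketch only and is not the content of `performance-settings.sh`: the `write_value`, `tune_governor` and `tune_sysctls` helpers and the `DRY_RUN` variable are hypothetical names, and the thread-migration and per-CPU latency steps are omitted. By default it performs a dry run that only prints what would be written.

```shell
#!/bin/bash
# Illustrative sketch only; the real performance-settings.sh may differ.
# DRY_RUN=1 (the default) prints the intended writes instead of doing them.
DRY_RUN="${DRY_RUN:-1}"

# write_value <value> <file>: write a value to a sysfs/procfs file (needs root).
write_value() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would write '$1' to $2"
    elif [ -w "$2" ]; then
        echo "$1" > "$2"
    else
        echo "skipping $2 (not writable)" >&2
    fi
}

# Set the cpufreq governor to "performance" on every CPU that exposes one.
tune_governor() {
    local gov
    for gov in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$gov" ] && write_value performance "$gov"
    done
    return 0
}

# Disable timer migration and delay vmstat updates to 300 seconds.
tune_sysctls() {
    write_value 0 /proc/sys/kernel/timer_migration
    write_value 300 /proc/sys/vm/stat_interval
}
```

Calling `tune_governor` and `tune_sysctls` with `DRY_RUN=0` as root would apply the values. These settings do not persist across reboots, which is one reason to run such a script at boot.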
+ +The script is available at https://raw.githubusercontent.com/suse-edge/atip/refs/heads/release-3.1/telco-examples/edge-clusters/dhcp-less/eib/custom/files/performance-settings.sh[SUSE ATIP Github repository - performance-settings.sh]. + [#cni-configuration] === CNI Configuration @@ -324,7 +367,7 @@ spec: serviceAccountName: sriov-device-plugin containers: - name: kube-sriovdp - image: rancher/hardened-sriov-network-device-plugin:v3.5.1-build20231009-amd64 + image: rancher/hardened-sriov-network-device-plugin:v3.7.0-build20240816 imagePullPolicy: IfNotPresent args: - --log-dir=sriovdp @@ -658,7 +701,7 @@ The following steps will show how to enable `DPDK` and how to create `VFs` from [,shell] ---- -$ transactional-update pkg install dpdk22 dpdk22-tools libdpdk-23 +$ transactional-update pkg install dpdk dpdk-tools libdpdk-23 $ reboot ---- @@ -683,7 +726,7 @@ To enable the parameters, add them to the `/etc/default/grub` file: [,shell] ---- -GRUB_CMDLINE_LINUX="intel_iommu=on intel_pstate=passive processor.max_cstate=1 intel_idle.max_cstate=0 iommu=pt usbcore.autosuspend=-1 selinux=0 enforcing=0 nmi_watchdog=0 crashkernel=auto softlockup_panic=0 audit=0 mce=off hugepagesz=1G hugepages=40 hugepagesz=2M hugepages=0 default_hugepagesz=1G kthread_cpus=0,31,32,63 irqaffinity=0,31,32,63 isolcpus=1-30,33-62 skew_tick=1 nohz_full=1-30,33-62 rcu_nocbs=1-30,33-62 rcu_nocb_poll" +GRUB_CMDLINE_LINUX="skew_tick=1 BOOT_IMAGE=/boot/vmlinuz-6.4.0-9-rt root=UUID=77b713de-5cc7-4d4c-8fc6-f5eca0a43cf9 rd.timeout=60 rd.retry=45 console=ttyS1,115200 console=tty0 default_hugepagesz=1G hugepages=0 hugepages=40 hugepagesz=1G hugepagesz=2M ignition.platform.id=openstack intel_iommu=on iommu=pt irqaffinity=0,19,20,39 isolcpus=domain,nohz,managed_irq,1-18,21-38 mce=off nohz=on net.ifnames=0 nmi_watchdog=0 nohz_full=1-18,21-38 nosoftlockup nowatchdog quiet rcu_nocb_poll rcu_nocbs=1-18,21-38 rcupdate.rcu_cpu_stall_suppress=1 rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1 
rcupdate.rcu_task_stall_timeout=0 rcutree.kthread_prio=99 security=selinux selinux=1" ---- Update the GRUB configuration and reboot the system to apply the changes: @@ -774,7 +817,7 @@ Modify the GRUB file `/etc/default/grub` to add them to the kernel command line: [,shell] ---- -GRUB_CMDLINE_LINUX="intel_iommu=on intel_pstate=passive processor.max_cstate=1 intel_idle.max_cstate=0 iommu=pt usbcore.autosuspend=-1 selinux=0 enforcing=0 nmi_watchdog=0 crashkernel=auto softlockup_panic=0 audit=0 mce=off hugepagesz=1G hugepages=40 hugepagesz=2M hugepages=0 default_hugepagesz=1G kthread_cpus=0,31,32,63 irqaffinity=0,31,32,63 isolcpus=1-30,33-62 skew_tick=1 nohz_full=1-30,33-62 rcu_nocbs=1-30,33-62 rcu_nocb_poll" +GRUB_CMDLINE_LINUX="skew_tick=1 BOOT_IMAGE=/boot/vmlinuz-6.4.0-9-rt root=UUID=77b713de-5cc7-4d4c-8fc6-f5eca0a43cf9 rd.timeout=60 rd.retry=45 console=ttyS1,115200 console=tty0 default_hugepagesz=1G hugepages=0 hugepages=40 hugepagesz=1G hugepagesz=2M ignition.platform.id=openstack intel_iommu=on iommu=pt irqaffinity=0,19,20,39 isolcpus=domain,nohz,managed_irq,1-18,21-38 mce=off nohz=on net.ifnames=0 nmi_watchdog=0 nohz_full=1-18,21-38 nosoftlockup nowatchdog quiet rcu_nocb_poll rcu_nocbs=1-18,21-38 rcupdate.rcu_cpu_stall_suppress=1 rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1 rcupdate.rcu_task_stall_timeout=0 rcutree.kthread_prio=99 security=selinux selinux=1" ---- Update the GRUB configuration and reboot the system to apply the changes: @@ -885,7 +928,7 @@ Modify the GRUB file `/etc/default/grub` to add them to the kernel command line: [,shell] ---- -GRUB_CMDLINE_LINUX="intel_iommu=on intel_pstate=passive processor.max_cstate=1 intel_idle.max_cstate=0 iommu=pt usbcore.autosuspend=-1 selinux=0 enforcing=0 nmi_watchdog=0 crashkernel=auto softlockup_panic=0 audit=0 mce=off hugepagesz=1G hugepages=40 hugepagesz=2M hugepages=0 default_hugepagesz=1G kthread_cpus=0,31,32,63 irqaffinity=0,31,32,63 isolcpus=1-30,33-62 skew_tick=1 nohz_full=1-30,33-62 
rcu_nocbs=1-30,33-62 rcu_nocb_poll" +GRUB_CMDLINE_LINUX="skew_tick=1 BOOT_IMAGE=/boot/vmlinuz-6.4.0-9-rt root=UUID=77b713de-5cc7-4d4c-8fc6-f5eca0a43cf9 rd.timeout=60 rd.retry=45 console=ttyS1,115200 console=tty0 default_hugepagesz=1G hugepages=0 hugepages=40 hugepagesz=1G hugepagesz=2M ignition.platform.id=openstack intel_iommu=on iommu=pt irqaffinity=0,19,20,39 isolcpus=domain,nohz,managed_irq,1-18,21-38 mce=off nohz=on net.ifnames=0 nmi_watchdog=0 nohz_full=1-18,21-38 nosoftlockup nowatchdog quiet rcu_nocb_poll rcu_nocbs=1-18,21-38 rcupdate.rcu_cpu_stall_suppress=1 rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1 rcupdate.rcu_task_stall_timeout=0 rcutree.kthread_prio=99 security=selinux selinux=1" ---- Update the GRUB configuration and reboot the system to apply the changes: @@ -1077,9 +1120,10 @@ metadata: name: metallb namespace: kube-system spec: - repo: https://metallb.github.io/metallb/ - chart: metallb + chart: oci://registry.suse.com/edge/3.1/metallb-chart targetNamespace: metallb-system + version: 0.14.9 + createNamespace: true --- apiVersion: helm.cattle.io/v1 kind: HelmChart @@ -1087,9 +1131,10 @@ metadata: name: endpoint-copier-operator namespace: kube-system spec: - repo: https://suse-edge.github.io/endpoint-copier-operator - chart: endpoint-copier-operator + chart: oci://registry.suse.com/edge/3.1/endpoint-copier-operator-chart targetNamespace: endpoint-copier-operator + version: 0.2.1 + createNamespace: true EOF ---- diff --git a/asciidoc/product/atip-requirements.adoc b/asciidoc/product/atip-requirements.adoc index 703d28cb..9ecbeeba 100644 --- a/asciidoc/product/atip-requirements.adoc +++ b/asciidoc/product/atip-requirements.adoc @@ -66,9 +66,11 @@ Some external services like `DHCP`, `DNS`, etc. 
could be required depending on t * **Disconnected / air-gap environment**: In this case, the ATIP nodes will not have Internet IP connectivity and additional services will be required to locally mirror content required by the ATIP directed network provisioning workflow. * **File server**: A file server is used to store the OS images to be provisioned on the ATIP nodes during the directed network provisioning workflow. The `metal^3^` Helm chart can deploy a media server to store the OS images; check the following xref:metal3-media-server[section], but it is also possible to use an existing local webserver. -=== Disabling rebootmgr +=== Disabling systemd services -`rebootmgr` is a service which allows to configure a strategy for reboot when the system has pending updates. +For Telco workloads, it is important to properly disable or configure some of the services running on the nodes to avoid any impact on workload performance (latency). + +* `rebootmgr` is a service that lets you configure a reboot strategy for when the system has pending updates. For Telco workloads, it is really important to disable or configure properly the `rebootmgr` service to avoid the reboot of the nodes in case of updates scheduled by the system, to avoid any impact on the services running on the nodes. [NOTE] @@ -106,3 +108,22 @@ rebootmgrctl strategy off ==== This configuration to set the `rebootmgr` strategy can be automated using the directed network provisioning workflow. For more information, check the <>. ==== + +* `transactional-update` is a service that performs automatic updates controlled by the system. For Telco workloads, it is important to disable the automatic updates to avoid any impact on the services running on the nodes.
+ +To disable the automatic updates, you can run: + +[,shell] +---- +systemctl --now disable transactional-update.timer +systemctl --now disable transactional-update-cleanup.timer +---- + +* `fstrim` is a service that automatically trims the filesystems every week. For Telco workloads, it is important to disable the automatic trim to avoid any impact on the services running on the nodes. + +To disable the automatic trim, you can run: + +[,shell] +---- +systemctl --now disable fstrim.timer +----
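After disabling the timers, their state can be checked with `systemctl is-enabled`. The following is a minimal sketch that assumes the unit names shown above; the `check_units_disabled` helper and the `UNITS` variable are hypothetical names used only for illustration.

```shell
#!/bin/bash
# Timer units named in this section; rebootmgr is handled via rebootmgrctl.
UNITS="transactional-update.timer transactional-update-cleanup.timer fstrim.timer"

# Print the enablement state of each unit and return non-zero
# if any of them is still enabled.
check_units_disabled() {
    local unit state rc=0
    for unit in $UNITS; do
        # is-enabled exits non-zero for disabled units, so ignore its status
        state="$(systemctl is-enabled "$unit" 2>/dev/null || true)"
        if [ "$state" = "enabled" ]; then
            echo "$unit is still enabled"
            rc=1
        else
            echo "$unit: ${state:-unknown}"
        fi
    done
    return "$rc"
}
```

Running `check_units_disabled` on a provisioned node gives a quick confirmation that the image definition or the manual commands took effect.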