node: 2625: add content for GA graduation
Fill missing content

Signed-off-by: Francesco Romani <[email protected]>
ffromani committed Jan 30, 2025
1 parent 5f27b1f commit 5bde288
Showing 1 changed file with 27 additions and 23 deletions: keps/sig-node/2625-cpumanager-policies-thread-placement/README.md
@@ -94,7 +94,8 @@ to consider thread-level allocation, to avoid physical CPU sharing and prevent p

### Non-Goals

* Add new cpumanager policies. The community feedback and the conversations we had when proposing this KEP were all in favor of adding options to fine-tune the behavior of the static policy rather than adding new policies.

## Proposal

@@ -206,18 +207,19 @@ to implement this enhancement.

##### Prerequisite testing updates


##### Unit tests

- `k8s.io/kubernetes/pkg/kubelet/cm/cpumanager`: `20250130` - 85.6% of statements
- `k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/state`: `20250130` - 88.1% of statements
- `k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/topology`: `20250130` - 85.0% of statements

##### Integration tests

- Kubelet features don't usually have integration tests; we use a combination of unit tests and `e2e_node` tests.

##### e2e tests

- <test>: <link to test coverage>

### Test Plan
Expand Down Expand Up @@ -248,7 +250,7 @@ NOTE: Even though the feature gate is enabled by default the user still has to e
The alpha-quality options are hidden by default; the user can use them only if the `CPUManagerPolicyAlphaOptions` feature gate is enabled.
The beta-quality options are visible by default, and the feature gate serves as a positive acknowledgement that non-stable features are being used, while also allowing them to be optionally turned off.
Based on the graduation criteria described below, a policy option will graduate from one group to the other (alpha to beta).
We plan to remove the `CPUManagerPolicyAlphaOptions` and `CPUManagerPolicyBetaOptions` feature gates after all options have graduated to stable, after a feature cycle passes without new planned options, and not before 1.28, to give the work-in-progress options ample time to graduate at least to beta.
- Since the ability to customize the behaviour of the CPUManager static policy and the CPUManager policy option `full-pcpus-only` were both introduced in the 1.22 release and meet the above graduation criterion, `full-pcpus-only` is considered a non-hidden option, i.e. available to be used when explicitly set via the `CPUManagerPolicyOptions` field in the kubelet config or the `--cpu-manager-policy-options` command line argument (see the configuration sketch below).
- The introduction of this new feature gate gives us the ability to move the feature to beta and later stable without implying that all the options are beta or stable.
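
As an illustration, a minimal kubelet configuration sketch that enables the beta-quality `full-pcpus-only` option; the reserved CPU set is an arbitrary example value:

```yaml
# Minimal KubeletConfiguration sketch (example values, not a recommendation).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CPUManagerPolicyBetaOptions: true  # acknowledge use of beta-quality options
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  full-pcpus-only: "true"            # reject allocations that would share a physical core
reservedSystemCPUs: "0,1"            # example value: the static policy needs reserved CPUs
```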

@@ -317,48 +319,50 @@ Kubelet may fail to start. The kubelet may crash.

###### What specific metrics should inform a rollback?

We can use `cpu_manager_pinning_errors_total` to see all the allocation errors, irrespective of the specific reason.
In addition, we can use the logs: the number of pods ending up in the Failed phase with `SMTAlignmentError` could be used to decide a rollback.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Not Applicable.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

### Monitoring requirements

###### How can an operator determine if the feature is in use by workloads?

- Check the metric `container_aligned_compute_resources_count` with the label `boundary=physical_cpu`
- Inspect the kubelet configuration of the nodes: check feature gates and usage of the new options. Workloads exercising the feature are Guaranteed QoS pods with integer CPU requests, as in the sketch below.
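
As a hedged illustration of such a workload (pod name and image are placeholders): a Guaranteed QoS pod whose container requests an integer number of CPUs receives exclusive CPUs under the static policy, and with `full-pcpus-only` the allocation must cover whole physical cores.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-workload                # hypothetical name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9   # placeholder image
    resources:
      requests:
        cpu: "4"                       # integer CPU request -> exclusive CPUs
        memory: "1Gi"
      limits:                          # limits equal to requests -> Guaranteed QoS
        cpu: "4"
        memory: "1Gi"
```

On an SMT-2 node with `full-pcpus-only` enabled, a CPU request that is not a multiple of 2 is rejected with `SMTAlignmentError`.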

###### How can someone using this feature know that it is working for their instance?

- [ ] Events
- Event Reason:
- [ ] API .status
- Condition name:
- Other field:
- [X] Other (treat as last resort)
- Details:
- check metrics and their interplay:
* the metric `container_aligned_compute_resources_count` with the label `boundary=physical_cpu`
* the metric `cpu_manager_pinning_requests_total`
* the metric `cpu_manager_pinning_errors_total`

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

N/A.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [X] Metrics
  - Metric name:
    * `container_aligned_compute_resources_count` with the label `boundary=physical_cpu`
    * `cpu_manager_pinning_requests_total`
    * `cpu_manager_pinning_errors_total`
  - [Optional] Aggregation method:
  - Components exposing the metric: kubelet
- [ ] Other (treat as last resort)
  - Details:
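
For illustration, a hedged sketch of Prometheus rules built on these SLIs, assuming the kubelet metrics are scraped with the conventional `kubelet_` prefix (rule and alert names are hypothetical):

```yaml
groups:
- name: cpumanager-pinning
  rules:
  # fraction of pinning requests that failed over the last 5 minutes
  - record: node:cpu_pinning_error_ratio:rate5m
    expr: |
      rate(kubelet_cpu_manager_pinning_errors_total[5m])
        / rate(kubelet_cpu_manager_pinning_requests_total[5m])
  # sustained pinning failures are a signal to investigate or roll back
  - alert: CPUPinningErrors
    expr: node:cpu_pinning_error_ratio:rate5m > 0
    for: 10m
```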

###### Are there any missing metrics that would be useful to have to improve observability of this feature?


We can detail the pinning errors total with a new metric like `cpu_manager_errors_count` or `container_aligned_compute_resources_failure_count`, using the same labels as we use for `container_aligned_compute_resources_count`.

### Dependencies

@@ -394,7 +398,7 @@ No.

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

### Troubleshooting

@@ -404,11 +408,11 @@ No effect.

###### What are other known failure modes?

Allocation failures can prevent workloads from running. The only remediation is to disable the features and restart the kubelets.

###### What steps should be taken if SLOs are not being met to determine the problem?

Inspect the metrics and possibly the logs to learn the failure reason.

[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
