node: 2625: add content for GA graduation
Fill missing content

Signed-off-by: Francesco Romani <[email protected]>
ffromani committed Jan 30, 2025
1 parent 5f27b1f commit 5bde288
Showing 1 changed file with 27 additions and 23 deletions: keps/sig-node/2625-cpumanager-policies-thread-placement/README.md
@@ -94,7 +94,8 @@ to consider thread-level allocation, to avoid physical CPU sharing and prevent p

### Non-Goals

* Add new cpumanager policies. The community feedback and the conversations we had when proposing this KEP were all in favor of adding options to fine-tune the behavior of the static policy rather than adding new policies.

## Proposal

@@ -206,18 +207,19 @@ to implement this enhancement.

##### Prerequisite testing updates


##### Unit tests

- `k8s.io/kubernetes/pkg/kubelet/cm/cpumanager`: `20250130` - 85.6% of statements
- `k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/state`: `20250130` - 88.1% of statements
- `k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/topology`: `20250130` - 85.0% of statements

##### Integration tests

- Kubelet features don't usually have integration tests; we use a combination of unit tests and `e2e_node` tests.

##### e2e tests

- <test>: <link to test coverage>

### Test Plan
Expand Down Expand Up @@ -248,7 +250,7 @@ NOTE: Even though the feature gate is enabled by default the user still has to e
The alpha-quality options are hidden by default; the user can use them only if the `CPUManagerPolicyAlphaOptions` feature gate is enabled.
The beta-quality options are visible by default, and the feature gate serves as a positive acknowledgement that non-stable features are being used, while also allowing them to be optionally turned off.
Based on the graduation criteria described below, a policy option will graduate from one group to the other (alpha to beta).
We plan to remove the `CPUManagerPolicyAlphaOptions` and `CPUManagerPolicyBetaOptions` feature gates after all options have graduated to stable, after a feature cycle passes without new planned options, and not before 1.28, to give the work-in-progress options ample time to graduate at least to beta.
- Since the ability to customize the behaviour of the CPUManager static policy and the CPUManager policy option `full-pcpus-only` were both introduced in the 1.22 release and meet the above graduation criterion, `full-pcpus-only` is considered a non-hidden option, i.e. available to be used when explicitly set via the `CPUManagerPolicyOptions` field in the kubelet config or the `--cpu-manager-policy-options` command line argument (see the configuration sketch below).
- The introduction of this new feature gate gives us the ability to move the feature to beta and later stable without implying that all the options are beta or stable.
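
As an illustration, a minimal kubelet configuration sketch that enables the beta-quality `full-pcpus-only` option; the reserved CPU set is an arbitrary example value:

```yaml
# Minimal KubeletConfiguration sketch (example values, not a recommendation).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CPUManagerPolicyBetaOptions: true  # acknowledge use of beta-quality options
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  full-pcpus-only: "true"            # reject allocations that would share a physical core
reservedSystemCPUs: "0,1"            # example value: the static policy needs reserved CPUs
```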

@@ -317,48 +319,50 @@ Kubelet may fail to start. The kubelet may crash.

###### What specific metrics should inform a rollback?

We can use `cpu_manager_pinning_errors_total` to see all the allocation errors, irrespective of the specific reason.
In addition, we can use the logs: the number of pods ending up in the Failed phase with `SMTAlignmentError` could be used to decide a rollback.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Not Applicable.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

### Monitoring requirements

###### How can an operator determine if the feature is in use by workloads?

- Check the metric `container_aligned_compute_resources_count` with the label `boundary=physical_cpu`
- Inspect the kubelet configuration of the nodes: check feature gates and usage of the new options. Workloads exercising the feature are Guaranteed QoS pods with integer CPU requests, as in the sketch below.
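
As a hedged illustration of such a workload (pod name and image are placeholders): a Guaranteed QoS pod whose container requests an integer number of CPUs receives exclusive CPUs under the static policy, and with `full-pcpus-only` the allocation must cover whole physical cores.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-workload                # hypothetical name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9   # placeholder image
    resources:
      requests:
        cpu: "4"                       # integer CPU request -> exclusive CPUs
        memory: "1Gi"
      limits:                          # limits equal to requests -> Guaranteed QoS
        cpu: "4"
        memory: "1Gi"
```

On an SMT-2 node with `full-pcpus-only` enabled, a CPU request that is not a multiple of 2 is rejected with `SMTAlignmentError`.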

###### How can someone using this feature know that it is working for their instance?

- [ ] Events
- Event Reason:
- [ ] API .status
- Condition name:
- Other field:
- [X] Other (treat as last resort)
- Details:
- check metrics and their interplay:
* the metric `container_aligned_compute_resources_count` with the label `boundary=physical_cpu`
* the metric `cpu_manager_pinning_requests_total`
* the metric `cpu_manager_pinning_errors_total`

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

N/A.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [X] Metrics
  - Metric name:
    * `container_aligned_compute_resources_count` with the label `boundary=physical_cpu`
    * `cpu_manager_pinning_requests_total`
    * `cpu_manager_pinning_errors_total`
  - [Optional] Aggregation method:
  - Components exposing the metric: kubelet
- [ ] Other (treat as last resort)
  - Details:
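
For illustration, a hedged sketch of Prometheus rules built on these SLIs, assuming the kubelet metrics are scraped with the conventional `kubelet_` prefix (rule and alert names are hypothetical):

```yaml
groups:
- name: cpumanager-pinning
  rules:
  # fraction of pinning requests that failed over the last 5 minutes
  - record: node:cpu_pinning_error_ratio:rate5m
    expr: |
      rate(kubelet_cpu_manager_pinning_errors_total[5m])
        / rate(kubelet_cpu_manager_pinning_requests_total[5m])
  # sustained pinning failures are a signal to investigate or roll back
  - alert: CPUPinningErrors
    expr: node:cpu_pinning_error_ratio:rate5m > 0
    for: 10m
```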

###### Are there any missing metrics that would be useful to have to improve observability of this feature?


We can detail the pinning errors total with a new metric like `cpu_manager_errors_count` or `container_aligned_compute_resources_failure_count`, using the same labels as we use for `container_aligned_compute_resources_count`.

### Dependencies

@@ -394,7 +398,7 @@ No.

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

### Troubleshooting

@@ -404,11 +408,11 @@ No effect.

###### What are other known failure modes?

Allocation failures can prevent workloads from running. The only remediation is to disable the features and restart the kubelets.

###### What steps should be taken if SLOs are not being met to determine the problem?

Inspect the metrics and possibly the logs to learn the failure reason.

[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
