feat/pdb: adding PDB to server and high priorityClass for jobs/plugins #100

Open
wants to merge 1 commit into
base: main

mtulio
Contributor

@mtulio mtulio commented Mar 8, 2024

  • run cleanup: isolating the pre-run functions for check and setup
  • creating a PDB for the sonobuoy server
  • setting a high priority class for jobs/plugins

TODO:

  • Find/assign Jira card

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 8, 2024
@mtulio mtulio marked this pull request as ready for review March 8, 2024 01:26

openshift-ci bot commented Mar 8, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from mtulio. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 8, 2024
@mtulio mtulio force-pushed the feat-pdb branch 3 times, most recently from 1ea5706 to 1233944 Compare March 8, 2024 01:33
Contributor Author

mtulio commented Mar 8, 2024

Holding until a Jira card is assigned.
/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 8, 2024
@mtulio mtulio added kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Mar 8, 2024
Contributor Author

mtulio commented Mar 8, 2024

/assign @jcpowermac @rvanderp3

Contributor Author

mtulio commented Mar 8, 2024

$ oc get pdb -n opct
NAME          MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
opct-server   1               N/A               0                     26m

$ oc get all -n opct
Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
NAME                                                                   READY   STATUS      RESTARTS   AGE
pod/sonobuoy                                                           1/1     Running     0          26m
pod/sonobuoy-05-openshift-cluster-upgrade-job-f4de252998ce49f6         0/3     Completed   0          26m
pod/sonobuoy-10-openshift-kube-conformance-job-35c02a39d7564221        3/3     Running     0          26m
pod/sonobuoy-20-openshift-conformance-validated-job-fd0d9f98dfef4699   3/3     Running     0          26m
pod/sonobuoy-80-openshift-tests-replay-job-c883e974e4e54fb4            3/3     Running     0          26m
pod/sonobuoy-99-openshift-artifacts-collector-job-27488b1ca3624561     3/3     Running     0          26m

NAME                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/sonobuoy-aggregator   ClusterIP   172.30.12.181   <none>        8080/TCP   26m
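
For reference, the `opct-server` PDB listed above (namespace `opct`, MIN AVAILABLE 1) could be described by a manifest along these lines. This is only a sketch: the structs are hand-written mirrors of the `policy/v1` fields so the example needs nothing beyond the standard library, `newServerPDB` is a hypothetical helper, and the selector label is a placeholder, not the label the PR actually uses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal hand-written mirrors of the policy/v1 PodDisruptionBudget fields
// used here. Real code would import k8s.io/api/policy/v1 instead.
type objectMeta struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
}

type labelSelector struct {
	MatchLabels map[string]string `json:"matchLabels"`
}

type pdbSpec struct {
	MinAvailable int           `json:"minAvailable"`
	Selector     labelSelector `json:"selector"`
}

type podDisruptionBudget struct {
	APIVersion string     `json:"apiVersion"`
	Kind       string     `json:"kind"`
	Metadata   objectMeta `json:"metadata"`
	Spec       pdbSpec    `json:"spec"`
}

// newServerPDB (hypothetical helper) builds a PDB named opct-server that
// requires at least one matching pod to stay available.
func newServerPDB(namespace string, selector map[string]string) podDisruptionBudget {
	return podDisruptionBudget{
		APIVersion: "policy/v1",
		Kind:       "PodDisruptionBudget",
		Metadata:   objectMeta{Name: "opct-server", Namespace: namespace},
		Spec: pdbSpec{
			MinAvailable: 1,
			Selector:     labelSelector{MatchLabels: selector},
		},
	}
}

func main() {
	// Assumption: "app=example-server" stands in for whatever label the
	// server pod really carries.
	pdb := newServerPDB("opct", map[string]string{"app": "example-server"})
	out, err := json.MarshalIndent(pdb, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

With a single server pod and `minAvailable: 1`, the budget leaves no room for voluntary disruptions, which matches the `ALLOWED DISRUPTIONS 0` column above.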

@jcpowermac
Collaborator

@mtulio the check for the taint would need to change, would you do that in this pr or follow up?

tolerations, err := json.Marshal([]v1.Toleration{{

Decoupling the pre-run functions for check and setup in the `run` command.

Implementing a PDB in the pre-run setup to protect the sonobuoy server
from disruptions.

Setting priority classes for plugins/jobs.

Making the dedicated node and taints optional using CLI flags.
@mtulio
Contributor Author

mtulio commented Mar 8, 2024

> @mtulio the check for the taint would need to change, would you do that in this pr or follow up?
>
> tolerations, err := json.Marshal([]v1.Toleration{{

@jcpowermac what about making the dedicated node (node selector) and the taints optional, so they can be enabled using flags?

(please take a look at the latest version)
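
To illustrate the idea, here is a rough sketch of building the tolerations only when a flag asks for the dedicated node. Everything named here is an assumption for illustration: the `--dedicated` flag, the `buildTolerations` helper, the `example.com/dedicated` taint key, and the local `toleration` struct (a stand-in for `v1.Toleration` so the snippet runs with the standard library alone).

```go
package main

import (
	"encoding/json"
	"flag"
	"fmt"
	"os"
)

// toleration mirrors the JSON fields of v1.Toleration used in this sketch.
type toleration struct {
	Key      string `json:"key"`
	Operator string `json:"operator,omitempty"`
	Value    string `json:"value,omitempty"`
	Effect   string `json:"effect,omitempty"`
}

// buildTolerations (hypothetical helper) returns the JSON-encoded tolerations
// for plugin pods. When the dedicated node is disabled, the list is empty.
func buildTolerations(dedicated bool, taintKey string) ([]byte, error) {
	tolerations := []toleration{}
	if dedicated {
		tolerations = append(tolerations, toleration{
			Key:      taintKey,
			Operator: "Exists",
			Effect:   "NoSchedule",
		})
	}
	return json.Marshal(tolerations)
}

func main() {
	// Assumption: the flag name and the taint key are placeholders.
	fs := flag.NewFlagSet("run", flag.ExitOnError)
	dedicated := fs.Bool("dedicated", false, "schedule plugins on a dedicated, tainted node")
	_ = fs.Parse(os.Args[1:])

	out, err := buildTolerations(*dedicated, "example.com/dedicated")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The taint check mentioned above would then only need to run when the flag is set.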

Contributor Author

mtulio commented Mar 9, 2024

This solution does not seem to be fully healthy yet. When running the new version (no taints, no selectors), the conformance test pod got evicted, leaving the CLI stuck reporting progress/running (having the CLI check the pod is a separate issue/improvement):

Fri, 08 Mar 2024 21:07:14 -03|1h21m21.753206165s> Global Status: running
JOB_NAME                           | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
05-openshift-cluster-upgrade       | complete   |            | 0/0 (0 failures)          | complete                                          
10-openshift-kube-conformance      | running    |            | 35/382 (0 failures)       | status=running=T/C/P/F/S=382/35/35/0/0            
20-openshift-conformance-validated | running    |            | 0/3817 (0 failures)       | status=waiting-for=10-openshift-kube-conformance=(0/-347/0)=[6/1080]
80-openshift-tests-replay          | running    |            | 0/0 (0 failures)          | status=blocked-by=20-openshift-conformance-validated=(0/-3817/0)=[0/1080]
99-openshift-artifacts-collector   | running    |            | 0/0 (0 failures)          | status=blocked-by=80-openshift-tests-replay=(0/0/0)=[0/1080]

$ oc get pods -n opct 
NAME                                                             READY   STATUS    RESTARTS   AGE
sonobuoy                                                         1/1     Running   0          84m
sonobuoy-80-openshift-tests-replay-job-f82c92f0b52c4d7d          3/3     Running   0          84m
sonobuoy-99-openshift-artifacts-collector-job-ed24cc9b0aa94903   3/3     Running   0          84m


$ oc get events -n opct | grep -i evic
77m         Normal   TaintManagerEviction   pod/sonobuoy-05-openshift-cluster-upgrade-job-2566af3ddb8440df         Marking for deletion Pod opct/sonobuoy-05-openshift-cluster-upgrade-job-2566af3ddb8440df
77m         Normal   TaintManagerEviction   pod/sonobuoy-05-openshift-cluster-upgrade-job-2566af3ddb8440df         Cancelling deletion of Pod opct/sonobuoy-05-openshift-cluster-upgrade-job-2566af3ddb8440df
77m         Normal   TaintManagerEviction   pod/sonobuoy-10-openshift-kube-conformance-job-cce62b7d11974bf2        Marking for deletion Pod opct/sonobuoy-10-openshift-kube-conformance-job-cce62b7d11974bf2
77m         Normal   TaintManagerEviction   pod/sonobuoy-10-openshift-kube-conformance-job-cce62b7d11974bf2        Cancelling deletion of Pod opct/sonobuoy-10-openshift-kube-conformance-job-cce62b7d11974bf2
77m         Normal   TaintManagerEviction   pod/sonobuoy-20-openshift-conformance-validated-job-80cea76bac2d4218   Marking for deletion Pod opct/sonobuoy-20-openshift-conformance-validated-job-80cea76bac2d4218
77m         Normal   TaintManagerEviction   pod/sonobuoy-20-openshift-conformance-validated-job-80cea76bac2d4218   Cancelling deletion of Pod opct/sonobuoy-20-openshift-conformance-validated-job-80cea76bac2d4218

$ oc get pod sonobuoy-99-openshift-artifacts-collector-job-ed24cc9b0aa94903 -n opct -o yaml |grep -i prio
  preemptionPolicy: PreemptLowerPriority
  priority: 2000001000
  priorityClassName: system-node-critical

$ oc get ns opct -o yaml | yq4 ea .metadata.annotations
openshift.io/sa.scc.mcs: s0:c26,c25
openshift.io/sa.scc.supplemental-groups: 1000700000/10000
openshift.io/sa.scc.uid-range: 1000700000/10000
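
The pod check mentioned above could start from something like the following: cross-check the jobs that still report `running` against the pods that actually exist. This is a sketch under assumptions, not the CLI's implementation. `missingPods` is a hypothetical helper, and the `sonobuoy-<job>-job-<id>` naming is inferred from the pod names in the output above.

```go
package main

import (
	"fmt"
	"strings"
)

// missingPods (hypothetical helper) returns the jobs that report "running"
// but have no pod named "sonobuoy-<job>-job-<id>" in the namespace.
func missingPods(runningJobs, podNames []string) []string {
	var missing []string
	for _, job := range runningJobs {
		prefix := "sonobuoy-" + job + "-job-"
		found := false
		for _, pod := range podNames {
			if strings.HasPrefix(pod, prefix) {
				found = true
				break
			}
		}
		if !found {
			missing = append(missing, job)
		}
	}
	return missing
}

func main() {
	// Jobs reported as running in the status table above.
	running := []string{
		"10-openshift-kube-conformance",
		"20-openshift-conformance-validated",
		"80-openshift-tests-replay",
		"99-openshift-artifacts-collector",
	}
	// Pods returned by `oc get pods -n opct` above.
	pods := []string{
		"sonobuoy",
		"sonobuoy-80-openshift-tests-replay-job-f82c92f0b52c4d7d",
		"sonobuoy-99-openshift-artifacts-collector-job-ed24cc9b0aa94903",
	}
	for _, job := range missingPods(running, pods) {
		fmt.Printf("job %s reports running but has no pod\n", job)
	}
}
```

With the data above, the jobs flagged are `10-openshift-kube-conformance` and `20-openshift-conformance-validated`.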

Contributor Author

mtulio commented Mar 15, 2024

The higher priority class for jobs managed by Sonobuoy does not seem to be enough to keep the environment stable. In a regular deployment (without taints and node selector), we keep seeing occasional evictions of job pods, and the execution gets stuck in a 3x3 deployment.

  • Provider/Platform type: AWS/AWS 4.15, lost the kube conformance pod
Fri, 15 Mar 2024 16:54:14 -03|12.989463236s> Global Status: running
JOB_NAME                           | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
05-openshift-cluster-upgrade       | complete   |            | 0/0 (0 failures)          | complete                                          
10-openshift-kube-conformance      | running    |            | 280/390 (0 failures)      | status=running=T/C/P/F/S=390/280/280/0/0          
20-openshift-conformance-validated | running    |            | 0/0 (0 failures)          | status=waiting-for=10-openshift-kube-conformance=(0/-110/0)=[149/1080]
80-openshift-tests-replay          | running    |            | 0/0 (0 failures)          | status=blocked-by=20-openshift-conformance-validated=(0/0/0)=[277/1080]
99-openshift-artifacts-collector   | running    |            | 0/0 (0 failures)          | status=blocked-by=80-openshift-tests-replay=(0/0/0)=[0/1080]
Fri, 15 Mar 2024 16:54:24 -03|23.023669585s> Global Status: running
JOB_NAME                           | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
05-openshift-cluster-upgrade       | complete   |            | 0/0 (0 failures)          | complete                                          
10-openshift-kube-conformance      | running    |            | 280/390 (0 failures)      | status=running=T/C/P/F/S=390/280/280/0/0          
20-openshift-conformance-validated | running    |            | 0/0 (0 failures)          | status=waiting-for=10-openshift-kube-conformance=(0/-110/0)=[150/1080]
80-openshift-tests-replay          | running    |            | 0/0 (0 failures)          | status=blocked-by=20-openshift-conformance-validated=(0/0/0)=[278/1080]
99-openshift-artifacts-collector   | running    |            | 0/0 (0 failures)          | status=blocked-by=80-openshift-tests-replay=(0/0/0)=[0/1080]


$ oc get pods -n opct
NAME                                                               READY   STATUS    RESTARTS   AGE
sonobuoy                                                           1/1     Running   0          49m
sonobuoy-20-openshift-conformance-validated-job-6860eaa52f0d4525   3/3     Running   0          49m
sonobuoy-80-openshift-tests-replay-job-5f781686c73348a7            3/3     Running   0          49m
sonobuoy-99-openshift-artifacts-collector-job-f3a9e55d60b94f97     2/2     Running   0          49m

$ oc get nodes
NAME                          STATUS   ROLES                  AGE   VERSION
ip-10-0-11-160.ec2.internal   Ready    worker                 61m   v1.29.1+5a4819c
ip-10-0-114-74.ec2.internal   Ready    edge,worker            58m   v1.29.1+5a4819c
ip-10-0-25-120.ec2.internal   Ready    control-plane,master   71m   v1.29.1+5a4819c
ip-10-0-46-193.ec2.internal   Ready    worker                 61m   v1.29.1+5a4819c
ip-10-0-57-199.ec2.internal   Ready    control-plane,master   71m   v1.29.1+5a4819c
ip-10-0-72-216.ec2.internal   Ready    worker                 61m   v1.29.1+5a4819c
ip-10-0-84-190.ec2.internal   Ready    control-plane,master   71m   v1.29.1+5a4819c
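
Since the progress counter stays at `280/390` between the snapshots above, a stuck run could in principle be flagged by watching for a counter that stops moving. The sketch below is only an illustration: `parseProgress` and `stallDetector` are hypothetical names, and the limit of 2 polls is for the demo only. A real threshold would have to be far larger, because a single test can legitimately run for a long time.

```go
package main

import (
	"fmt"
	"strings"
)

// parseProgress extracts done/total from a PROGRESS cell such as
// "280/390 (0 failures)".
func parseProgress(cell string) (done, total int, err error) {
	_, err = fmt.Sscanf(strings.TrimSpace(cell), "%d/%d", &done, &total)
	return done, total, err
}

// stallDetector (hypothetical) counts consecutive polls in which the number
// of completed tests did not change.
type stallDetector struct {
	last  int // completed count seen on the previous poll, -1 before any poll
	same  int // consecutive polls without change
	limit int // polls without change before the job is considered stalled
}

// observe records one poll and reports whether the job looks stalled.
func (d *stallDetector) observe(done int) bool {
	if done == d.last {
		d.same++
	} else {
		d.last = done
		d.same = 0
	}
	return d.same >= d.limit
}

func main() {
	d := &stallDetector{last: -1, limit: 2}
	polls := []string{"280/390 (0 failures)", "280/390 (0 failures)", "280/390 (0 failures)"}
	for i, cell := range polls {
		done, total, err := parseProgress(cell)
		if err != nil {
			panic(err)
		}
		fmt.Printf("poll %d: %d/%d stalled=%v\n", i+1, done, total, d.observe(done))
	}
}
```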


Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test.

3 participants