
Node density ignore iterations #60

Merged

merged 1 commit into cloud-bulldozer:main on Aug 15, 2024

Conversation

paigerube14
Collaborator

Type of change

  • Refactor
  • New feature
  • Bug fix
  • Optimization
  • Documentation Update

Description

This change will allow us to match uuids regardless of their jobIterations. This is important for node-density because the number of iterations is calculated from how many nodes are currently on the cluster, and that number can fluctuate.
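The matching change can be sketched as follows. This is a hypothetical illustration, not Orion's actual code: `match_key`, the field names, and the sample metadata are all assumptions made for the example.

```python
# Hypothetical sketch (not Orion's actual implementation): when grouping
# comparable runs, drop jobIterations from the match criteria so that runs
# on clusters with the same node count compare even when their calculated
# iteration counts differ slightly.
def match_key(metadata, ignore_iterations=False):
    """Return the metadata subset used to decide whether two runs match."""
    if ignore_iterations:
        return {k: v for k, v in metadata.items() if k != "jobIterations"}
    return dict(metadata)

# Two runs on identical 6-node clusters, differing only in iteration count.
run_a = {"platform": "AWS", "workerNodesCount": 6, "jobIterations": 1470}
run_b = {"platform": "AWS", "workerNodesCount": 6, "jobIterations": 1465}

# Strict matching treats them as different; ignoring iterations groups them.
assert match_key(run_a) != match_key(run_b)
assert match_key(run_a, ignore_iterations=True) == match_key(run_b, ignore_iterations=True)
```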

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.

Testing

Running the old way produces fewer results than running with the new flag:

orion cmd --config examples/trt-external-payload-node-density.yaml --lookback 5d --hunter-analyze

,uuid,timestamp,podReadyLatency_P99,apiserverCPU_avg,ovnCPU_avg,etcdCPU_avg,kubelet_avg,buildUrl
0,0d472812-58c4-4d17-9241-595e54e789a1,2024-08-04T05:20:49.559886961Z,3000,8.346889860243925,2.4993164237465395,8.512982301910718,24.588333408037823,https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.17-nightly-x86-payload-control-plane-6nodes/1819941002970927104
1,73a5ddf9-1ad9-463a-866c-0bb7ee33fa2d,2024-08-06T09:53:53.960806551Z,2000,9.917386528040515,2.9165261534623683,8.232103824615479,23.14861095448335,https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.17-nightly-x86-payload-control-plane-6nodes/1820732996609642496
2,aba6b499-1e6b-455b-94bf-f1f360387c25,2024-08-06T15:42:13.289451287Z,3000,8.68606343644766,2.8989181143536698,7.900732731377637,24.709259218639797,https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.17-nightly-x86-payload-control-plane-6nodes/1820818543520780288
3,707cf66f-c7de-473d-9155-a35fb81b6b44,2024-08-06T19:51:56.020648472Z,3000,9.36453001043119,2.842460772863333,8.242996867332193,24.676388904452324,https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.17-nightly-x86-payload-control-plane-6nodes/1820883258565464064
4,6d99bef0-c342-4ccb-9d04-2ef7a54c479f,2024-08-07T01:54:37.370508933Z,2000,8.907801432446286,2.562429538004141,7.662414407150613,23.09930546581745,https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.17-nightly-x86-payload-control-plane-6nodes/1820972948597510144
5,e212ce7c-ddf2-4967-89d4-52a637f092ff,2024-08-07T14:43:08.617067442Z,3000,8.03217322994955,2.490557001078843,8.363801335295042,25.229444408416747,https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.17-nightly-x86-payload-control-plane-6nodes/1821167776920768512

orion cmd --config examples/trt-external-payload-node-density.yaml --lookback 5d --hunter-analyze --node-count True

,uuid,timestamp,podReadyLatency_P99,apiserverCPU_avg,ovnCPU_avg,etcdCPU_avg,kubelet_avg,buildUrl
0,a65533b2-b96b-4975-a683-5c8c41317db9,2024-08-02T22:44:13.115031954Z,3000,9.05506926860005,2.6793121076059183,7.832702014181349,24.659876682140208,https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.17-nightly-x86-payload-control-plane-6nodes/1819476387262631936
1,3bf39005-d8b3-4167-a003-63c258582527,2024-08-03T03:40:00.061127136Z,2000,9.333829958437542,3.0431592265737812,8.905316504221114,25.346031677155267,https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.17-nightly-x86-payload-control-plane-6nodes/1819552804872654848
2,ca470b99-684f-4804-8040-245469d36a0a,2024-08-03T12:06:39.550167246Z,2000,10.04589718492103,2.938977014894182,8.667146306107009,22.56984125432514,https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.17-nightly-x86-payload-control-plane-6nodes/1819681099010281472
3,0d472812-58c4-4d17-9241-595e54e789a1,2024-08-04T05:20:49.559886961Z,3000,8.346889860243925,2.4993164237465395,8.512982301910718,24.588333408037823,https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.17-nightly-x86-payload-control-plane-6nodes/1819941002970927104
4,e9b93114-6cc1-412d-b628-c046e7a1d4ca,2024-08-05T08:31:46.013914888Z,3000,8.460509004894508,2.836086990957304,7.733555127349165,22.969444140791893,https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.17-nightly-x86-payload-control-plane-6nodes/1820348477922611200
5,22d07632-578d-46ac-82a2-04b2ece66283,2024-08-05T22:20:36.740563901Z,2000,7.813337978324853,2.595532922870286,8.23953664302826,24.6055556624024,https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.17-nightly-x86-payload-control-plane-6nodes/1820557122354548736
6,c2cf30e0-ab8c-4714-b78a-1645cdfa411e,2024-08-06T05:35:24.554795997Z,2000,9.24912150722273,2.582041348531111,8.039307292964724,24.625925969194483,https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.17-nightly-x86-payload-control-plane-6nodes/1820667850361147392
7,73a5ddf9-1ad9-463a-866c-0bb7ee33fa2d,2024-08-06T09:53:53.960806551Z,2000,9.917386528040515,2.9165261534623683,8.232103824615479,23.14861095448335,https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.17-nightly-x86-payload-control-plane-6nodes/1820732996609642496
8,aba6b499-1e6b-455b-94bf-f1f360387c25,2024-08-06T15:42:13.289451287Z,3000,8.68606343644766,2.8989181143536698,7.900732731377637,24.709259218639797,https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.17-nightly-x86-payload-control-plane-6nodes/1820818543520780288
9,707cf66f-c7de-473d-9155-a35fb81b6b44,2024-08-06T19:51:56.020648472Z,3000,9.36453001043119,2.842460772863333,8.242996867332193,24.676388904452324,https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.17-nightly-x86-payload-control-plane-6nodes/1820883258565464064
10,6d99bef0-c342-4ccb-9d04-2ef7a54c479f,2024-08-07T01:54:37.370508933Z,2000,8.907801432446286,2.562429538004141,7.662414407150613,23.09930546581745,https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.17-nightly-x86-payload-control-plane-6nodes/1820972948597510144
11,2cbbe23a-bfca-4b1c-831f-1402fa443eb5,2024-08-07T06:29:02.30197508Z,2000,8.41378196755131,2.6346652609732533,8.286598180915103,23.116049263212417,https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.17-nightly-x86-payload-control-plane-6nodes/1821044084718964736
12,e212ce7c-ddf2-4967-89d4-52a637f092ff,2024-08-07T14:43:08.617067442Z,3000,8.03217322994955,2.490557001078843,8.363801335295042,25.229444408416747,https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.17-nightly-x86-payload-control-plane-6nodes/1821167776920768512

@jtaleric
Member

jtaleric commented Aug 7, 2024

ah! Nice! yeah this is something that @shashank-boyapally and I have spoken about in the past, given the number of pods can change based on what is installed, so iterations might (very likely) drift a bit. As long as the node count is the same, we should be somewhere in the ballpark!

@jtaleric
Member

jtaleric commented Aug 7, 2024

@vishnuchalla can you speak to the failures here? I'm not familiar with what's broken or why.

@vishnuchalla
Collaborator

> @vishnuchalla can you speak to the failures here? I'm not familiar with what's broken or why.

We will need to get the ES secret updated and re-trigger the tests. Will keep you posted on it.

@vishnuchalla
Collaborator

vishnuchalla commented Aug 8, 2024

> ah! Nice! yeah this is something that @shashank-boyapally and I have spoken about in the past, given the number of pods can change based on what is installed, so iterations might (very likely) drift a bit. As long as the node count is the same, we should be somewhere in the ballpark!

This might take us somewhere close to the solution, but we will only get exact apples-to-apples runs once we start filtering based on the iterations that are calculated at runtime. To achieve that, I remember us previously discussing populating the iterations count in the index.sh script.

@jtaleric
Member

jtaleric commented Aug 8, 2024

> ah! Nice! yeah this is something that @shashank-boyapally and I have spoken about in the past, given the number of pods can change based on what is installed, so iterations might (very likely) drift a bit. As long as the node count is the same, we should be somewhere in the ballpark!

> This might take us somewhere close to the solution, but we will only get exact apples-to-apples runs once we start filtering based on the iterations that are calculated at runtime. To achieve that, I remember us previously discussing populating the iterations count in the index.sh script.

Well, even populating it in the perf_scale_ci index won't address this problem.

For example, there could be situations where we are +/- 1 pod off (even with the same version, platform, etc.), so the iteration calculation for node-density workloads would differ slightly. This change would allow us to just say: match runs with the same node count... Correct me if I am off here @paigerube14
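The +/- 1 pod drift described above can be illustrated with a sizing formula. This is an assumption modeled on how kube-burner-style node-density workloads derive iterations from live cluster state; the exact formula and numbers are hypothetical, not taken from this PR:

```python
# Assumed node-density sizing (hypothetical, modeled on kube-burner's
# approach): the iteration count is derived from live cluster state.
def node_density_iterations(pods_per_node, worker_count, pods_already_running):
    # Target total pods for the cluster, minus pods already running on it.
    return pods_per_node * worker_count - pods_already_running

# Two runs on identical 6-node clusters, differing by one pre-existing pod,
# end up with different jobIterations even though the node count matches.
run_a = node_density_iterations(245, 6, 200)  # 1470 - 200 = 1270
run_b = node_density_iterations(245, 6, 201)  # 1470 - 201 = 1269
assert run_a != run_b
```

This is why matching on node count rather than jobIterations keeps such runs comparable.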

@vishnuchalla
Collaborator

> ah! Nice! yeah this is something that @shashank-boyapally and I have spoken about in the past, given the number of pods can change based on what is installed, so iterations might (very likely) drift a bit. As long as the node count is the same, we should be somewhere in the ballpark!

> This might take us somewhere close to the solution, but we will only get exact apples-to-apples runs once we start filtering based on the iterations that are calculated at runtime. To achieve that, I remember us previously discussing populating the iterations count in the index.sh script.

> Well, even populating it in the perf_scale_ci index won't address this problem.

> For example, there could be situations where we are +/- 1 pod off (even with the same version, platform, etc.), so the iteration calculation for node-density workloads would differ slightly. This change would allow us to just say: match runs with the same node count... Correct me if I am off here @paigerube14

If our idea was to filter runs based on the number of nodes, then I am good with it. My only question: even though we have the same number of nodes, there can be a difference in the number of iterations calculated at runtime (especially for node-density runs). So can we consider runs with the same number of nodes an apples-to-apples comparison even if they were executed with a different number of iterations? Or do we want to make code changes to match iterations in a follow-up PR?
cc: @afcollins @dry923 @rsevilla87

@vishnuchalla
Collaborator

vishnuchalla commented Aug 8, 2024

> @vishnuchalla can you speak to the failures here? I'm not familiar with what's broken or why.

> We will need to get the ES secret updated and re-trigger the tests. Will keep you posted on it.

After conducting a set of experiments, I noticed that secrets do not get propagated into forks and aren't available during pull requests; they only work against the main repo. So I am going to change the CI tests workflow to trigger on push. We will still be able to detect failures from a PR, but only after it merges into main, if that sounds good.

Fix PR: #64

@jtaleric
Copy link
Member

lgtm

Collaborator

@vishnuchalla left a comment


lgtm

Contributor

@shashank-boyapally left a comment


Thanks for the add, lgtm!

@vishnuchalla merged commit 224bc3e into cloud-bulldozer:main Aug 15, 2024
3 of 4 checks passed