OCPBUGS-35911: E2E: Add test to verify runc process excludes the cpus used by pod. #1088

SargunNarula · 2024-06-21T10:06:02Z

Adding a test to verify that runc does not use CPUs assigned to guaranteed pods.

Original bug link - https://bugzilla.redhat.com/show_bug.cgi?id=1910386

openshift-ci-robot · 2024-06-21T10:06:06Z

@SargunNarula: This pull request explicitly references no jira issue.

In response to this:

Adding a test to verify that runc does not use CPUs assigned to guaranteed pods.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2024-06-21T10:09:45Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SargunNarula
Once this PR has been reviewed and has the lgtm label, please assign marsik for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2024-06-21T11:46:15Z

@SargunNarula: This pull request references Jira Issue OCPBUGS-35911, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.17.0) matches configured target version for branch (4.17.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.


openshift-ci-robot · 2024-06-21T11:46:49Z

@SargunNarula: This pull request references Jira Issue OCPBUGS-35911, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.17.0) matches configured target version for branch (4.17.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.


test/e2e/performanceprofile/functests/1_performance/cpu_management.go

shajmakh

Thanks for this. I added a few initial comments below.

shajmakh · 2024-06-26T12:30:43Z

test/e2e/performanceprofile/functests/1_performance/cpu_management.go

+
+		AfterEach(func() {
+			cmd := []string{"rm", "-f", "/rootfs/var/roothome/create"}
+			nodes.ExecCommand(ctx, workerRTNode, cmd)


let's add error handling here please

The test fails at this line with a context deadline error every time an Expect on the error is added. I have manually verified that if the Expect is removed, the rm command completes on the node. I tried a custom context with context.WithDeadline and context.WithTimeout, but the test still failed. The same happens at the next error-handling site as well.

I wouldn't completely ignore the error though; a warning report or a klog.Error is fine.
As a workaround for handling the error, we can also verify that the file was indeed deleted after this command is executed.

+1 to what Shereen wrote

I have removed the intermediate file logic, so this command is no longer required. I added error handling in the other places.
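The workaround suggested in this thread (report the error instead of failing on it, then confirm the file is really gone) can be sketched with the standard library alone. This is a minimal sketch under assumptions, not the code from the PR: `removeAndVerify` and its `run` callback are hypothetical names, `run` stands in for the `nodes.ExecCommand` call, `log.Printf` stands in for `klog.Error`, and the check runs against the local filesystem rather than a node.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// removeAndVerify runs a removal command and tolerates its error, but
// still fails if the file is present afterwards. run is a hypothetical
// stand-in for the call that executes "rm -f" on the node.
func removeAndVerify(path string, run func() error) error {
	if err := run(); err != nil {
		// Report the error instead of silently dropping it.
		log.Printf("warning: removing %q reported: %v", path, err)
	}
	_, err := os.Stat(path)
	if err == nil {
		return fmt.Errorf("file %q still exists after removal", path)
	}
	if !errors.Is(err, os.ErrNotExist) {
		return fmt.Errorf("cannot verify removal of %q: %w", path, err)
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "cleanup")
	if err != nil {
		log.Fatal(err)
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "create")
	if err := os.WriteFile(path, []byte("data"), 0o600); err != nil {
		log.Fatal(err)
	}

	// The command deletes the file but also reports a spurious error,
	// mimicking the context deadline error described in this thread.
	err = removeAndVerify(path, func() error {
		_ = os.Remove(path)
		return errors.New("context deadline exceeded")
	})
	fmt.Println("cleanup error:", err)
}
```

The design choice is that the observable effect (the file is absent) decides success, while the command's own error is only logged.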

test/e2e/performanceprofile/functests/1_performance/cpu_management.go

shajmakh · 2024-06-26T12:55:56Z

test/e2e/performanceprofile/functests/1_performance/cpu_management.go

+			for _, containerID := range containerIDs {
+				path := fmt.Sprintf("/rootfs/var/run/containers/storage/overlay-containers/%s/userdata/config.json", containerID)
+				cmd := []string{"/bin/bash", "-c", fmt.Sprintf("cat %s >> /rootfs/var/roothome/create", path)}
+				nodes.ExecCommand(ctx, workerRTNode, cmd)


let's please add error handling in all places

I added a comment above about hitting the context deadline error; it applies here too.

With the new logic, no file handling is performed. Error handling has still been added where required.

test/e2e/performanceprofile/functests/1_performance/cpu_management.go

mrniranjan · 2024-09-05T14:50:38Z

looks good to me from my side.

test/e2e/performanceprofile/functests/1_performance/cpu_management.go

shajmakh · 2024-09-24T12:25:51Z

test/e2e/performanceprofile/functests/1_performance/cpu_management.go

+			deleteTestPod(ctx, bestEffortPod)
+		})
+
+		It("[test_id: 74461] Verify that runc excludes the cpus used by guaranteed pod", func() {


let's please add By() and logs where they fit, to help follow the flow of the test when debugging

ffromani

The testing logic could perhaps be simplified, but I have no major objections.
Questions and possible improvements are inline.

ffromani · 2024-09-25T09:05:12Z

test/e2e/performanceprofile/functests/1_performance/cpu_management.go

+			Expect(err).ToNot(HaveOccurred())
+			for _, containerID := range containerIDs {
+				path := fmt.Sprintf("/rootfs/var/run/containers/storage/overlay-containers/%s/userdata/config.json", containerID)
+				cmd := []string{"/bin/bash", "-c", fmt.Sprintf("cat %s >> /rootfs/var/roothome/create", path)}


Why do we need to copy the data into an intermediate file? Can't we read it directly and store it in a variable in this test?
If we do need an intermediate file, since we write in append mode, let's make sure the file is actually empty before the test starts. If a bug prevents the file from being created properly, we will end up with corrupted state that is very hard to notice.
Another (possibly better) approach is to create a temp file for this test.

I think the simpler solution is to just avoid the intermediate file

Thanks @ffromani for suggesting the simpler solution. Revised in the latest commit.
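Reading the data directly, as suggested above, could look like the sketch below: the content of config.json is taken from the command output and decoded in memory, with no file written on the node. This is an illustration under assumptions, not the revised code from the PR (which, judging by `hostnameMatches` and `cpusMatches`, extracts the values with regular expressions). The names `ociConfig` and `parseConfig` are hypothetical; the field paths `hostname` and `linux.resources.cpu.cpus` follow the OCI runtime configuration layout.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// ociConfig models only the fields of an OCI runtime config.json that
// this sketch needs: the hostname and the cpuset of the container.
type ociConfig struct {
	Hostname string `json:"hostname"`
	Linux    struct {
		Resources struct {
			CPU struct {
				Cpus string `json:"cpus"`
			} `json:"cpu"`
		} `json:"resources"`
	} `json:"linux"`
}

// parseConfig decodes raw config.json content held in memory and
// returns the hostname and the cpuset list string.
func parseConfig(raw []byte) (string, string, error) {
	var cfg ociConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return "", "", fmt.Errorf("decoding config.json: %w", err)
	}
	if cfg.Hostname == "" {
		return "", "", fmt.Errorf("config.json has no hostname")
	}
	if cfg.Linux.Resources.CPU.Cpus == "" {
		return "", "", fmt.Errorf("config.json has no linux.resources.cpu.cpus")
	}
	return cfg.Hostname, cfg.Linux.Resources.CPU.Cpus, nil
}

func main() {
	// Illustrative content; a real config.json carries many more fields.
	raw := []byte(`{"hostname":"test-pod","linux":{"resources":{"cpu":{"cpus":"2-3"}}}}`)
	hostname, cpus, err := parseConfig(raw)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("hostname=%s cpus=%s\n", hostname, cpus)
}
```

Decoding one container's config at a time also avoids the append-mode problem raised above, since no state is shared between containers or test runs.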

ffromani · 2024-09-25T09:06:22Z

test/e2e/performanceprofile/functests/1_performance/cpu_management.go

 })

+func getPod(ctx context.Context, workerRTNode *corev1.Node, guaranteed bool) (*corev1.Pod, error) {


If this can't fail, let's just omit the error return value.
I'd call it makePod rather than getPod.

Revised with latest commit.

ffromani · 2024-09-25T09:07:12Z

test/e2e/performanceprofile/functests/1_performance/cpu_management.go

+			overlapFound := !guaranteedPodCpus.Intersection(runcCpus).IsEmpty()
+			Expect(overlapFound).ToNot(BeTrue(), fmt.Sprintf("Overlap found between guaranteedPod cpus (%s) and runtime Cpus (%s), not expected behaviour", guaranteedPodCpus, runcCpus))


or

overlap := guaranteedPodCpus.Intersection(runcCpus).List()
Expect(overlap).To(BeEmpty(), "Overlap found between guaranteedPod cpus (%s) and runtime Cpus (%s), not expected behaviour", guaranteedPodCpus, runcCpus)

This is a much better way, thanks. Revised.
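The benefit of asserting on the overlap list rather than on a boolean is that a failure shows exactly which CPUs are shared. A dependency-free sketch of the same idea follows; `intersection` is a hypothetical helper standing in for the `Intersection(...).List()` call on the cpuset type used in the test, and the CPU IDs are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// intersection returns the sorted CPU IDs present in both a and b.
func intersection(a, b []int) []int {
	seen := make(map[int]bool, len(a))
	for _, cpu := range a {
		seen[cpu] = true
	}
	overlap := []int{}
	for _, cpu := range b {
		if seen[cpu] {
			overlap = append(overlap, cpu)
			seen[cpu] = false // do not report a CPU repeated in b twice
		}
	}
	sort.Ints(overlap)
	return overlap
}

func main() {
	guaranteedPodCpus := []int{2, 3}

	// No shared CPUs: the list is empty and the check passes.
	fmt.Println(intersection(guaranteedPodCpus, []int{0, 1})) // prints []

	// CPU 2 is shared: the list itself names the offending CPU.
	fmt.Println(intersection(guaranteedPodCpus, []int{1, 2})) // prints [2]
}
```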

ffromani · 2024-09-25T09:07:47Z

test/e2e/performanceprofile/functests/1_performance/cpu_management.go

+
+		AfterEach(func() {
+			cmd := []string{"rm", "-f", "/rootfs/var/roothome/create"}
+			nodes.ExecCommand(ctx, workerRTNode, cmd)


+1 to what Shereen wrote


shajmakh · 2024-09-27T12:25:01Z

test/e2e/performanceprofile/functests/1_performance/cpu_management.go

+			Expect(len(hostnameMatches)).ToNot(Equal(0), "Failed to extract hostname information")
+			Expect(len(cpusMatches)).ToNot(Equal(0), "Failed to extract cpus information")


nit: Expect(len(x)).ToNot(Equal(0)) can be written as Expect(x).ToNot(BeEmpty())

shajmakh · 2024-09-27T12:38:05Z

test/e2e/performanceprofile/functests/1_performance/cpu_management.go

+			bestEffortPod = makePod(ctx, workerRTNode, false)
+			err = testclient.Client.Create(ctx, bestEffortPod)


how do we guarantee that the pod will be created with a different name from the GU pod?

shajmakh · 2024-09-27T12:39:04Z

test/e2e/performanceprofile/functests/1_performance/cpu_management.go

+			_, err = pods.WaitForCondition(ctx, client.ObjectKeyFromObject(guaranteedPod), corev1.PodReady, corev1.ConditionTrue, 5*time.Minute)
+			Expect(err).ToNot(HaveOccurred(), "Guaranteed pod did not become ready in time")
+			Expect(guaranteedPod.Status.QOSClass).To(Equal(corev1.PodQOSGuaranteed), "Guaranteed pod does not have the correct QOSClass")
+


An info log would be helpful here, something like:
klog.Infof("<QoS> pod %s/%s was successfully created", updatedPodObj.Namespace, updatedPodObj.Name)
Same for the best-effort pod.

shajmakh · 2024-09-27T12:43:17Z

test/e2e/performanceprofile/functests/1_performance/cpu_management.go

+		AfterEach(func() {
+			By("Cleaning up pods")
+			deleteTestPod(ctx, guaranteedPod)
+			deleteTestPod(ctx, bestEffortPod)
+		})


You can avoid the AfterEach because the pods are created inside this specific It(). Convert the cleanup to a defer right after each pod creation (as close as possible to the creation):

guaranteedPod = makePod(ctx, workerRTNode, true)
err := testclient.Client.Create(ctx, guaranteedPod)
Expect(err).ToNot(HaveOccurred(), "Failed to create guaranteed pod")
defer deleteTestPod(ctx, guaranteedPod)
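The pattern can be illustrated without a cluster. In the sketch below, `fakeCluster`, `createPod`, `deletePod`, and `runTest` are all hypothetical stand-ins (for the API client, `testclient.Client.Create`, `deleteTestPod`, and the body of the It()); the point is only that each cleanup is registered immediately after the matching create, so it runs even when a later step fails.

```go
package main

import "fmt"

// fakeCluster is a hypothetical in-memory stand-in for the API client.
type fakeCluster struct {
	pods map[string]bool
}

// createPod fails if a pod with the same name already exists.
func (c *fakeCluster) createPod(name string) error {
	if c.pods[name] {
		return fmt.Errorf("pod %q already exists", name)
	}
	c.pods[name] = true
	return nil
}

func (c *fakeCluster) deletePod(name string) {
	delete(c.pods, name)
}

// runTest creates two pods and registers each cleanup right after the
// corresponding create, so cleanup runs even if a later step fails.
func runTest(c *fakeCluster) error {
	if err := c.createPod("guaranteed"); err != nil {
		return err
	}
	defer c.deletePod("guaranteed")

	if err := c.createPod("best-effort"); err != nil {
		return err
	}
	defer c.deletePod("best-effort")

	// The test body would go here.
	return nil
}

func main() {
	c := &fakeCluster{pods: map[string]bool{}}
	err := runTest(c)
	fmt.Println("error:", err, "pods left:", len(c.pods))
}
```

Because the deferred calls run in reverse order of registration, the best-effort pod is removed before the guaranteed one. The sketch also shows why distinct pod names matter: a second create with a name already in use fails.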

ffromani

I'm not sure this test checks the right thing. We do check that a BE pod has no overlap with the CPUs exclusively assigned to a Guaranteed pod, but the problem here is not what happens at runtime; it is what happened at pod creation time. Once the pod is running, runc has terminated, and there's no trace of where it ran.

openshift-ci · 2024-09-27T15:38:41Z

@SargunNarula: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

SargunNarula · 2024-11-22T13:57:14Z

@ffromani The original issue identified was that when launching a guaranteed pod running a cyclic test, the runc container creation process was observed to be running on isolated CPUs. This process inadvertently utilized the CPUs allocated to the cyclic test.

The resolution involved ensuring that the cpuset.cpus configuration is passed during container creation.

Additionally, since runc follows a two-step creation process, the initialization process (executed as /usr/bin/pod, which is a symlink to /usr/bin/runc) is started within a container. This container is assigned the cpuset.cpus values. This behavior can be confirmed by examining the config.json of the initialization container to verify that the appropriate CPU allocation is applied: reserved CPUs in the case of a guaranteed pod, or all available CPUs in the case of a Best-Effort (BE) pod.

Reference:

Based on these observations, the current patch may not effectively validate this scenario. I will work on a revised patch to accurately verify the CPUs being utilized.
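The check described in this comment (compare the cpuset found in the initialization container's config.json with the set expected for the pod's QoS class) can be sketched as follows. This is only an illustration of the stated expectation, not the revised patch: `parseCPUList`, `sameCPUs`, and `expectedInitCPUs` are hypothetical helpers, and the CPU lists in `main` are made-up values.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseCPUList expands a Linux cpuset list such as "0-2,7" into sorted,
// de-duplicated CPU IDs.
func parseCPUList(list string) ([]int, error) {
	cpus := []int{}
	list = strings.TrimSpace(list)
	if list == "" {
		return cpus, nil
	}
	seen := map[int]bool{}
	for _, part := range strings.Split(list, ",") {
		lo, hi, isRange := strings.Cut(strings.TrimSpace(part), "-")
		first, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("invalid CPU entry %q: %w", part, err)
		}
		last := first
		if isRange {
			if last, err = strconv.Atoi(hi); err != nil {
				return nil, fmt.Errorf("invalid CPU entry %q: %w", part, err)
			}
		}
		if last < first {
			return nil, fmt.Errorf("invalid CPU range %q", part)
		}
		for cpu := first; cpu <= last; cpu++ {
			if !seen[cpu] {
				seen[cpu] = true
				cpus = append(cpus, cpu)
			}
		}
	}
	sort.Ints(cpus)
	return cpus, nil
}

// sameCPUs reports whether two cpuset lists describe the same CPUs,
// regardless of formatting ("0,1" equals "0-1").
func sameCPUs(a, b string) (bool, error) {
	setA, err := parseCPUList(a)
	if err != nil {
		return false, err
	}
	setB, err := parseCPUList(b)
	if err != nil {
		return false, err
	}
	if len(setA) != len(setB) {
		return false, nil
	}
	for i := range setA {
		if setA[i] != setB[i] {
			return false, nil
		}
	}
	return true, nil
}

// expectedInitCPUs follows the expectation stated in the comment above:
// reserved CPUs for a guaranteed pod, all online CPUs otherwise.
func expectedInitCPUs(guaranteed bool, reserved, online string) string {
	if guaranteed {
		return reserved
	}
	return online
}

func main() {
	reserved, online := "0-1", "0-7" // made-up values for illustration
	fromConfig := "0,1"              // hypothetical value read from config.json
	ok, err := sameCPUs(fromConfig, expectedInitCPUs(true, reserved, online))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("init container uses the expected CPUs:", ok)
}
```

Comparing parsed sets instead of raw strings keeps the check independent of how the runtime formats the list.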

openshift-ci-robot added the jira/valid-reference label Jun 21, 2024

openshift-ci bot requested review from jmencak and swatisehgal June 21, 2024 10:08

SargunNarula changed the title ~~NO-JIRA: E2E: Add test to verify runc uses valid cpus~~ OCPBUGS-35911: E2E: Add test to verify runc uses valid cpus Jun 21, 2024

openshift-ci-robot added the jira/valid-bug label Jun 21, 2024

SargunNarula force-pushed the runc_cpu_isolation branch from 06dcbf0 to 33fd5ac Compare June 21, 2024 11:55

SargunNarula changed the title ~~OCPBUGS-35911: E2E: Add test to verify runc uses valid cpus~~ OCPBUGS-35911: E2E: Add test to verify runc process excludes the cpus used by pod. Jun 21, 2024

mrniranjan reviewed Jun 25, 2024 (three times)

shajmakh reviewed Jun 26, 2024

SargunNarula force-pushed the runc_cpu_isolation branch from 1fa3baa to e983f94 Compare July 17, 2024 17:47

SargunNarula force-pushed the runc_cpu_isolation branch 2 times, most recently from 83fa93a to c6d89e5 Compare August 16, 2024 11:31

SargunNarula force-pushed the runc_cpu_isolation branch from c6d89e5 to 4007668 Compare September 24, 2024 10:11

shajmakh reviewed Sep 24, 2024

SargunNarula force-pushed the runc_cpu_isolation branch 2 times, most recently from 02e4f19 to 73a257b Compare September 24, 2024 11:46

shajmakh reviewed Sep 24, 2024

ffromani reviewed Sep 25, 2024

SargunNarula force-pushed the runc_cpu_isolation branch from 73a257b to 083325f Compare September 27, 2024 11:36

E2E: Add test to verify runc uses valid cpus

10d3c35

Adding a test to verify that runc does not use CPUs assigned to guaranteed pods. Signed-off-by: Sargun Narula <[email protected]>

SargunNarula force-pushed the runc_cpu_isolation branch from 083325f to 10d3c35 Compare September 27, 2024 12:02

shajmakh reviewed Sep 27, 2024

ffromani reviewed Sep 27, 2024
