OCPBUGS-35911: E2E: Add test to verify runc process excludes the cpus used by pod. #1088
base: master
Conversation
@SargunNarula: This pull request explicitly references no jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: SargunNarula. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@SargunNarula: This pull request references Jira Issue OCPBUGS-35911, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@SargunNarula: This pull request references Jira Issue OCPBUGS-35911, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Force-pushed from 06dcbf0 to 33fd5ac (compare)
Thanks for this. I added a few initial comments below.
AfterEach(func() {
	cmd := []string{"rm", "-f", "/rootfs/var/roothome/create"}
	nodes.ExecCommand(ctx, workerRTNode, cmd)
let's add error handling here, please
The test fails at this line with a context deadline error every time an Expect on the error is added. I have manually verified that if the Expect is removed, the rm command does complete on the system. I tried using a custom context with context.WithDeadline and context.WithTimeout, but the test still failed. The same happens at the next error-handling spot as well.
I wouldn't completely ignore the error, though; a warning report or a klog.Error is fine.
We can also make sure that after this command is executed the file was indeed deleted, as a workaround for handling the error.
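A rough sketch of that suggestion, assuming nodes.ExecCommand returns the command output as bytes plus an error and fails on a non-zero exit code (both assumptions about the helper, not verified; klog and strings imports assumed):

out, err := nodes.ExecCommand(ctx, workerRTNode, cmd)
if err != nil {
	// don't fail the cleanup hard, but leave a trace for debugging
	klog.Errorf("failed to remove intermediate file: %v; output: %q", err, out)
}

// workaround: verify the file is really gone rather than trusting rm's exit status
checkCmd := []string{"/bin/bash", "-c", "test -e /rootfs/var/roothome/create && echo present || echo absent"}
out, err = nodes.ExecCommand(ctx, workerRTNode, checkCmd)
Expect(err).ToNot(HaveOccurred())
Expect(strings.TrimSpace(string(out))).To(Equal("absent"), "intermediate file still present after cleanup")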
+1 to what Shereen wrote
I have removed the intermediate-file logic, so this command is no longer required. Error handling has been added in the other places.
for _, containerID := range containerIDs {
	path := fmt.Sprintf("/rootfs/var/run/containers/storage/overlay-containers/%s/userdata/config.json", containerID)
	cmd := []string{"/bin/bash", "-c", fmt.Sprintf("cat %s >> /rootfs/var/roothome/create", path)}
	nodes.ExecCommand(ctx, workerRTNode, cmd)
let's please add error handling in all places
I added a comment above about the context deadline error; it applies here too.
With the new logic, no file handling is performed, and error handling has been added where required.
Force-pushed from 1fa3baa to e983f94 (compare)
Force-pushed from 83fa93a to c6d89e5 (compare)
looks good to me from my side.
Force-pushed from c6d89e5 to 4007668 (compare)
Force-pushed from 02e4f19 to 73a257b (compare)
	deleteTestPod(ctx, bestEffortPod)
})

It("[test_id: 74461] Verify that runc excludes the cpus used by guaranteed pod", func() {
let's please add By() steps and logs where they fit, to help follow the flow of the test when debugging
the testing logic could maybe be simplified, but no major objections it seems; questions and possible improvements inside
Expect(err).ToNot(HaveOccurred())
for _, containerID := range containerIDs {
	path := fmt.Sprintf("/rootfs/var/run/containers/storage/overlay-containers/%s/userdata/config.json", containerID)
	cmd := []string{"/bin/bash", "-c", fmt.Sprintf("cat %s >> /rootfs/var/roothome/create", path)}
why do we need to copy the data into an intermediate file? can't we read it directly and store it in a variable in this test?
if we do need an intermediate file, since we write in append mode, let's make sure the file is actually empty before the test starts. If, because of a bug, the file is not created properly, we will have a corrupted state that is very hard to notice.
Another approach (possibly better) is to create a temp file for this test.
I think the simpler solution is to just avoid the intermediate file
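A minimal sketch of the no-intermediate-file approach, assuming nodes.ExecCommand returns the command output as bytes plus an error (the variable names are illustrative, not the revised test):

var configs []string
for _, containerID := range containerIDs {
	path := fmt.Sprintf("/rootfs/var/run/containers/storage/overlay-containers/%s/userdata/config.json", containerID)
	out, err := nodes.ExecCommand(ctx, workerRTNode, []string{"cat", path})
	Expect(err).ToNot(HaveOccurred(), "failed to read %s", path)
	// keep the config.json content in memory instead of appending it to a file on the node
	configs = append(configs, string(out))
}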
Thanks @ffromani for the advice on the simpler solution. Revised with the latest commit.
})

func getPod(ctx context.Context, workerRTNode *corev1.Node, guaranteed bool) (*corev1.Pod, error) {
if this can't fail, let's just omit the error return value. I'd call it makePod rather than getPod
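A compact sketch of that shape; the namespace constant, image, resource values, and corev1/metav1/resource imports below are illustrative assumptions, not the helper actually added in this PR:

// sketch only: best-effort by default, guaranteed when requested; no error return
func makePod(ctx context.Context, workerRTNode *corev1.Node, guaranteed bool) *corev1.Pod {
	testPod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "test-",
			Namespace:    testutils.NamespaceTesting, // assumed test-namespace constant
		},
		Spec: corev1.PodSpec{
			NodeName: workerRTNode.Name,
			Containers: []corev1.Container{{
				Name:    "test",
				Image:   "quay.io/example/test:latest", // placeholder image
				Command: []string{"sleep", "inf"},
			}},
		},
	}
	if guaranteed {
		testPod.Spec.Containers[0].Resources = corev1.ResourceRequirements{
			Limits: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("2"),
				corev1.ResourceMemory: resource.MustParse("200Mi"),
			},
		}
	}
	return testPod
}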
Revised with latest commit.
overlapFound := !guaranteedPodCpus.Intersection(runcCpus).IsEmpty()
Expect(overlapFound).ToNot(BeTrue(), fmt.Sprintf("Overlap found between guaranteedPod cpus (%s) and runtime Cpus (%s), not expected behaviour", guaranteedPodCpus, runcCpus))
or
overlap := guaranteedPodCpus.Intersection(runcCpus).List()
Expect(overlap).To(BeEmpty(), "Overlap found between guaranteedPod cpus (%s) and runtime Cpus (%s), not expected behaviour", guaranteedPodCpus, runcCpus)
This is a much better way, thanks. Revised.
AfterEach(func() {
	cmd := []string{"rm", "-f", "/rootfs/var/roothome/create"}
	nodes.ExecCommand(ctx, workerRTNode, cmd)
+1 to what Shereen wrote
Force-pushed from 73a257b to 083325f (compare)
Adding a test to verify that runc does not use CPUs assigned to guaranteed pods. Signed-off-by: Sargun Narula <[email protected]>
Force-pushed from 083325f to 10d3c35 (compare)
Expect(len(hostnameMatches)).ToNot(Equal(0), "Failed to extract hostname information")
Expect(len(cpusMatches)).ToNot(Equal(0), "Failed to extract cpus information")
nit: Expect(len(x)).ToNot(Equal(0)) can be written as Expect(x).ToNot(BeEmpty())
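Applied to the two assertions quoted above (assuming hostnameMatches and cpusMatches are slices of regex matches), that would read:

Expect(hostnameMatches).ToNot(BeEmpty(), "Failed to extract hostname information")
Expect(cpusMatches).ToNot(BeEmpty(), "Failed to extract cpus information")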
bestEffortPod = makePod(ctx, workerRTNode, false)
err = testclient.Client.Create(ctx, bestEffortPod)
how do we guarantee that the pod will be created with a different name from the GU pod?
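One hedged option (not necessarily what the PR ends up doing) is to rely on GenerateName instead of a fixed Name, so the API server appends a unique suffix to each pod:

// sketch: with Name left empty and GenerateName set, the API server generates
// a unique name, so the best-effort pod cannot collide with the guaranteed pod
bestEffortPod.ObjectMeta.Name = ""
bestEffortPod.ObjectMeta.GenerateName = "best-effort-"
err = testclient.Client.Create(ctx, bestEffortPod)
Expect(err).ToNot(HaveOccurred(), "Failed to create best-effort pod")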
_, err = pods.WaitForCondition(ctx, client.ObjectKeyFromObject(guaranteedPod), corev1.PodReady, corev1.ConditionTrue, 5*time.Minute)
Expect(err).ToNot(HaveOccurred(), "Guaranteed pod did not become ready in time")
Expect(guaranteedPod.Status.QOSClass).To(Equal(corev1.PodQOSGuaranteed), "Guaranteed pod does not have the correct QOSClass")
an info report would be helpful here, something like:
klog.InfoS("<QoS> pod %s/%s was successfully created", updatedPodObj.Namespace, updatedPodObj.Name)
same for the best-effort pod
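Note that klog.InfoS takes a message plus structured key/value pairs rather than printf-style verbs, so the report could look like this (a sketch using the pod variables from the test, not the exact wording the PR adopts):

klog.InfoS("guaranteed pod created successfully",
	"namespace", guaranteedPod.Namespace, "name", guaranteedPod.Name, "qosClass", guaranteedPod.Status.QOSClass)
klog.InfoS("best-effort pod created successfully",
	"namespace", bestEffortPod.Namespace, "name", bestEffortPod.Name, "qosClass", bestEffortPod.Status.QOSClass)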
AfterEach(func() {
	By("Cleaning up pods")
	deleteTestPod(ctx, guaranteedPod)
	deleteTestPod(ctx, bestEffortPod)
})
you can avoid the AfterEach because the pods are created inside the specific It(); convert the cleanup to a defer right after each pod creation (as close as possible to the creation):
guaranteedPod = makePod(ctx, workerRTNode, true)
err := testclient.Client.Create(ctx, guaranteedPod)
Expect(err).ToNot(HaveOccurred(), "Failed to create guaranteed pod")
defer deleteTestPod(ctx, guaranteedPod)
I'm not sure this test checks the correct thing. We do check that a BE pod has no overlap with CPUs exclusively assigned to a Guaranteed pod, but the problem here is not what happens at runtime, but what happened at pod creation time. Once the pod is running, runc is terminated, and there's no trace of where it ran.
@SargunNarula: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@ffromani The original issue was that when launching a guaranteed pod running a cyclic test, the runc container-creation process was observed running on isolated CPUs, inadvertently using the CPUs allocated to the cyclic test. The resolution was to ensure that the cpuset.cpus configuration is passed during container creation. Additionally, since runc follows a two-step creation process, the initialization process (executed as /usr/bin/pod, which is a symlink to /usr/bin/runc) is started within a container, and this container is assigned the cpuset.cpus values. This behavior can be confirmed by examining the config.json of the initialization container to verify that the appropriate CPU allocation is applied: the reserved CPUs in the case of a guaranteed pod, or all available CPUs in the case of a Best-Effort (BE) pod. Reference: Based on these observations, the current patch may not effectively validate this scenario. I will work on a revised patch to accurately verify the CPUs being used.
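For illustration, a rough sketch of how the initialization container's config.json could be checked for the expected cpuset. The field names follow the OCI runtime spec, but the trimmed-down struct, the cpuset helper usage, and the assumption that nodes.ExecCommand returns the output as bytes are all editorial assumptions, not the revised patch:

// trimmed-down view of the OCI runtime-spec config.json: only the cpuset field we care about
type ociConfig struct {
	Linux struct {
		Resources struct {
			CPU struct {
				Cpus string `json:"cpus"`
			} `json:"cpu"`
		} `json:"resources"`
	} `json:"linux"`
}

path := fmt.Sprintf("/rootfs/var/run/containers/storage/overlay-containers/%s/userdata/config.json", containerID)
out, err := nodes.ExecCommand(ctx, workerRTNode, []string{"cat", path})
Expect(err).ToNot(HaveOccurred(), "failed to read %s", path)

var cfg ociConfig
err = json.Unmarshal(out, &cfg)
Expect(err).ToNot(HaveOccurred(), "failed to parse %s", path)

// for a guaranteed pod's init container we expect only the reserved CPUs here;
// for a best-effort pod, the full set of available CPUs
runtimeCpus, err := cpuset.Parse(cfg.Linux.Resources.CPU.Cpus)
Expect(err).ToNot(HaveOccurred())
Expect(runtimeCpus.Intersection(guaranteedPodCpus).IsEmpty()).To(BeTrue(),
	"runtime cpuset %q overlaps guaranteed pod cpus %q", runtimeCpus, guaranteedPodCpus)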
Adding a test to verify that runc does not use CPUs assigned to guaranteed pods.
Original bug link - https://bugzilla.redhat.com/show_bug.cgi?id=1910386