Fix nil pointer panic and spurious Auto Mode updates #171
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: demikl. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.

Hi @demikl. Thanks for your PR. I'm waiting for an aws-controllers-k8s member to verify that this patch is reasonable to test. If it is, they should reply with the appropriate command; once the patch is verified, the new status will be reflected. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Auto Mode Configuration Logic

This PR adds proper validation and handling for EKS Auto Mode, which has specific requirements from AWS.

Auto Mode requirements: AWS requires that compute, storage, and load balancing be either all enabled or all disabled together.

Changes made:
- Add isAutoModeCluster() to detect valid Auto Mode configurations
- Add validateAutoModeConfig() to enforce the all-or-nothing requirement
- Only call updateComputeConfig() for actual Auto Mode clusters
- Ignore elasticLoadBalancing absent-vs-false diffs for non-Auto Mode clusters

Breaking change: custom resources with partial Auto Mode configurations that previously succeeded will now fail validation with clear error messages, guiding users toward correct configurations. This ensures the controller behavior matches AWS Auto Mode requirements and prevents the nil pointer crashes reported in issue aws-controllers-k8s/community#2619.
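The all-or-nothing rule described above can be sketched in Python. This is an illustrative model only: the actual controller is written in Go, and the spec field paths used here (`computeConfig`, `storageConfig.blockStorage`, `kubernetesNetworkConfig.elasticLoadBalancing`) are assumptions about the custom resource shape, not code from the repository.

```python
# Hypothetical sketch of the all-or-nothing Auto Mode rule: compute, block
# storage, and elastic load balancing must all be enabled or all disabled.
# Field paths are assumed to mirror the EKS API shape.

def auto_mode_flags(spec: dict):
    """Extract the three Auto Mode toggles from a cluster spec.

    A missing section or missing "enabled" key is treated as disabled.
    """
    compute = (spec.get("computeConfig") or {}).get("enabled", False)
    storage = ((spec.get("storageConfig") or {}).get("blockStorage") or {}).get("enabled", False)
    lb = ((spec.get("kubernetesNetworkConfig") or {}).get("elasticLoadBalancing") or {}).get("enabled", False)
    return compute, storage, lb

def is_auto_mode_cluster(spec: dict) -> bool:
    """A cluster counts as Auto Mode only when all three toggles are enabled."""
    return all(auto_mode_flags(spec))

def validate_auto_mode_config(spec: dict) -> None:
    """Reject partial configurations: the three flags must agree."""
    flags = auto_mode_flags(spec)
    if len(set(flags)) > 1:
        raise ValueError(
            "Auto Mode requires compute, block storage, and load balancing "
            f"to be all enabled or all disabled, got compute={flags[0]}, "
            f"storage={flags[1]}, loadBalancing={flags[2]}"
        )
```

A partial spec (for example, only `computeConfig.enabled: true`) fails validation instead of reaching the update path, which is where the nil pointer panic occurred.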
Force-pushed from 8290936 to 85a97c1
/ok-to-test
Hey @demikl, thank you for this!! Can you add a test for the auto-mode behavior? One creating a non-auto cluster and switching it to Auto Mode, and one trying to turn it off with incorrect parameters.
/retest
Hi @rushmash91, I'm seeing behavior in the Auto Mode activation test that I can't explain.
Hi @demikl, questions: I usually test the API describe and update behavior directly via the CLI to see if there are any caveats being missed.
Thanks for the clarification, @rushmash91. I can reliably reproduce the expected behavior (DescribeCluster showing the three Auto Mode sections) when running the same workflow locally/manually. Because I still can't determine why the Prow run's DescribeCluster response omits those sections after a Successful AutoModeUpdate, I've updated the test to assert the transition using list-updates + describe-update only (type=AutoModeUpdate, status=Successful, three params all enabled). This has been consistent across runs. Let me know if you'd prefer that I keep a (soft) DescribeCluster check as a best-effort, or leave it as-is with the update-based validation.
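For reference, the update-based assertion described here can be reduced to a small predicate over the describe-update record. The param type strings and the string value "true" below are assumptions about the response shape, not verified against the EKS API; the overall type/status/params structure follows the describe-update response.

```python
# Sketch of the update-based check: given the "update" object from
# `aws eks describe-update`, confirm it records a successful Auto Mode
# activation with all three parameters enabled. Param names and values
# are assumed, not verified against the real API.

def is_successful_auto_mode_update(update: dict) -> bool:
    # Build a {param type: value} map from the update record.
    params = {p["type"]: p["value"] for p in update.get("params", [])}
    return (
        update.get("type") == "AutoModeUpdate"
        and update.get("status") == "Successful"
        # Assumed param names and string values; adjust to the real API shape.
        and all(params.get(t) == "true"
                for t in ("ComputeConfig", "StorageConfig", "LoadBalancing"))
    )
```

In the test this would run against each update returned by list-updates, passing only when one matching record is found.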
/retest
/retest
Hi, do you need anything more from me for this PR to be accepted?
Hey @demikl, thank you! This is great!
- Add isAutoModeCluster() function to detect valid Auto Mode configurations
- Add validateAutoModeConfig() to enforce AWS requirement that compute, storage, and load balancing must all be enabled/disabled together
- Only call updateComputeConfig() for actual Auto Mode clusters
- Ignore elasticLoadBalancing absent vs false diffs for non-Auto Mode clusters

Fixes aws-controllers-k8s/community#2619
Force-pushed from a00dd91 to d5a11c1
I've merged the tests as requested. It looks like there is a flaky test that impacts my PR checks 😞: test_cluster_adopt_update
/test eks-kind-e2e
Hi @demikl,

Thank you! The tests look good, left a few small nits.
```python
@pytest.fixture
def simple_cluster(eks_client):
```
There is a fixture in test_cluster.py; can we reuse that?
Sure, done in commit 273c53cae653e1257a2df9742447c0cdee49ccc6.
```python
# Time (seconds) to wait after EKS DescribeCluster reports the cluster ACTIVE before
# re-reading the CR. This gives the controller a chance to observe the external state
# transition and update the CR status fields.
POST_ACTIVE_REFRESH_SECONDS = 30
```
We can declare these at the start of the file; that is the convention we have for the integration tests.
Also, we can use the same timeouts above: MODIFY_WAIT_AFTER_SECONDS and CHECK_STATUS_WAIT_SECONDS.
I initially tried to reuse the existing constants, but their values (240 seconds) were too large and caused my tests to fail due to timeouts. To avoid disrupting existing functionality that isn't related to my PR, I opted to define new constants with more appropriate timing values for the Auto Mode tests.
However, I'm open to reducing the existing constants if that works better for the codebase. I can test whether lowering the original timeout values breaks any existing tests and update accordingly if they all pass.
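Following the convention requested above, all of the timing constants could sit together at the top of the test module. A sketch: the 240-second values are the existing ones mentioned in this thread, and the constant names other than POST_ACTIVE_REFRESH_SECONDS are taken from the reviewer's comment.

```python
# Module-level timing constants, declared at the top of the test file per
# the integration-test convention. The 240-second values are the existing
# constants discussed in this thread; 30 seconds is the shorter wait tuned
# for the Auto Mode tests.
MODIFY_WAIT_AFTER_SECONDS = 240
CHECK_STATUS_WAIT_SECONDS = 240
POST_ACTIVE_REFRESH_SECONDS = 30
```

Keeping a separate, smaller constant for the Auto Mode tests avoids changing the behavior of the existing tests that depend on the longer waits.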
```python
    }
}
k8s.patch_custom_resource(ref, patch_enable_auto_mode)
time.sleep(self.PATCH_RECONCILE_GRACE_SECONDS)
```
The reconciliation should be immediate; is this required?
You're absolutely correct. Since this error originates from the controller's validation logic rather than the EKS API, the status should indeed be reconciled immediately without waiting for external API responses. Fixed in commit bee62c8a374b4a18a1c9e54de18108bf26582123.
According to the latest e2e tests, it looks like we definitely need some time for the controller to PATCH the cluster custom resource with the new status showing the expected error. I'm going to re-add a slight delay.
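A middle ground between an immediate read and a fixed sleep is a bounded poll: keep re-reading the CR status until the expected condition appears or a deadline passes. A generic, hypothetical helper follows; the predicate would wrap a `k8s.get_resource`-style read of the status field.

```python
import time

def wait_for(predicate, timeout_seconds=30.0, interval_seconds=1.0,
             clock=time.monotonic, sleep=time.sleep):
    """Poll `predicate` until it returns truthy or the deadline passes.

    Returns True on success, False on timeout. Injecting `clock` and
    `sleep` keeps the helper testable without real waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_seconds)
```

With this pattern the test succeeds as soon as the controller has patched the status, and only pays the full delay when the status never converges.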
/retest
Hey @demikl, I see that your tests are failing; we can merge this if they pass. The changes look good to me 🙂
In the latest run, the two failing tests are out of scope for my changes. I hope the current state is OK for you?
/retest

/test eks-kind-e2e
Force-pushed from b2c711b to 3a92e5c

Force-pushed from 3a92e5c to 3b5856f
@demikl: The following test failed.

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Description
Fixes nil pointer panic in updateComputeConfig and prevents spurious Auto Mode updates for non-Auto Mode clusters.
Related Issue
Fixes aws-controllers-k8s/community#2619
Changes
- Add isAutoModeCluster() function to detect Auto Mode clusters
- Only call updateComputeConfig() for actual Auto Mode clusters

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.