Fix nil pointer panic and spurious Auto Mode updates #171
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: demikl. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.

Hi @demikl. Thanks for your PR. I'm waiting for an aws-controllers-k8s member to verify that this patch is reasonable to test. If it is, they should reply with the appropriate command; once the patch is verified, the new status will be reflected. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Auto Mode Configuration Logic

This PR adds proper validation and handling for EKS Auto Mode, which has specific requirements from AWS.

Auto Mode requirements: AWS requires that compute, storage, and load balancing be either all enabled or all disabled together.

Changes made:
- Add isAutoModeCluster() to detect valid Auto Mode configurations
- Add validateAutoModeConfig() to enforce the all-or-nothing requirement
- Only call updateComputeConfig() for actual Auto Mode clusters
- Ignore elasticLoadBalancing absent-vs-false diffs for non-Auto Mode clusters

Breaking change: custom resources with partial Auto Mode configurations that previously succeeded will now fail validation with clear error messages, guiding users toward correct configurations. This ensures the controller behavior matches AWS Auto Mode requirements and prevents the nil pointer crashes reported in issue aws-controllers-k8s/community#2619.
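The all-or-nothing rule described above can be sketched in Python. This is an illustrative model only: the actual controller is written in Go, and the spec field paths used here (`computeConfig`, `storageConfig.blockStorage`, `kubernetesNetworkConfig.elasticLoadBalancing`) are assumptions about the custom resource shape, not code from the repository.

```python
# Hypothetical sketch of the all-or-nothing Auto Mode rule: compute, block
# storage, and elastic load balancing must all be enabled or all disabled.
# Field paths are assumed to mirror the EKS API shape.

def auto_mode_flags(spec: dict):
    """Extract the three Auto Mode toggles from a cluster spec.

    A missing section or missing "enabled" key is treated as disabled.
    """
    compute = (spec.get("computeConfig") or {}).get("enabled", False)
    storage = ((spec.get("storageConfig") or {}).get("blockStorage") or {}).get("enabled", False)
    lb = ((spec.get("kubernetesNetworkConfig") or {}).get("elasticLoadBalancing") or {}).get("enabled", False)
    return compute, storage, lb

def is_auto_mode_cluster(spec: dict) -> bool:
    """A cluster counts as Auto Mode only when all three toggles are enabled."""
    return all(auto_mode_flags(spec))

def validate_auto_mode_config(spec: dict) -> None:
    """Reject partial configurations: the three flags must agree."""
    flags = auto_mode_flags(spec)
    if len(set(flags)) > 1:
        raise ValueError(
            "Auto Mode requires compute, block storage, and load balancing "
            f"to be all enabled or all disabled, got compute={flags[0]}, "
            f"storage={flags[1]}, loadBalancing={flags[2]}"
        )
```

A partial spec (for example, only `computeConfig.enabled: true`) fails validation instead of reaching the update path, which is where the nil pointer panic occurred.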
Force-pushed from 8290936 to 85a97c1
/ok-to-test
Hey @demikl, thank you for this!! Can you add a test for the auto-mode behavior? One creating a non-auto cluster and switching it to Auto Mode, and one trying to turn it off with incorrect parameters.
/retest
Hi @rushmash91, I'm seeing behavior in the Auto Mode activation test that I can't explain.
Hi @demikl, questions: I usually test the API describe and update behavior directly via the CLI to see if there are any caveats being missed.
Thanks for the clarification, @rushmash91. I can reliably reproduce the expected behavior (DescribeCluster showing the three Auto Mode sections) when running the same workflow locally/manually. Because I still can't determine why the Prow run's DescribeCluster response omits those sections after a Successful AutoModeUpdate, I've updated the test to assert the transition using list-updates + describe-update only (type=AutoModeUpdate, status=Successful, three params all enabled). This has been consistent across runs. Let me know if you'd prefer that I keep a (soft) DescribeCluster check as a best-effort, or leave it as-is with the update-based validation.
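For reference, the update-based assertion described here can be reduced to a small predicate over the describe-update record. The param type strings and the string value "true" below are assumptions about the response shape, not verified against the EKS API; the overall type/status/params structure follows the describe-update response.

```python
# Sketch of the update-based check: given the "update" object from
# `aws eks describe-update`, confirm it records a successful Auto Mode
# activation with all three parameters enabled. Param names and values
# are assumed, not verified against the real API.

def is_successful_auto_mode_update(update: dict) -> bool:
    # Build a {param type: value} map from the update record.
    params = {p["type"]: p["value"] for p in update.get("params", [])}
    return (
        update.get("type") == "AutoModeUpdate"
        and update.get("status") == "Successful"
        # Assumed param names and string values; adjust to the real API shape.
        and all(params.get(t) == "true"
                for t in ("ComputeConfig", "StorageConfig", "LoadBalancing"))
    )
```

In the test this would run against each update returned by list-updates, passing only when one matching record is found.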
/retest
/retest
Hi, do you need anything more from me for this PR to be accepted?
Hey @demikl, thank you! This is great!
- Add isAutoModeCluster() function to detect valid Auto Mode configurations
- Add validateAutoModeConfig() to enforce AWS requirement that compute, storage, and load balancing must all be enabled/disabled together
- Only call updateComputeConfig() for actual Auto Mode clusters
- Ignore elasticLoadBalancing absent vs false diffs for non-Auto Mode clusters

Fixes aws-controllers-k8s/community#2619
Force-pushed from a00dd91 to d5a11c1
I've merged the tests as requested. It looks like there is a flaky test that impacts my PR checks 😞: test_cluster_adopt_update
/test eks-kind-e2e
Hi @demikl,

Thank you! The tests look good, left a few small nits.
```python
@pytest.fixture
def simple_cluster(eks_client):
```
There is a fixture in test_cluster.py; can we reuse that?
Sure, done in commit 273c53cae653e1257a2df9742447c0cdee49ccc6.
```python
# Time (seconds) to wait after EKS DescribeCluster reports the cluster ACTIVE before
# re-reading the CR. This gives the controller a chance to observe the external state
# transition and update the CR status fields.
POST_ACTIVE_REFRESH_SECONDS = 30
```
We can declare these at the start of the file; that is the convention we have for the integration tests.
Also, we can use the same timeouts above: MODIFY_WAIT_AFTER_SECONDS and CHECK_STATUS_WAIT_SECONDS.
I initially tried to reuse the existing constants, but their values (240 seconds) were too large and caused my tests to fail due to timeouts. To avoid disrupting existing functionality that isn't related to my PR, I opted to define new constants with more appropriate timing values for the Auto Mode tests.
However, I'm open to reducing the existing constants if that works better for the codebase. I can test whether lowering the original timeout values breaks any existing tests and update accordingly if they all pass.
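Following the convention requested above, all of the timing constants could sit together at the top of the test module. A sketch: the 240-second values are the existing ones mentioned in this thread, and the constant names other than POST_ACTIVE_REFRESH_SECONDS are taken from the reviewer's comment.

```python
# Module-level timing constants, declared at the top of the test file per
# the integration-test convention. The 240-second values are the existing
# constants discussed in this thread; 30 seconds is the shorter wait tuned
# for the Auto Mode tests.
MODIFY_WAIT_AFTER_SECONDS = 240
CHECK_STATUS_WAIT_SECONDS = 240
POST_ACTIVE_REFRESH_SECONDS = 30
```

Keeping a separate, smaller constant for the Auto Mode tests avoids changing the behavior of the existing tests that depend on the longer waits.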
```python
    }
}
k8s.patch_custom_resource(ref, patch_enable_auto_mode)
time.sleep(self.PATCH_RECONCILE_GRACE_SECONDS)
```
The reconciliation should be immediate; is this required?
You're absolutely correct. Since this error originates from the controller's validation logic rather than the EKS API, the status should indeed be reconciled immediately without waiting for external API responses. Fixed in commit bee62c8a374b4a18a1c9e54de18108bf26582123.
According to the latest e2e tests, it looks like we definitely need some time for the controller to PATCH the cluster custom resource with the new status showing the expected error. I'm going to re-add a slight delay.
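A middle ground between an immediate read and a fixed sleep is a bounded poll: keep re-reading the CR status until the expected condition appears or a deadline passes. A generic, hypothetical helper follows; the predicate would wrap a `k8s.get_resource`-style read of the status field.

```python
import time

def wait_for(predicate, timeout_seconds=30.0, interval_seconds=1.0,
             clock=time.monotonic, sleep=time.sleep):
    """Poll `predicate` until it returns truthy or the deadline passes.

    Returns True on success, False on timeout. Injecting `clock` and
    `sleep` keeps the helper testable without real waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_seconds)
```

With this pattern the test succeeds as soon as the controller has patched the status, and only pays the full delay when the status never converges.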
/retest
Hey @demikl, I see that your tests are failing; we can merge this if they pass. The changes look good to me 🙂
In the latest run, the two failing tests are out of scope for my changes. I hope the current state is OK for you?
/retest

/test eks-kind-e2e
Force-pushed from b2c711b to 3a92e5c

Force-pushed from 3a92e5c to 3b5856f
@demikl: The following test failed.

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Description
Fixes nil pointer panic in updateComputeConfig and prevents spurious Auto Mode updates for non-Auto Mode clusters.
Related Issue
Fixes aws-controllers-k8s/community#2619
Changes
- Add isAutoModeCluster() function to detect Auto Mode clusters
- Only call updateComputeConfig() for actual Auto Mode clusters

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.