
EKS: Cluster Deletion Fails #32395

Open
1 task
hakenmt opened this issue Dec 5, 2024 · 5 comments
Labels
@aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service bug This issue is a bug. effort/large Large work item – several weeks of effort p2

Comments


hakenmt commented Dec 5, 2024

Describe the bug

A CFN stack containing an EKS cluster failed and attempted to roll back. The OnEventHandler custom resource that's responsible for handling cluster deletion failed to delete the resource with a permissions error. From the CloudWatch logs:

2024-12-05T07:12:25.344Z	df9e8909-0564-4c9e-9529-5546180edcec	ERROR	{
  clientName: 'EKSClient',
  commandName: 'DeleteClusterCommand',
  input: {
    name: 'multi-az-workshop-EKSNestedStackEKSNestedStackResourceAE427C53-2PG5GFQGAIDA-ClusterEKSClusterEAC9DE5C-N7TGM3G9041D'
  },
  error: AccessDeniedException: User: arn:aws:sts::123456789012:assumed-role/multi-az-workshop-EKSNest-ClusterEKSClusterCreation-2nnnu7xJhiuj/AWSCDK.EKSCluster.Delete.9e794145-daf9-44f3-88f2-d0cd7c694239 is not authorized to perform: eks:DeleteCluster on resource: arn:aws:eks:us-west-2:123456789012:cluster/multi-az-workshop-EKSNestedStackEKSNestedStackResourceAE427C53-2PG5GFQGAIDA-ClusterEKSClusterEAC9DE5C-N7TGM3G9041D
      at de_AccessDeniedExceptionRes (/var/runtime/node_modules/@aws-sdk/client-eks/dist-cjs/index.js:2546:21)
      at de_CommandError (/var/runtime/node_modules/@aws-sdk/client-eks/dist-cjs/index.js:2519:19)
      at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
      at async /var/runtime/node_modules/@aws-sdk/node_modules/@smithy/middleware-serde/dist-cjs/index.js:35:20
      at async /var/runtime/node_modules/@aws-sdk/node_modules/@smithy/core/dist-cjs/index.js:165:18
      at async /var/runtime/node_modules/@aws-sdk/node_modules/@smithy/middleware-retry/dist-cjs/index.js:320:38
      at async /var/runtime/node_modules/@aws-sdk/middleware-logger/dist-cjs/index.js:34:22
      at async Xi.onDelete (/var/task/index.js:57:649490) {
    '$fault': 'client',
    '$metadata': {
      httpStatusCode: 403,
      requestId: '68a18a1d-2119-42f0-aeae-9253766fdf5a',
      extendedRequestId: undefined,
      cfId: undefined,
      attempts: 1,
      totalRetryDelay: 0
    }
  },
  metadata: {
    httpStatusCode: 403,
    requestId: '68a18a1d-2119-42f0-aeae-9253766fdf5a',
    extendedRequestId: undefined,
    cfId: undefined,
    attempts: 1,
    totalRetryDelay: 0
  }
}

However, this doesn't happen every time, and I am wondering if there is a hidden race condition during a stack rollback where the permissions policy may get deleted before the function and its assumed role are deleted?

Regression Issue

  • Select this option if this issue appears to be a regression.

Last Known Working CDK Version

No response

Expected Behavior

I would expect the automatically created custom resource and IAM role to have the appropriate permissions.

Current Behavior

Sometimes, cluster deletion fails with a 403 error.

Reproduction Steps

I don't have specific reproduction steps since the behavior is transient. This is essentially the cluster resource definition (the VPC, security group, kubectl layer, and admin role are defined elsewhere in the stack):

Cluster cluster = new Cluster(this, "EKSCluster", new ClusterProps() {
    Vpc = props.Vpc,
    VpcSubnets = new SubnetSelection[] { new SubnetSelection() { SubnetType = SubnetType.PRIVATE_ISOLATED } },
    DefaultCapacity = 0,
    Version = KubernetesVersion.V1_31,
    PlaceClusterHandlerInVpc = false,
    EndpointAccess = EndpointAccess.PUBLIC_AND_PRIVATE,
    KubectlLayer = kubetctlLayer,
    SecurityGroup = controlPlaneSG,
    MastersRole = props.AdminRole,
    ClusterName = props.ClusterName,
    ClusterLogging = new ClusterLoggingTypes[] {
        ClusterLoggingTypes.CONTROLLER_MANAGER,
        ClusterLoggingTypes.AUTHENTICATOR,
        ClusterLoggingTypes.API,
        ClusterLoggingTypes.AUDIT,
        ClusterLoggingTypes.SCHEDULER
    }
});

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.164.1

Framework Version

No response

Node.js Version

20

OS

darwin

Language

.NET

Language Version

No response

Other information

No response

@hakenmt hakenmt added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Dec 5, 2024
@github-actions github-actions bot added the @aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service label Dec 5, 2024
@ashishdhingra
Contributor

@hakenmt Good afternoon. Thanks for opening the issue. Could you please confirm whether the issue is reproducible using the latest version of CDK? Could you also check the permissions of EKSClusterCreationRoleXXXXXXXX when this issue happens? As you pointed out, there could be a race condition causing permissions to be removed before the actual cluster deletion.
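
If it helps, here is a minimal sketch of that check using the AWS SDK for .NET (assuming the AWSSDK.IdentityManagement package; the role name is copied from the error log above and will differ per deployment):

using Amazon.IdentityManagement;
using Amazon.IdentityManagement.Model;

// Dump the inline policies still attached to the cluster-creation role at the
// time of the failure. If the race-condition theory holds, the statement that
// grants eks:DeleteCluster would be missing or scoped to a different ARN.
var iam = new AmazonIdentityManagementServiceClient();
var roleName = "multi-az-workshop-EKSNest-ClusterEKSClusterCreation-2nnnu7xJhiuj"; // from the error log
var policies = await iam.ListRolePoliciesAsync(new ListRolePoliciesRequest { RoleName = roleName });
foreach (var policyName in policies.PolicyNames)
{
    var policy = await iam.GetRolePolicyAsync(new GetRolePolicyRequest { RoleName = roleName, PolicyName = policyName });
    // GetRolePolicy returns the document URL-encoded.
    Console.WriteLine($"{policyName}: {Uri.UnescapeDataString(policy.PolicyDocument)}");
}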

Thanks,
Ashish

@ashishdhingra ashishdhingra added p2 response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed needs-triage This issue or PR still needs to be triaged. labels Dec 6, 2024
@ashishdhingra ashishdhingra self-assigned this Dec 6, 2024

github-actions bot commented Dec 9, 2024

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

@github-actions github-actions bot added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Dec 9, 2024
@hakenmt
Author

hakenmt commented Dec 9, 2024

It looks like the cluster custom resource has an explicit DependsOn for the policy:

"ClusterEKSClusterEAC9DE5C": {
   "Type": "Custom::AWSCDK-EKS-Cluster",
...
"DependsOn": [
    "ClusterClusterLogGroup1B27A8B5",
    "ClusterEKSClusterCreationRoleDefaultPolicy893CB762",
    "ClusterEKSClusterCreationRoleC6196114"
   ]
}

So it doesn't look like this should fail (but it obviously did), and I can't reproduce it easily. However, it looks like the resource in the policy and the resource in the error don't match. From the error log:

arn:aws:sts::123456789012:assumed-role/multi-az-workshop-EKSNest-ClusterEKSClusterCreation-2nnnu7xJhiuj/AWSCDK.EKSCluster.Delete.9e794145-daf9-44f3-88f2-d0cd7c694239 is not authorized to perform: eks:DeleteCluster on resource: arn:aws:eks:us-west-2:123456789012:cluster/multi-az-workshop-EKSNestedStackEKSNestedStackResourceAE427C53-2PG5GFQGAIDA-ClusterEKSClusterEAC9DE5C-N7TGM3G9041D

Where the resource in the policy statement is:

{
  "Action": [
    "eks:CreateCluster",
    "eks:CreateFargateProfile",
    "eks:DeleteCluster",
    "eks:DescribeCluster",
    "eks:DescribeUpdate",
    "eks:TagResource",
    "eks:UntagResource",
    "eks:UpdateClusterConfig",
    "eks:UpdateClusterVersion"
  ],
  "Effect": "Allow",
  "Resource": [
    {
      "Fn::Join": [
        "",
        [
          "arn:",
          { "Ref": "AWS::Partition" },
          ":eks:",
          { "Fn::Sub": "${AWS::Region}" },
          ":",
          { "Ref": "AWS::AccountId" },
          ":cluster/multi-az-workshop-eks-cluster"
        ]
      ]
    },
    {
      "Fn::Join": [
        "",
        [
          "arn:",
          { "Ref": "AWS::Partition" },
          ":eks:",
          { "Fn::Sub": "${AWS::Region}" },
          ":",
          { "Ref": "AWS::AccountId" },
          ":cluster/multi-az-workshop-eks-cluster/*"
        ]
      ]
    }
  ]
},

And the custom resource is defined as:

"ClusterEKSClusterEAC9DE5C": {
  "Type": "Custom::AWSCDK-EKS-Cluster",
  "Properties": {
    "ServiceToken": {
      "Fn::GetAtt": [
        "awscdkawseksClusterResourceProviderNestedStackawscdkawseksClusterResourceProviderNestedStackResource9827C454",
        "Outputs.multiazworkshopEKSawscdkawseksClusterResourceProviderframeworkonEvent8AE90E2DArn"
      ]
    },
    "Config": {
      "name": "multi-az-workshop-eks-cluster",
      "version": "1.31",
      ...
So it's unclear whether the construct builds the policy using the cluster name provided while the onEvent handler deletes by a different resource name (the generated physical ID in the error above). I've deployed this resource/stack hundreds of times using the same templates, so it's also unclear why this would happen.
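
One possible belt-and-suspenders workaround, sketched under the assumption that Cluster.AdminRole exposes the creation role the OnEventHandler assumes (it is a public property of the construct, but this wiring is an implementation detail that could change): widen that role's delete permission so a rollback-time delete is still authorized even if the handler targets the generated physical name rather than the configured ClusterName.

using Amazon.CDK;
using Amazon.CDK.AWS.IAM;

// Defensive grant: allow the cluster-creation role to delete/describe any
// cluster in this account and region, covering the case where the handler
// targets the generated physical ID (...ClusterEKSClusterEAC9DE5C-...) instead
// of the configured name. Tighten "cluster/*" if other clusters share the account.
cluster.AdminRole.AddToPrincipalPolicy(new PolicyStatement(new PolicyStatementProps
{
    Effect = Effect.ALLOW,
    Actions = new[] { "eks:DeleteCluster", "eks:DescribeCluster" },
    Resources = new[] { $"arn:{Aws.PARTITION}:eks:{Aws.REGION}:{Aws.ACCOUNT_ID}:cluster/*" }
}));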

@github-actions github-actions bot removed closing-soon This issue will automatically close in 4 days unless further comments are made. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Dec 9, 2024
@ashishdhingra
Contributor

CC @pahud for visibility

@pahud
Contributor

pahud commented Dec 12, 2024

Hi,

Yes, this is possible and very similar to #31032.

The Cluster resource in aws-eks is currently implemented using a custom resource; when a property update fails, the rollback can, in some edge cases, fail as well. We will continue investigating this. I am not sure whether we will be able to get it fixed, as the team is working on a new aws-eks-alpha module.

@ashishdhingra ashishdhingra added the effort/large Large work item – several weeks of effort label Dec 12, 2024
@ashishdhingra ashishdhingra removed their assignment Dec 12, 2024