EC2 Fleet with AutoScaling Group receives scale-down request prematurely #436

Open
mtn-boblloyd opened this issue Mar 18, 2024 · 1 comment
mtn-boblloyd commented Mar 18, 2024

Issue Details

Describe the bug
We have an EC2 Fleet set up to run instances on an AutoScaling Group, with a default minimum size of 0 (we don't want instances running when no jobs are running). We set up the fleet using Terraform, and configure the fleet within Jenkins using CasC (the CasC configuration for the fleet is included below).

However, we are seeing that nodes created by this AutoScaling Group are terminated after ~15 minutes of uptime, because a request is sent to AWS telling the ASG to scale down from 1 to 0. This happens consistently: we had a job running that would restart if the node was lost, and it ran repeatedly overnight, restarting every 15 minutes.

In the CloudTrail events, we see the ASG receives a request from the ec2-fleet-plugin to scale down from 1 to 0 nodes, then the instance is terminated, even though there is a job running, and Scale In protection is enabled. In the ASG events, we see that the scale down request was prevented by the Scale In protections, but the node is still terminated anyway.

I saw a few other tickets where this issue was related to the maximum jobs for the node, but ours is set to -1 to allow unlimited uses. In addition, the node never actually finishes any job, as the job it's running takes longer than 15 minutes.

To Reproduce

  1. Create an ASG with a default desired capacity of 0
  2. Configure the ASG in the Jenkins EC2 Fleet with an original desired capacity of 0
  3. Start a long-running (more than 15 minutes) job in AWS that runs on this ASG
  4. After ~15 minutes, the node is terminated, and a new node is started immediately.

Logs
From the ASG Events (note the time of the scale-down cancel is 12:35:20Z):

Successful	Terminating EC2 instance: i-08912acdba64190f4	At 2024-03-18T12:36:08Z an instance was taken out of service in response to an EC2 health check indicating it has been terminated or stopped.	2024 March 18, 08:36:08 AM -04:00	2024 March 18, 08:36:50 AM -04:00
Cancelled	Could not scale to desired capacity because all remaining instances are protected from scale-in.	At 2024-03-18T12:35:20Z a user request update of AutoScalingGroup constraints to min: 0, max: 15, desired: 0 changing the desired capacity from 1 to 0. At 2024-03-18T12:35:31Z group reached equilibrium.	2024 March 18, 08:35:31 AM -04:00	2024 March 18, 08:35:31 AM -04:00
Successful	Launching a new EC2 instance: i-08912acdba64190f4	At 2024-03-18T12:28:08Z an instance was launched in response to an unhealthy instance needing to be replaced.	2024 March 18, 08:28:11 AM -04:00	2024 March 18, 08:32:17 AM -04:00
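
For reference, the intervals between the three activity rows above can be computed directly from their UTC timestamps. The sketch below is only arithmetic on values copied from those rows; the dictionary keys are labels made up for this example, not AWS field names.

```python
from datetime import datetime, timezone


def parse_utc(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp with a trailing 'Z' as a UTC datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# Timestamps copied from the ASG activity rows above.
# The keys are hypothetical labels chosen for this example.
events = {
    "launch": parse_utc("2024-03-18T12:28:08Z"),
    "desired_capacity_1_to_0": parse_utc("2024-03-18T12:35:20Z"),
    "taken_out_of_service": parse_utc("2024-03-18T12:36:08Z"),
}


def seconds_between(start: str, end: str) -> int:
    """Return the whole number of seconds from events[start] to events[end]."""
    return int((events[end] - events[start]).total_seconds())


launch_to_scale_down = seconds_between("launch", "desired_capacity_1_to_0")
scale_down_to_removal = seconds_between("desired_capacity_1_to_0", "taken_out_of_service")

print(f"launch -> scale-down request: {launch_to_scale_down} s")
print(f"scale-down request -> taken out of service: {scale_down_to_removal} s")
```

For this excerpt, that gives 432 s (a little over 7 minutes) between the launch and the scale-down request, and 48 s between the request and the instance being taken out of service.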

From the AWS CloudTrail logs for the Instance (this event is at 12:35:20Z as well, and was initiated by the ec2-fleet-plugin):

{
    "eventVersion": "1.09",
    "userIdentity": {
        "type": "IAMUser",
        "principalId": "...",
        "arn": "arn:aws:iam::...:user/devops-jenkins-service",
        "accountId": "...",
        "accessKeyId": "...",
        "userName": "devops-jenkins-service"
    },
    "eventTime": "2024-03-18T12:35:20Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "TerminateInstances",
    "awsRegion": "us-west-1",
    "sourceIPAddress": "...",
    "userAgent": "ec2-fleet-plugin, aws-sdk-java/1.12.287 Linux/6.5.0-1014-aws OpenJDK_64-Bit_Server_VM/11.0.22+7-post-Ubuntu-0ubuntu222.04.1 java/11.0.22 groovy/2.4.21 vendor/Ubuntu cfg/retry-mode/legacy",
    "requestParameters": {
        "instancesSet": {
            "items": [
                {
                    "instanceId": "i-123"
                }
            ]
        }
    },
    "responseElements": {
        "requestId": "...",
        "instancesSet": {
            "items": [
                {
                    "instanceId": "i-123",
                    "currentState": {
                        "code": 32,
                        "name": "shutting-down"
                    },
                    "previousState": {
                        "code": 16,
                        "name": "running"
                    }
                }
            ]
        }
    },
    "requestID": "...",
    "eventID": "...",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "...",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.3",
        "cipherSuite": "TLS_AES_128_GCM_SHA256",
        "clientProvidedHostHeader": "ec2.us-west-1.amazonaws.com"
    }
}

From the AWS CloudTrail logs for this AutoScaling Group (this event is at 12:35:20Z, and was initiated by the ec2-fleet-plugin):

{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "IAMUser",
        "principalId": "AID...COV",
        "arn": "arn:aws:iam::...:user/devops-jenkins-service",
        "accountId": "...",
        "accessKeyId": "...",
        "userName": "devops-jenkins-service"
    },
    "eventTime": "2024-03-18T12:35:20Z",
    "eventSource": "autoscaling.amazonaws.com",
    "eventName": "UpdateAutoScalingGroup",
    "awsRegion": "us-west-1",
    "sourceIPAddress": "...",
    "userAgent": "ec2-fleet-plugin, aws-sdk-java/1.12.287 Linux/6.5.0-1014-aws OpenJDK_64-Bit_Server_VM/11.0.22+7-post-Ubuntu-0ubuntu222.04.1 java/11.0.22 groovy/2.4.21 vendor/Ubuntu cfg/retry-mode/legacy",
    "requestParameters": {
        "newInstancesProtectedFromScaleIn": true,
        "maxSize": 15,
        "minSize": 0,
        "desiredCapacity": 0,
        "autoScalingGroupName": "jenkins-build-node-windows-large - prod"
    },
    "responseElements": null,
    "requestID": "9....7",
    "eventID": "0.....f",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "...",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.3",
        "cipherSuite": "TLS_AES_128_GCM_SHA256",
        "clientProvidedHostHeader": "autoscaling.us-west-1.amazonaws.com"
    }
}
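
Taken together, the two CloudTrail records above suggest the instance was terminated by a direct EC2 `TerminateInstances` call rather than by the ASG's own scale-in. Scale-in protection applies only to terminations the ASG itself initiates, which would be consistent with the ASG reporting the scale-down as cancelled while the instance was terminated anyway. Below is a minimal Python sketch for pulling plugin-initiated calls out of exported CloudTrail records. The embedded JSON is a trimmed copy of the two records (the `userAgent` strings are shortened), and `plugin_events` and `summarize` are hypothetical helper names, not part of any AWS tooling.

```python
import json

# Trimmed copies of the two CloudTrail records above; only the fields used
# here are kept, and the userAgent strings are shortened.
RAW_EVENTS = """
[
  {"eventTime": "2024-03-18T12:35:20Z",
   "eventSource": "ec2.amazonaws.com",
   "eventName": "TerminateInstances",
   "userAgent": "ec2-fleet-plugin, aws-sdk-java/1.12.287",
   "requestParameters": {"instancesSet": {"items": [{"instanceId": "i-123"}]}}},
  {"eventTime": "2024-03-18T12:35:20Z",
   "eventSource": "autoscaling.amazonaws.com",
   "eventName": "UpdateAutoScalingGroup",
   "userAgent": "ec2-fleet-plugin, aws-sdk-java/1.12.287",
   "requestParameters": {"desiredCapacity": 0, "minSize": 0, "maxSize": 15,
                         "newInstancesProtectedFromScaleIn": true}}
]
"""


def plugin_events(records, agent_prefix="ec2-fleet-plugin"):
    """Keep records whose userAgent starts with agent_prefix, ordered by time then name."""
    matching = [r for r in records if r.get("userAgent", "").startswith(agent_prefix)]
    return sorted(matching, key=lambda r: (r["eventTime"], r["eventName"]))


def summarize(record):
    """One-line summary: timestamp, service, and API call."""
    return f'{record["eventTime"]} {record["eventSource"]} {record["eventName"]}'


records = json.loads(RAW_EVENTS)
summaries = [summarize(r) for r in plugin_events(records)]
for line in summaries:
    print(line)
```

Both calls carry the same timestamp and the same `ec2-fleet-plugin` user agent, so filtering on the user agent is one way to separate plugin-initiated calls from actions taken by the ASG itself.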

I've included the best logs I could find. Neither the ec2fleet logger nor the default logger in Jenkins shows any further information at FINE level.

Jenkins logs at the time of the scale-down request (there are no mentions of scale-down, only that the node is already terminated; there are no log entries at 12:35:20Z):

Mar 18 12:35:01 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:01.295+0000 [id=84]        INFO        c.a.j.e.EC2FleetOnlineChecker#run: No connection to node 'ec2-fleet-linux-medium i-123'. Attempting to connect and waiting before retry
Mar 18 12:35:01 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:01.376+0000 [id=63770]        WARNING        h.plugins.sshslaves.SSHLauncher#launch: SSH Launch of i-123 on 10.2.10.16 failed in 80 ms
Mar 18 12:35:06 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:06.266+0000 [id=42]        INFO        c.a.j.e.NoDelayProvisionStrategy#apply: label [ec2-fleet-linux-medium]: No excess workload, provisioning not needed.
Mar 18 12:35:06 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:06.266+0000 [id=42]        INFO        c.a.j.e.NoDelayProvisionStrategy#apply: label [ec2-fleet-windows-large]: No excess workload, provisioning not needed.
Mar 18 12:35:13 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:13.353+0000 [id=84]        INFO        c.a.j.e.EC2FleetOnlineChecker#run: No connection to node 'ec2-fleet-windows-large i-234'. Attempting to connect and waiting before retry
Mar 18 12:35:14 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:14.058+0000 [id=63770]        WARNING        h.plugins.sshslaves.SSHLauncher#launch: SSH Launch of i-234 on 10.2.0.176 failed in 703 ms
Mar 18 12:35:16 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:16.266+0000 [id=39]        INFO        c.a.j.e.NoDelayProvisionStrategy#apply: label [ec2-fleet-linux-medium]: No excess workload, provisioning not needed.
Mar 18 12:35:16 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:16.266+0000 [id=39]        INFO        c.a.j.e.NoDelayProvisionStrategy#apply: label [ec2-fleet-windows-large]: No excess workload, provisioning not needed.
Mar 18 12:35:16 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:16.296+0000 [id=84]        INFO        c.a.j.e.EC2FleetOnlineChecker#run: No connection to node 'ec2-fleet-linux-medium i-123'. Attempting to connect and waiting before retry
Mar 18 12:35:16 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:16.363+0000 [id=63770]        WARNING        h.plugins.sshslaves.SSHLauncher#launch: SSH Launch of i-123 on 10.2.10.16 failed in 66 ms
Mar 18 12:35:23 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:23.136+0000 [id=44]        INFO        c.a.j.ec2fleet.EC2FleetCloud#info: ec2-fleet-windows-large [ec2-fleet-windows-large] Fleet 'ec2-fleet-windows-large' no longer has the instance 'i-234'. Removing instance from Jenkins
Mar 18 12:35:23 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:23.136+0000 [id=63770]        INFO        c.a.j.e.EC2FleetAutoResubmitComputerLauncher#afterDisconnect: DISCONNECTED: ec2-fleet-windows-large i-234
Mar 18 12:35:23 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:23.137+0000 [id=63770]        INFO        c.a.j.e.EC2FleetAutoResubmitComputerLauncher#afterDisconnect: Start retriggering executors for ec2-fleet-windows-large i-234
Mar 18 12:35:23 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:23.137+0000 [id=63770]        INFO        c.a.j.e.EC2FleetAutoResubmitComputerLauncher#afterDisconnect: Finished retriggering executors for ec2-fleet-windows-large i-234
Mar 18 12:35:23 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:23.352+0000 [id=44]        INFO        c.a.j.e.EC2RetentionStrategy#isIdleForTooLong: Instance ec2-fleet-windows-medium i-04c2baecb91271418 has been idle for too long (Age: 23006529, Max Age: 300000).
Mar 18 12:35:23 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:23.352+0000 [id=44]        INFO        c.a.j.ec2fleet.EC2FleetCloud#info: ec2-fleet-windows-medium [ec2-fleet-windows-medium ec2-fleet] Not scheduling instance 'i-345' for termination because we need a minimum of 2 instance(s) running
Mar 18 12:35:23 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:23.352+0000 [id=44]        INFO        c.a.j.e.EC2RetentionStrategy#isIdleForTooLong: Instance:ec2-fleet-windows-medium i-456 Age: 39979 Max Age:300000
Mar 18 12:35:23 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:23.352+0000 [id=44]        INFO        c.a.j.e.EC2RetentionStrategy#isIdleForTooLong: Instance:ec2-fleet-linux-medium i-123 Age: 7056 Max Age:300000
Mar 18 12:35:23 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:23.353+0000 [id=44]        INFO        c.a.j.ec2fleet.EC2FleetCloud#info: ec2-fleet-windows-large [ec2-fleet-windows-large] Skipping label update, the Jenkins node for instance 'i-234' was null
Mar 18 12:35:26 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:26.266+0000 [id=46]        INFO        c.a.j.e.NoDelayProvisionStrategy#apply: label [ec2-fleet-linux-medium]: No excess workload, provisioning not needed.
Mar 18 12:35:26 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:26.266+0000 [id=46]        INFO        c.a.j.e.NoDelayProvisionStrategy#apply: label [ec2-fleet-windows-large]: No excess workload, provisioning not needed.
Mar 18 12:35:28 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:28.354+0000 [id=84]        INFO        c.a.j.e.EC2FleetOnlineChecker#run: No connection to node 'ec2-fleet-windows-large i-234'. Waiting before retry
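
The `isIdleForTooLong` entries above report `Age` and `Max Age` as raw numbers. A `Max Age` of 300000 matches `idleMinutes: 5` if the values are milliseconds, which is an assumption here rather than something the logs state. Under that assumption, this minimal Python sketch converts the values to minutes so the idle state of each node is easier to read. The regular expression and `parse_idle_line` are written for this example only, and the sample lines are the message portions of the log entries above.

```python
import re

# Matches both message shapes seen in the log excerpt above:
#   "Instance <label> <id> has been idle for too long (Age: N, Max Age: M)."
#   "Instance:<label> <id> Age: N Max Age:M"
LINE_RE = re.compile(
    r"Instance:?\s*(?P<node>\S+ \S+).*?"
    r"Age:\s*(?P<age>\d+),?\s*Max Age:\s*(?P<max_age>\d+)"
)

# Message portions copied from the Jenkins log lines above.
LOG_LINES = [
    "Instance ec2-fleet-windows-medium i-04c2baecb91271418 has been idle "
    "for too long (Age: 23006529, Max Age: 300000).",
    "Instance:ec2-fleet-windows-medium i-456 Age: 39979 Max Age:300000",
    "Instance:ec2-fleet-linux-medium i-123 Age: 7056 Max Age:300000",
]

MS_PER_MINUTE = 60_000  # assumes Age and Max Age are in milliseconds


def parse_idle_line(line):
    """Return node name, idle age and limit in minutes, or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    age_ms = int(m["age"])
    max_age_ms = int(m["max_age"])
    return {
        "node": m["node"],
        "age_min": age_ms / MS_PER_MINUTE,
        "max_age_min": max_age_ms / MS_PER_MINUTE,
        "over_limit": age_ms > max_age_ms,
    }


parsed = [parse_idle_line(line) for line in LOG_LINES]
for entry in parsed:
    print(f'{entry["node"]}: idle {entry["age_min"]:.2f} min '
          f'(limit {entry["max_age_min"]:.0f} min), over limit: {entry["over_limit"]}')
```

Note that none of the idle-check entries in this excerpt refer to the `ec2-fleet-windows-large` node that was removed.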

Environment Details

Plugin Version?
3.2.0 and 2.7.0 (tried both versions)

Jenkins Version?
2.440.1

Spot Fleet or ASG?
ASG

Label based fleet?
Yes

Linux or Windows?
Windows

EC2Fleet Configuration as Code

- eC2Fleet:
        addNodeOnlyIfRunning: false
        alwaysReconnect: false
        awsCredentialsId: "AWS_SECURITY_CREDENTIALS_DEV-INFRA"
        cloudStatusIntervalSec: 10
        computerConnector:
          sSHConnector:
            credentialsId: "windows-ssh-key"
            launchTimeoutSeconds: 60
            maxNumRetries: 10
            port: 22
            retryWaitTime: 15
            sshHostKeyVerificationStrategy: "nonVerifyingKeyVerificationStrategy"
        disableTaskResubmit: false
        fleet: "jenkins-build-node-windows-large - prod"
        fsRoot: "c:\\Jenkins"
        idleMinutes: 5
        initOnlineCheckIntervalSec: 15
        initOnlineTimeoutSec: 599
        labelString: "ec2-fleet-windows-large"
        maxSize: 15
        maxTotalUses: -1
        minSize: 0
        minSpareSize: 0
        name: "ec2-fleet-windows-large"
        noDelayProvision: true
        numExecutors: 1
        privateIpUsed: true
        region: "us-west-1"
        restrictUsage: true
        scaleExecutorsByWeight: false

chliviu commented Mar 22, 2024

We are experiencing the same behavior, on the same setup as described but with Linux.
