EC2 ASG agents are not assigned to Jenkins fleet tags - Error during fleet '<fleet_name>' stats update java.lang.NullPointerException #429

Open
tofanadrian3000 opened this issue Jan 11, 2024 · 11 comments

@tofanadrian3000

tofanadrian3000 commented Jan 11, 2024

Issue Details

Describe the bug
We have multiple Amazon EC2 Fleets. All of them were working fine until yesterday (January 10th, 2024). As of today, the EC2 fleet plugin scales up the ASGs behind the fleets and the ASG instances start, but they are not assigned to the fleets' tags. As a result, Jenkins no longer tries to connect them as agents. Because they never connect, the plugin keeps scaling the ASGs up until they reach maximum capacity, without any of the instances actually being used as agents.

To Reproduce

  1. Create an Amazon EC2 Fleet cloud by selecting any existing ASG as the "EC2 Fleet" and giving it any tag.
  2. The EC2 fleet plugin scales the ASG up whenever a build is waiting for a new agent with that tag.
  3. The ASG is scaled up.
  4. The new ASG instance is started.
  5. The new ASG instance is not assigned to the tag.

Logs
Jan 11, 2024 10:29:16 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun
Error during fleet 'g11n-lre-rus-asg' stats update
java.lang.NullPointerException

Jan 11, 2024 10:29:17 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun
Error during fleet 'lt-pc-ci-asg' stats update
java.lang.NullPointerException

Jan 11, 2024 10:29:17 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info
lt-infra-win-ci-asg [lt-infra-win-ci-asg] Set target capacity to '5'
Jan 11, 2024 10:29:17 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun
Error during fleet 'lt-infra-win-ci-asg' stats update
java.lang.NullPointerException

Jan 11, 2024 10:29:17 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun
Error during fleet 'win2022-ec2-fleet' stats update
java.lang.NullPointerException

Jan 11, 2024 10:29:17 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun
Error during fleet 'g11n-lre-chs-asg' stats update
java.lang.NullPointerException

Jan 11, 2024 10:29:18 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun
Error during fleet 'g11n-lre-kor-asg' stats update
java.lang.NullPointerException

Jan 11, 2024 10:29:18 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun
Error during fleet 'nv-lin-ci-asg' stats update
java.lang.NullPointerException

Jan 11, 2024 10:29:18 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun
Error during fleet 'tc2-lin-import-asg' stats update
java.lang.NullPointerException
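
For anyone triaging a longer log, here is a minimal Python sketch that counts how often each fleet hits this error. It assumes only the message text shown in the excerpt above; the function name `failing_fleets` and the embedded sample are illustrative, not part of the plugin.

```python
import re
from collections import Counter

# Two entries copied from the log excerpt above, used as sample input.
LOG = """\
Jan 11, 2024 10:29:16 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun
Error during fleet 'g11n-lre-rus-asg' stats update
java.lang.NullPointerException

Jan 11, 2024 10:29:17 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun
Error during fleet 'lt-pc-ci-asg' stats update
java.lang.NullPointerException
"""

# Matches the message text as it appears in the excerpt.
FLEET_ERROR = re.compile(r"Error during fleet '([^']+)' stats update")


def failing_fleets(log_text: str) -> Counter:
    """Count stats-update errors per fleet name (hypothetical helper)."""
    return Counter(FLEET_ERROR.findall(log_text))


counts = failing_fleets(LOG)
for fleet, n in counts.most_common():
    print(f"{fleet}: {n}")
```

If every configured fleet shows up with a similar count, as in the excerpt, that suggests the failure is not specific to a single fleet's configuration.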


Environment Details

Plugin Version?
3.2.0

Jenkins Version?
2.414.3

Spot Fleet or ASG?
ASG

Label based fleet?
No

Linux or Windows?
So far Windows, but it may happen on Linux as well (we haven't tested that yet, but I don't think the OS is relevant in this case).

EC2Fleet Configuration as Code
It's just a small part but:
<clouds>
<com.amazon.jenkins.ec2fleet.EC2FleetCloud plugin="ec2-fleet@3.2.0">
<actions/>
<name>ubuntu22-ec2-fleet</name>
<awsCredentialsId></awsCredentialsId>
<region>eu-central-1</region>
<endpoint></endpoint>
<fleet>ubuntu22tplv2asg_asg</fleet>
<fsRoot></fsRoot>
<computerConnector class="hudson.plugins.sshslaves.SSHConnector" plugin="[email protected]_43357ce4">
<port>22</port>
<credentialsId>jenkins-ubuntu22-asg-slaves-ssh-key</credentialsId>
<launchTimeoutSeconds>60</launchTimeoutSeconds>
<maxNumRetries>10</maxNumRetries>
<retryWaitTime>15</retryWaitTime>
<sshHostKeyVerificationStrategy class="hudson.plugins.sshslaves.verifiers.NonVerifyingKeyVerificationStrategy"/>
<tcpNoDelay>true</tcpNoDelay>
</computerConnector>
<privateIpUsed>true</privateIpUsed>
<alwaysReconnect>true</alwaysReconnect>
<labelString>ubuntu22-ec2-fleet</labelString>
<idleMinutes>5</idleMinutes>
<minSize>1</minSize>
<maxSize>5</maxSize>
<minSpareSize>0</minSpareSize>
<numExecutors>50</numExecutors>
<addNodeOnlyIfRunning>false</addNodeOnlyIfRunning>
<restrictUsage>true</restrictUsage>
<scaleExecutorsByWeight>false</scaleExecutorsByWeight>
<executorScaler class="com.amazon.jenkins.ec2fleet.EC2FleetCloud$NoScaler">
<numExecutors>50</numExecutors>
</executorScaler>
<initOnlineTimeoutSec>300</initOnlineTimeoutSec>
<cloudStatusIntervalSec>10</cloudStatusIntervalSec>
<maxTotalUses>1000</maxTotalUses>
<disableTaskResubmit>false</disableTaskResubmit>
<noDelayProvision>false</noDelayProvision>
</com.amazon.jenkins.ec2fleet.EC2FleetCloud>
<com.amazon.jenkins.ec2fleet.EC2FleetCloud plugin="ec2-fleet@3.2.0">
<name>lt-pc-ci-asg</name>
<awsCredentialsId></awsCredentialsId>
<region>eu-central-1</region>
<endpoint></endpoint>
<fleet>lt-pc-ci-v2-asg_asg</fleet>
<fsRoot>C:\jenkins</fsRoot>
<computerConnector class="hudson.plugins.sshslaves.SSHConnector" plugin="[email protected]_43357ce4">
<port>22</port>
<credentialsId>jenkins-agents-lrelrpauto-account</credentialsId>
<launchTimeoutSeconds>60</launchTimeoutSeconds>
<maxNumRetries>10</maxNumRetries>
<retryWaitTime>15</retryWaitTime>
<sshHostKeyVerificationStrategy class="hudson.plugins.sshslaves.verifiers.NonVerifyingKeyVerificationStrategy"/>
<tcpNoDelay>true</tcpNoDelay>
</computerConnector>
<privateIpUsed>true</privateIpUsed>
<alwaysReconnect>true</alwaysReconnect>
<labelString>lt-pc-ci-asg</labelString>
<idleMinutes>5</idleMinutes>
<minSize>0</minSize>
<maxSize>10</maxSize>
<minSpareSize>0</minSpareSize>
<numExecutors>1</numExecutors>
<addNodeOnlyIfRunning>false</addNodeOnlyIfRunning>
<restrictUsage>true</restrictUsage>
<scaleExecutorsByWeight>false</scaleExecutorsByWeight>
<executorScaler class="com.amazon.jenkins.ec2fleet.EC2FleetCloud$NoScaler">
<numExecutors>1</numExecutors>
</executorScaler>
<initOnlineTimeoutSec>300</initOnlineTimeoutSec>
<cloudStatusIntervalSec>10</cloudStatusIntervalSec>
<maxTotalUses>-1</maxTotalUses>
<disableTaskResubmit>false</disableTaskResubmit>
<noDelayProvision>false</noDelayProvision>
</com.amazon.jenkins.ec2fleet.EC2FleetCloud>
</clouds>
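
To compare several clouds at a glance, the fragment above can be summarized with a short script. This is a rough Python sketch using only the standard library; the embedded `CONFIG` is a trimmed copy of the fragment, and `summarize_clouds` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <clouds> fragment above; only a few fields are kept.
CONFIG = """\
<clouds>
  <com.amazon.jenkins.ec2fleet.EC2FleetCloud>
    <name>ubuntu22-ec2-fleet</name>
    <fleet>ubuntu22tplv2asg_asg</fleet>
    <labelString>ubuntu22-ec2-fleet</labelString>
    <minSize>1</minSize>
    <maxSize>5</maxSize>
  </com.amazon.jenkins.ec2fleet.EC2FleetCloud>
  <com.amazon.jenkins.ec2fleet.EC2FleetCloud>
    <name>lt-pc-ci-asg</name>
    <fleet>lt-pc-ci-v2-asg_asg</fleet>
    <labelString>lt-pc-ci-asg</labelString>
    <minSize>0</minSize>
    <maxSize>10</maxSize>
  </com.amazon.jenkins.ec2fleet.EC2FleetCloud>
</clouds>
"""

CLOUD_TAG = "com.amazon.jenkins.ec2fleet.EC2FleetCloud"


def summarize_clouds(xml_text: str) -> list:
    """Return one dict per EC2FleetCloud element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    summary = []
    for cloud in root.iter(CLOUD_TAG):
        summary.append({
            "name": cloud.findtext("name", default=""),
            "fleet": cloud.findtext("fleet", default=""),
            "label": cloud.findtext("labelString", default=""),
            "min_size": int(cloud.findtext("minSize", default="0")),
            "max_size": int(cloud.findtext("maxSize", default="0")),
        })
    return summary


clouds = summarize_clouds(CONFIG)
for c in clouds:
    print(f"{c['name']}: fleet={c['fleet']} label={c['label']} "
          f"size={c['min_size']}..{c['max_size']}")
```

The `max_size` value is the ceiling the ASG scales to when no agent ever connects, so it shows how far each fleet can run away in the situation described above.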

Anything else unique about your setup?
All the fleets (including the Windows ones) are configured to connect to the agents using ssh. I don't know if it's relevant in this case and it's not quite "unique" but maybe the information helps.

@ajax-koval-i

@tofanadrian3000 I have the same issue. After the cloud is initialized, my ASG launches, for example, one EC2 spot instance, but I cannot see this instance under Jenkins -> Manage Jenkins -> Nodes.

The instance does exist in AWS, though. Do you have the same problem?

@tofanadrian3000
Author

Yep, it seems similar indeed.

@tofanadrian3000
Author

tofanadrian3000 commented Jan 23, 2024

I've just tried again to use an ec2 fleet and it seems to be working fine again now. I haven't changed absolutely anything about it so I've no idea what happened.

@ajax-koval-i

> I've just tried again to use an ec2 fleet and it seems to be working fine again now. I haven't changed absolutely anything about it so I've no idea what happened.

Did you use an ASG or an EC2 Fleet?

@tofanadrian3000
Author

All my Jenkins clouds are created as "Amazon EC2 Fleet", and in the "EC2 Fleet" input field I select one of my AWS ASGs.

@ajax-koval-i

@tofanadrian3000 Did you use this cloud with freestyle projects, or with pipelines?

@tofanadrian3000
Author

Pipelines

@lukolszewski

Same issue. Does anyone have a workaround? This is making our Jenkins unusable. I tried recreating the fleets and restarting Jenkins, but nothing helps. We have a very similar config and the same situation. Our logs fill up with:

2024-03-15 12:09:44.074+0000 [id=60] INFO c.a.jenkins.ec2fleet.CloudNanny#doRun: Error during fleet 'XXXXXX' stats update
java.lang.NullPointerException
at com.amazon.jenkins.ec2fleet.EC2FleetCloud.updateByState(EC2FleetCloud.java:634)
at com.amazon.jenkins.ec2fleet.EC2FleetCloud.update(EC2FleetCloud.java:512)
at com.amazon.jenkins.ec2fleet.CloudNanny.doRun(CloudNanny.java:57)
at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:92)
at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:67)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)

@lukolszewski

A normal Jenkins restart doesn't make any difference. A hard restart (using systemctl) seems to make the problem go away (who knows for how long).

@murtaza64

We faced this issue and this is what turned out to be the solution for us:

We noticed that the CloudNanny errors in the log coincided with the time that we had deleted another EC2 Fleet Cloud via the Jenkins UI, but there were still three nodes from that cloud connected to the cluster. We manually deleted those three nodes from the Jenkins UI, and then the other EC2 Fleet cloud that was having issues started connecting agents again.
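
The check behind this workaround can be sketched in a few lines of Python. This models only the logic (find nodes whose owning cloud is no longer configured) and does not call any Jenkins API; the `Node` type, the instance names, and the cloud name `old-fleet` are hypothetical, and how the node inventory is obtained is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str        # node name as shown in the Jenkins UI (hypothetical values below)
    cloud_name: str  # name of the cloud that created the node


def orphaned_nodes(nodes, configured_clouds):
    """Return nodes whose owning cloud is no longer configured."""
    configured = set(configured_clouds)
    return [n for n in nodes if n.cloud_name not in configured]


# Hypothetical inventory: 'old-fleet' was deleted but two of its nodes remain.
nodes = [
    Node("i-aaa", "lt-pc-ci-asg"),
    Node("i-bbb", "old-fleet"),
    Node("i-ccc", "old-fleet"),
]
leftover = orphaned_nodes(nodes, ["lt-pc-ci-asg", "ubuntu22-ec2-fleet"])
for node in leftover:
    print(f"candidate for manual deletion: {node.name} (cloud {node.cloud_name})")
```

Whether leftover nodes are the actual cause of the NullPointerException is not confirmed anywhere in this thread; the sketch only reflects the correlation we observed.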

@mrsombre

mrsombre commented Aug 8, 2024

Today our fleets got stuck as described in this issue. The method @murtaza64 suggested helps: I manually deleted some nodes and Jenkins became operational again.
