Fleet with shutdownOnIdle and inbound launcher fails to scale out due to the preserved temporaryOffline state of stopped agents #633

Closed
psciborek opened this issue Feb 11, 2025 · 8 comments

@psciborek
Contributor

Jenkins and plugins versions report

Environment

Jenkins: 2.492.1
OS: Linux - 4.18.0-553.37.1.el8_10.x86_64
Java: 17.0.14 - Red Hat, Inc. (OpenJDK 64-Bit Server VM)

[...]
azure-credentials:343.vd80f9c4859df
azure-sdk:191.v53ec8913ee10
azure-vm-agents:1001.vf3448fe27897
[...]

What Operating System are you using (both controller, and any agents involved in the problem)?

Controller and Agents: Rocky Linux 8
Jenkins 2.492.1 with azure-vm-agents:1001.vf3448fe27897

Reproduction steps

  1. Create an Azure fleet with a JNLP launcher and configure it to shut down agents instead of deleting them.
  2. Run a test job to provision an agent.
  3. Wait until the agent stops.
  4. Run a test job to verify that the agent starts successfully.

Expected Results

Agent should be online.

Actual Results

Agent is offline:

Agent azure-fleet61c7b0
Offline because computer was idle; it will be relaunched when needed.
Agent is connected.

Status:

Agent: azure-fleet61c7b0, label: azure-label, mode: NORMAL, isOnline: false, isOffline: true, hasChannel: true, isTemporarilyOffline: true ,suspended: true, Working Executors: 0
Status check script
// Script Console check: print the connection state of every agent node.
for (node in jenkins.model.Jenkins.get().nodes) {
  def c = node.toComputer()
  if (c == null) continue // node has no Computer yet
  println "Agent: ${node.nodeName}, label: ${node.labelString}, mode: ${node.mode}, isOnline: ${c.isOnline()}, isOffline: ${c.isOffline()}, hasChannel: ${c.channel != null}, isTemporarilyOffline: ${c.isTemporarilyOffline()} ,suspended: ${!c.isAcceptingTasks()}, Working Executors: ${c.countBusy()}"
}

Logs

2025-02-11 12:25:05.931+0000 [id=8950]	INFO	c.m.azure.vmagent.AzureVMCloud#lambda$provision$1: Found existing node, starting VM azure-fleet61c7b0
2025-02-11 12:25:05.931+0000 [id=8950]	INFO	c.m.a.v.AzureVMManagementServiceDelegate#startVirtualMachine: Starting: azure-fleet61c7b0
2025-02-11 12:25:37.635+0000 [id=8956]	INFO	h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #14 from /10.1.2.3:35426 failed: null
2025-02-11 12:25:37.693+0000 [id=8957]	INFO	h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #15 from /10.4.0.24:35428
2025-02-11 12:25:58.046+0000 [id=8950]	INFO	c.m.azure.vmagent.AzureVMCloud#waitUntilJNLPNodeIsOnline: Azure Cloud: waitUntilOnline: for agent azure-fleet61c7b0

Anything else?

It looks like the agent hangs in waitUntilJNLPNodeIsOnline because its temporarily-offline status is preserved.

This may be related to a change in Jenkins 2.479.3: "Retain user-generated offline reason when agent connects or disconnects for technical reasons" (pull 9855, JENKINS-30101, JENKINS-30175).
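
To illustrate the suspected mechanism, here is a minimal, self-contained Java sketch. It is a toy model, not Jenkins code: `ToyComputer`, `waitUntilOnline`, and the other names are hypothetical stand-ins, and the `isOffline()` rule (temporarily offline, or no channel) is an assumption chosen to match the status output above, where hasChannel is true but isOnline is false.

```java
// Toy model only. ToyComputer is a hypothetical stand-in, not the Jenkins Computer class.
public class OfflineStateSketch {

    static final class ToyComputer {
        private boolean hasChannel;            // true once the inbound agent has connected
        private String temporaryOfflineCause;  // non-null means "temporarily offline"

        void connect() { hasChannel = true; }
        void disconnect() { hasChannel = false; }
        void setTemporaryOfflineCause(String cause) { temporaryOfflineCause = cause; }

        boolean isTemporarilyOffline() { return temporaryOfflineCause != null; }
        // Assumed rule: a node counts as offline if it is marked temporarily
        // offline OR it has no channel.
        boolean isOffline() { return isTemporarilyOffline() || !hasChannel; }
        boolean isOnline() { return !isOffline(); }
    }

    // Bounded stand-in for a "wait until online" loop: polls a fixed number of times.
    static boolean waitUntilOnline(ToyComputer computer, int maxChecks) {
        for (int i = 0; i < maxChecks; i++) {
            if (computer.isOnline()) {
                return true;
            }
        }
        return false;
    }

    // Idle shutdown marks the node temporarily offline and drops the channel.
    static ToyComputer stoppedAfterIdle() {
        ToyComputer computer = new ToyComputer();
        computer.connect();
        computer.setTemporaryOfflineCause("computer was idle");
        computer.disconnect();
        return computer;
    }

    static boolean restartWithoutClearing() {
        ToyComputer computer = stoppedAfterIdle();
        computer.connect(); // agent reconnects, but the offline cause is retained
        return waitUntilOnline(computer, 10);
    }

    static boolean restartWithClearing() {
        ToyComputer computer = stoppedAfterIdle();
        computer.setTemporaryOfflineCause(null); // clear the retained cause before waiting
        computer.connect();
        return waitUntilOnline(computer, 10);
    }

    public static void main(String[] args) {
        System.out.println("without clearing: " + restartWithoutClearing());
        System.out.println("with clearing: " + restartWithClearing());
    }
}
```

In this model the first call never sees the node online even though it has a channel, while the second does. That is consistent with the observed state, but it does not prove that the core change is the cause.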

Are you interested in contributing a fix?

No response

@psciborek
Contributor Author

The IP addresses in the logs are inconsistent due to my copy-paste error, but the logs are otherwise correct.

Additionally, clicking 'Bring this node back online' on the agent's status page resolves the issue.

Maybe something similar to the following could fix the issue:

diff --git a/src/main/java/com/microsoft/azure/vmagent/AzureVMCloud.java b/src/main/java/com/microsoft/azure/vmagent/AzureVMCloud.java
index 978fdf3..bf77c6b 100644
--- a/src/main/java/com/microsoft/azure/vmagent/AzureVMCloud.java
+++ b/src/main/java/com/microsoft/azure/vmagent/AzureVMCloud.java
@@ -694,6 +694,7 @@ public class AzureVMCloud extends Cloud {
                                                     if (agentNode.getAgentLaunchMethod().equalsIgnoreCase("SSH")) {
                                                         retrySshConnect(azureComputer);
                                                     } else { // Wait until node is online
+                                                        azureComputer.setTemporaryOfflineCause(null)
                                                         waitUntilJNLPNodeIsOnline(agentNode);
                                                     }
                                                     LOGGER.info(String.format("Remove suspended status for node: %s",

@timja
Member

timja commented Feb 11, 2025

Thanks. If you're able to test that change, that would be useful.

@psciborek
Contributor Author

I've manually tested the change, and it works for me (aside from the missing semicolon).

--- a/src/main/java/com/microsoft/azure/vmagent/AzureVMCloud.java
+++ b/src/main/java/com/microsoft/azure/vmagent/AzureVMCloud.java
@@ -694,6 +694,7 @@ public class AzureVMCloud extends Cloud {
                                                     if (agentNode.getAgentLaunchMethod().equalsIgnoreCase("SSH")) {
                                                         retrySshConnect(azureComputer);
                                                     } else { // Wait until node is online
+                                                        azureComputer.setTemporaryOfflineCause(null);
                                                         waitUntilJNLPNodeIsOnline(agentNode);
                                                     }
                                                     LOGGER.info(String.format("Remove suspended status for node: %s",

@timja
Member

timja commented Feb 12, 2025

Mind submitting a pull request?

@timja
Member

timja commented Feb 12, 2025

There's probably no need to do this separately for SSH and inbound agents, is there?

@psciborek
Contributor Author

Sure, I will create a PR tomorrow (and retest to verify the case with the SSH launcher).

@psciborek
Contributor Author

It seems this issue has become more complicated. Let me share the current status:

  • diff:
--- a/src/main/java/com/microsoft/azure/vmagent/AzureVMCloud.java
+++ b/src/main/java/com/microsoft/azure/vmagent/AzureVMCloud.java
@@ -691,6 +691,7 @@ public class AzureVMCloud extends Cloud {
                                                     getServiceDelegate().setVirtualMachineDetails(
                                                             agentNode, template);
                                                     Jenkins.get().addNode(agentNode);
+                                                    azureComputer.setTemporaryOfflineCause(null);
                                                     if (agentNode.getAgentLaunchMethod().equalsIgnoreCase("SSH")) {
                                                         retrySshConnect(azureComputer);
                                                     } else { // Wait until node is online
  • On initial agent creation with the SSH launcher, the VM is shut down before it connects to Jenkins.
Logs
Creating 1 nodes from template azure-fleet, currently have 0 VMs of this template, currently have 0 VMs in cloud
Feb 13, 2025 2:12:21 PM INFO com.microsoft.azure.vmagent.AzureVMCloud provision
1 planned node(s)
Feb 13, 2025 2:12:21 PM FINE com.microsoft.azure.vmagent.AzureVMNoDelayProvisionerStrategy
Planned 1 new nodes
[...]
Feb 13, 2025 2:13:02 PM FINE com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask
Finished Azure VM Agents Clean Task. 818 ms
Feb 13, 2025 2:13:03 PM FINE com.microsoft.azure.vmagent.AzureVMCloud
VM available: azure-fleet613dc0
Feb 13, 2025 2:13:03 PM INFO com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate parseResponse
Deployment response: 
  found agent azure-fleet613dc0
  OS type Linux
  number of executors 1
Feb 13, 2025 2:13:04 PM INFO com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate setVirtualMachineDetails
The Azure agent doesn't have a public IP. Will use the private IP
Feb 13, 2025 2:13:04 PM FINE com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate
Azure agent details:
nodeName=azure-fleet613dc0
adminUserName=test-ssh-key
shutdownOnIdle=true
retentionTimeInMin=0
labels=azure-label
Feb 13, 2025 2:13:04 PM FINE com.microsoft.azure.vmagent.AzureVMCloud
Adding agent azure-fleet613dc0 to Jenkins nodes
Feb 13, 2025 2:13:04 PM FINE com.microsoft.azure.vmagent.AzureVMAgent
Starting agent azure-fleet613dc0
Feb 13, 2025 2:13:04 PM INFO com.microsoft.azure.vmagent.AzureVMCloudRetensionStrategy start
Starting azureComputer azure-fleet613dc0
Feb 13, 2025 2:13:04 PM INFO com.microsoft.azure.vmagent.AzureVMAgent shutdown
Add suspended status for node azure-fleet613dc0
Feb 13, 2025 2:13:04 PM FINE com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher
launching agent azure-fleet613dc0
Feb 13, 2025 2:13:04 PM INFO com.microsoft.azure.vmagent.AzureVMAgent shutdown
Shutting down agent azure-fleet613dc0
Feb 13, 2025 2:13:04 PM INFO com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate shutdownVirtualMachine
shutdown called for azure-fleet613dc0
Feb 13, 2025 2:13:04 PM FINE com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate
Status PowerState/running
Feb 13, 2025 2:13:04 PM FINE com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher
Start connecting to SSH

The VM is stopped while the job that triggered the scale-out is still in the queue.

My settings for retention:

retentionStrategy:
  azureVMCloudRetentionStrategy:
    idleTerminationMinutes: 5
  • Most likely, there is another bug: a public IP is being used despite the usePrivateIP setting.

(For certain reasons, I would like the agent to have a public IP but use the internal network for Jenkins connections. Adding the public IP is implemented as a custom image feature.)

Fix? (untested)
--- a/src/main/java/com/microsoft/azure/vmagent/AzureVMManagementServiceDelegate.java
+++ b/src/main/java/com/microsoft/azure/vmagent/AzureVMManagementServiceDelegate.java
@@ -1169,9 +1169,9 @@ public final class AzureVMManagementServiceDelegate {
       String publicIPStr = "";
       String privateIP = vm.getPrimaryNetworkInterface().primaryPrivateIP();
       String fqdn;
-        if (publicIP == null) {
+        if (publicIP == null || template.getUsePrivateIP()) {
           fqdn = privateIP;
-            LOGGER.log(Level.INFO, "The Azure agent doesn't have a public IP. Will use the private IP");
+            LOGGER.log(Level.INFO, "The Azure agent doesn't have a public IP or usePrivateIP is set. Will use the private IP");
       } else {
           fqdn = publicIP.fqdn();
           publicIPStr = publicIP.ipAddress();
  • I'm not sure whether any of my changes have affected the retention strategy, especially the one in the "reuse existing nodes if available" code block (my tests were run without the private IP fix). I'm out of time today, so I'll try to figure this out tomorrow.
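
For reference, the address choice that the untested private IP diff above proposes can be sketched as standalone Java. The class, method, and parameter names and the sample addresses are hypothetical and not taken from the plugin. Only the condition (no public IP, or usePrivateIP set) mirrors the diff.

```java
// Hypothetical sketch of the proposed address selection; not plugin code.
public class AgentAddressSketch {

    /**
     * Returns the address the controller should use to reach the agent.
     * publicFqdn is null when the VM has no public IP.
     */
    static String chooseAddress(String publicFqdn, String privateIp, boolean usePrivateIp) {
        // Fall back to the private IP when there is no public address,
        // or when the template explicitly asks for the private network.
        if (publicFqdn == null || usePrivateIp) {
            return privateIp;
        }
        return publicFqdn;
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        System.out.println(chooseAddress("agent.example.com", "10.0.0.5", true));
        System.out.println(chooseAddress("agent.example.com", "10.0.0.5", false));
        System.out.println(chooseAddress(null, "10.0.0.5", false));
    }
}
```

With the extra condition, a VM that has a public IP but whose template sets usePrivateIP resolves to the private address, which is the behaviour described in the bullet above.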

@timja
Member

timja commented Feb 22, 2025

Fixed in #636

@timja closed this as completed Feb 22, 2025