
Jenkins cannot create slave agent in EKS 1.29 #1017

Open
KosShutenko opened this issue Feb 20, 2024 · 3 comments
Labels
bug Something isn't working

Comments

KosShutenko commented Feb 20, 2024

Describe the bug

I have an EKS cluster running version 1.29.
I've installed the Jenkins Helm chart (latest version) via FluxCD, changing only the Ingress section of the values.
After installing Jenkins I tested the Kubernetes connection, and it's OK.

But a test job with the default pipeline proposed by Jenkins cannot be executed.
I don't see any agent pods starting in the Jenkins namespace.
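
(A quick way to confirm this, assuming the chart's jenkins namespace:)

# Watch for agent pods while the job is waiting; none ever appear.
kubectl get pods -n jenkins --watch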

In the job's console output I see:

Started by user Jenkins Admin
[Pipeline] Start of Pipeline
[Pipeline] podTemplate
[Pipeline] {
[Pipeline] node
Still waiting to schedule task
‘test-job-1-fd6tk-tcgf3-w2tts’ is offline
ERROR: Failed to launch test-job-1-fd6tk-tcgf3-w2tts
java.io.IOException: Canceled
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:72)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
Caused: java.io.InterruptedIOException: timeout
	at okhttp3.internal.connection.RealCall.timeoutExit(RealCall.kt:398)
	at okhttp3.internal.connection.RealCall.callDone(RealCall.kt:360)
	at okhttp3.internal.connection.RealCall.noMoreExchanges$okhttp(RealCall.kt:325)
	at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:209)
	at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517)
Caused: java.io.IOException: timeout
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:504)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleCreate(OperationSupport.java:340)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:753)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:97)
	at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:42)
Caused: io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:129)
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:122)
	at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:44)
	at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:133)
	at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)
	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)

In the jenkins-0 pod (jenkins container) I see the following logs:

2024-02-20 07:23:31.605+0000 [id=104]	INFO	hudson.slaves.NodeProvisioner#update: test-job-1-fd6tk-tcgf3-w2tts provisioning successfully completed. We have now 2 computer(s)
2024-02-20 07:23:31.640+0000 [id=103]	INFO	o.c.j.p.k.pod.retention.Reaper#watchCloud: set up watcher on kubernetes
2024-02-20 07:26:36.021+0000 [id=103]	WARNING	o.c.j.p.k.KubernetesLauncher#launch: Kubernetes returned unhandled HTTP code -1 null
2024-02-20 07:26:36.129+0000 [id=103]	WARNING	o.c.j.p.k.KubernetesLauncher#launch: Error in provisioning; agent=KubernetesSlave name: test-job-1-fd6tk-tcgf3-w2tts, template=PodTemplate{id='8b8ce8ed-a266-4ae3-8795-2243187ec290', name='test-job_1-fd6tk-tcgf3', namespace='jenkins', label='test-job_1-fd6tk', annotations=[PodAnnotation{key='buildUrl', value='http://jenkins.jenkins.svc.cluster.local:8080/job/test-job/1/'}, PodAnnotation{key='runUrl', value='job/test-job/1/'}]}
java.io.IOException: Canceled
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:72)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
	at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
Caused: java.io.InterruptedIOException: timeout
	at okhttp3.internal.connection.RealCall.timeoutExit(RealCall.kt:398)
	at okhttp3.internal.connection.RealCall.callDone(RealCall.kt:360)
	at okhttp3.internal.connection.RealCall.noMoreExchanges$okhttp(RealCall.kt:325)
	at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:209)
	at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517)
Caused: java.io.IOException: timeout
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:504)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleCreate(OperationSupport.java:340)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:753)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:97)
	at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:42)
Caused: io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:129)
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:122)
	at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:44)
	at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:133)
	at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)
	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)

Version of Helm and Kubernetes

- Helm: v3.14.0
- Kubernetes: v1.27.2

Chart version

jenkins-5.0.13

What happened?

1. Install the Jenkins Helm chart on EKS 1.29
2. Create a test (Kubernetes) pipeline job
3. Check the logs (see the command sketch below)
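
(For step 3, something along these lines should show the controller log quoted above; the pod and container names follow the chart's default StatefulSet.)

# Tail the Jenkins controller logs (names per the chart defaults).
kubectl logs -f jenkins-0 -c jenkins -n jenkins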

What you expected to happen?

I have other Jenkins installations on GKE clusters running 1.26-1.27 and they work fine: Jenkins creates agent pods and executes pipelines.

How to reproduce it

controller:
  ingress:
    enabled: true
    apiVersion: "networking.k8s.io/v1"
    annotations:
      nginx.ingress.kubernetes.io/rewrite-target: /
      cert-manager.io/cluster-issuer: letsencrypt-dns-prod
      nginx.ingress.kubernetes.io/server-snippets: |
        location / {
          proxy_set_header Upgrade $http_upgrade;
          proxy_http_version 1.1;
          proxy_set_header X-Forwarded-Host $http_host;
          proxy_set_header X-Forwarded-Proto $scheme;
          proxy_set_header X-Forwarded-For $remote_addr;
          proxy_set_header Host $host;
          proxy_set_header Connection "upgrade";
          proxy_cache_bypass $http_upgrade;
        }
    hostName: jenkins.cloud.company.pro
    tls:
      - secretName: tls-secret-jenkins-cloud-company-pro
        hosts:
          - jenkins.cloud.company.pro
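
(For a plain-Helm reproduction outside FluxCD, saving the block above as values.yaml and installing the chart directly should be equivalent; the release name and namespace below are illustrative.)

# Hypothetical plain-Helm equivalent of the FluxCD install.
helm repo add jenkins https://charts.jenkins.io
helm repo update
helm install jenkins jenkins/jenkins \
  --namespace jenkins --create-namespace \
  --version 5.0.13 \
  -f values.yaml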

Anything else we need to know?

No response

KosShutenko added the bug label on Feb 20, 2024

jpriebe commented Aug 6, 2024

@KosShutenko - have you found a workaround for this? We upgraded EKS to 1.29 yesterday, and we are seeing the exact same errors.

Actually, here's a little more info: we were already running EKS 1.29, and Jenkins was working. But we updated our nodes from Amazon Linux 2 to Amazon Linux 2023, and now we are getting the same error you documented.


timja commented Aug 7, 2024

I would raise this with the kubernetes-plugin; it doesn't look related to the Helm chart.


jpriebe commented Aug 7, 2024

Quick update on this -- it turns out the cause was a cluster component we had added recently: the vertical pod autoscaler (https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler).

The VPA installs an admission webhook (a MutatingWebhookConfiguration). It was that webhook that was timing out, causing the pod creation API call to time out.

@KosShutenko - your problem may not be the VPA, but you might want to look at all your mutating webhooks:

kubectl get MutatingWebhookConfiguration -A
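
(If one of them looks suspicious, checking its timeout and failure policy can help; a sketch, where the webhook configuration name is illustrative -- use one from the listing above.)

# Print each webhook's name, timeoutSeconds, and failurePolicy.
kubectl get mutatingwebhookconfiguration vpa-webhook-config \
  -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.timeoutSeconds}{"\t"}{.failurePolicy}{"\n"}{end}'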

I would also suggest you look at your Kubernetes API server logs in CloudWatch for more clues. In my case, I found log entries like this:

Failed calling webhook, failing open vpa.k8s.io: failed calling webhook "vpa.k8s.io": failed to call webhook: Post "https://vpa-webhook.vpa.svc:443/?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

That led me to identify the VPA as the culprit.
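
(If control-plane logging is enabled on the cluster, a query along these lines should surface such entries; "my-cluster" is a placeholder, and the log group follows the usual /aws/eks/<cluster-name>/cluster convention.)

# Search the EKS API server logs for webhook failures.
aws logs filter-log-events \
  --log-group-name /aws/eks/my-cluster/cluster \
  --log-stream-name-prefix kube-apiserver \
  --filter-pattern '"failed calling webhook"' \
  --max-items 20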
