Job pod failed to start on GKE Autopilot with container hooks (kubernetes mode) #152
Comments
Hello! Thank you for filing an issue. The maintainers will triage your issue shortly. In the meantime, please take a look at the troubleshooting guide for bug reports. If this is a feature request, please review our contribution guidelines.
I also tried the same config with a GKE standard cluster and I'm running into actions/actions-runner-controller#3132.
Hey @knkarthik, I'm not sure that you are using the right service account. You should not use the service account of the controller, but rather the service account with the permissions you posted.
Thanks for the reply, and sorry for the confusion @nikola-jokic. I'm indeed using the service account with those permissions. The following is actually commented out in my values file, but in my post it was not. I've removed it from my original post now to make that clear.
Can you please monitor the cluster and run kubectl describe when the workflow pod is created?
@nikola-jokic I did some digging and unfortunately, the pod appears for < 1s and I'm not able to describe it. However, when I run
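For a pod that lives for under a second, the pod's events usually outlive the pod itself and can still be read after it is gone. A minimal sketch, assuming the runner scale set lives in a namespace named arc-runners (adjust both the namespace and the grep pattern to your setup):

    # Watch pods as the hook creates them
    kubectl get pods -n arc-runners -w

    # Events persist for a while after the pod is deleted; newest last
    kubectl get events -n arc-runners --sort-by=.lastTimestamp | grep -i workflow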
@knkarthik, not sure if it is just that, but I managed to pass resources for a GPU job with a ConfigMap very similar to yours, just removing the comments on the $job name line. I don't know if you added those only here, but it might be worth trying without them. Mine looks like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-templates
data:
  default.yaml: |
    ---
    apiVersion: v1
    kind: PodTemplate
    metadata:
      annotations:
        annotated-by: "extension"
      labels:
        labeled-by: "extension"
    spec:
      containers:
        - name: $job
          resources:
            limits:
              nvidia.com/gpu: "1"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
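For completeness, a template like the one above only takes effect if the hook is pointed at it. A sketch of how such a ConfigMap is typically mounted into the runner spec in the scale set values, with the file exposed via the ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE environment variable (paths and names here are illustrative, not taken from the files in this thread):

template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          # Tell the kubernetes-mode hook which pod template to use;
          # the path matches the volumeMount below
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-templates/default.yaml
        volumeMounts:
          - name: pod-templates
            mountPath: /home/runner/pod-templates
            readOnly: true
    volumes:
      - name: pod-templates
        configMap:
          name: pod-templates  # the ConfigMap shown above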
Checks
Controller Version
0.8.3
Deployment Method
Helm
Checks
To Reproduce
runner-scale-set-values.yaml
pod-template.yaml
rbac.yaml
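The attached files above were not inlined. For reference, a minimal sketch of the kind of Role the kubernetes-mode hook documentation calls for, to be bound to the runner pod's service account via a RoleBinding (resource names here are illustrative, not the actual rbac.yaml from this report):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: runner-pod-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["get", "create"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "create", "delete"]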
Describe the bug
I can see that a runner pod is created, but it fails to create the job pod with the message:
Error: pod failed to come online with error: Error: Pod gke-autopilot-4vvrh-runner-74czb-workflow is unhealthy with phase status Failed
Describe the expected behavior
I expected it to create a job pod.
Additional Context
It works if I don't try to customize the job pod, i.e. if I use a config like the one below. But I want to give more resources to the actual pod that runs the job, so I need to use pod templates to customize it.
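The referenced config was not included above. As an assumption, an uncustomized kubernetes-mode configuration in the scale set values usually looks roughly like this (the storage class shown is GKE's standard-rwo; adjust for your cluster):

containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "standard-rwo"
    resources:
      requests:
        storage: 1Gi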
Controller Logs
No errors, just regular logs. I can provide them if required.
Runner Pod Logs