Pods scheduled stuck in Pending state #150
Comments
Unfortunately, the Spark scheduler extender currently doesn't support launching client-mode applications on Kubernetes. It assumes that a driver will be launched in the cluster, which then proceeds to request executors.

That being said, I think your executor pods are failing to be scheduled before the extender is even consulted, as the message says. If you have fixed that by increasing your pod limit or killing existing pods, then I would expect your pods to still be stuck in Pending, but with a different message.

For your second question, I think it has to do with a network problem from the health probe into your container, because the message for the stuck pod indicates that kube-scheduler considered that pod and hence is operational.
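To make the client-mode point concrete, here is a minimal sketch that checks whether any pod in a listing is a Spark driver. It assumes the pods carry the `spark-role` label that Spark on Kubernetes applies to the pods it creates, and that the input is the `items` list from `kubectl get pods -o json`. The function name and the sample data are hypothetical, and this is not how the extender itself detects drivers.

```python
from typing import Iterable, Mapping


def has_in_cluster_driver(pods: Iterable[Mapping]) -> bool:
    """Return True if any pod is labeled as a Spark driver.

    Assumption: pods are dicts shaped like Kubernetes Pod objects and
    carry the `spark-role` label ("driver" or "executor").
    """
    for pod in pods:
        labels = pod.get("metadata", {}).get("labels") or {}
        if labels.get("spark-role") == "driver":
            return True
    return False


# Hypothetical client-mode listing: executors only, no driver pod.
client_mode_pods = [
    {"metadata": {"name": "exec-1", "labels": {"spark-role": "executor"}}},
]

# Hypothetical cluster-mode listing: a driver pod plus an executor.
cluster_mode_pods = [
    {"metadata": {"name": "driver", "labels": {"spark-role": "driver"}}},
    {"metadata": {"name": "exec-1", "labels": {"spark-role": "executor"}}},
]
```

In client mode the driver runs outside the cluster, so a check like this finds executor pods with no matching driver pod, which is the situation the extender does not expect.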
That nuance wasn't clear to me, but now that it is, I think I can work with this. Good to know, thanks.
This is actually how AWS Fargate works as a resource negotiator. Hardware is allocated on demand, always one node per pod. For example, say Spark requests resources for a new executor. This of course begets a request to Kubernetes for an executor pod. In the case of Fargate, this begets a request to allocate a new VM just in time for the lifetime of the executor, billed by the second. In 60-90 seconds (usually), Fargate returns a new VM with Kubernetes tooling pre-installed and configured, sized to the request plus some extra RAM for the kubelet. When running
I don't have a good response for this point. Within the VLAN containing the nodes there are currently no restrictions on cross-node communication. It seems the "connection refused" errors come from requests where the client and server are the same IP. This might be an oversight I can find by looking closer.
I'm attempting to run spark-thriftserver using this scheduler extender. If you're not familiar, spark-thriftserver runs in client mode (local driver, remote executors). The thrift server exposes a JDBC connection which receives queries and turns these into spark jobs.
The command to run this looks like:
spark-defaults.conf looks like:
So far, I've applied the extender.yaml file as-is without any modifications. This instantiates two new pods under the spark namespace both in Running state with names starting with "spark-scheduler-".
describe pod XXX yields some troubling information about them:

When I attempt to run the driver above (which launches properly), because spark.dynamicAllocation.minExecutors is set to 1, the driver immediately requests a single executor pod at startup. The pod itself remains in a Pending state indefinitely. describe pod XXX seems to suggest that no nodes satisfy the pod's scheduling criteria:

What I'm having trouble figuring out is:
instance-group labels, nor any custom labels. All the nodes accept the spark namespace. Sorry to ask, but I am struggling to find the proper steps to take to narrow down the issue.

If it helps, this is using AWS Fargate as the compute resources behind Kubernetes, but based on what I know so far that shouldn't be an issue.
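One way to narrow down a Pending pod is to read the scheduler's own verdict from the pod status rather than from the describe output. The sketch below assumes the pod has been fetched as JSON (for example with `kubectl get pod NAME -n spark -o json`) and parsed into a dict. It relies on the standard `PodScheduled` condition in `status.conditions`. The function name and the sample pod are hypothetical, and the sample message is a placeholder, not real scheduler output.

```python
from typing import Mapping, Optional


def unschedulable_message(pod: Mapping) -> Optional[str]:
    """Return the scheduler's message if the pod could not be scheduled.

    Looks for the PodScheduled condition with status "False" and returns
    its message; returns None if the pod is scheduled or has no such
    condition yet.
    """
    conditions = pod.get("status", {}).get("conditions") or []
    for cond in conditions:
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            return cond.get("message")
    return None


# Hypothetical pending pod; the message text is a placeholder.
pending_pod = {
    "status": {
        "phase": "Pending",
        "conditions": [
            {
                "type": "PodScheduled",
                "status": "False",
                "reason": "Unschedulable",
                "message": "<scheduler message goes here>",
            }
        ],
    }
}

print(unschedulable_message(pending_pod))
```

If the message comes from kube-scheduler's built-in checks rather than from the extender, the pod was rejected before the extender was consulted, which helps separate cluster capacity problems from extender problems.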