Improve operator stability #23

Open · adwk67 opened this issue Mar 23, 2022 · 1 comment

adwk67 commented Mar 23, 2022

As a user of the spark-k8s-operator, I want to be able to rely on the operator to discontinue reconciliation attempts and to clean up pods from finished jobs after a defined TTL.

  • define a retry limit for the job (init-job, driver and executor)
  • configure a TTL for cleaning up job and driver pods? (see the sketch after this list)
  • the controller stops responding for indeterminate reasons (as opposed to simply not reconciling)
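
As a rough illustration of the first two points, here is a minimal sketch (not the operator's actual code) of how a retry limit and cleanup TTL could be expressed on the submitted Job, assuming the operator builds Jobs from the k8s-openapi types; `build_job`, `retry_limit` and `ttl_seconds` are hypothetical names that would come from the custom resource once configurable:

```rust
// Hypothetical sketch, assuming the operator constructs Jobs via k8s-openapi.
use k8s_openapi::api::batch::v1::{Job, JobSpec};
use k8s_openapi::api::core::v1::PodTemplateSpec;

/// Build a Job with a configurable retry limit and cleanup TTL.
/// `retry_limit` and `ttl_seconds` are hypothetical parameters that
/// would be read from the custom resource.
fn build_job(template: PodTemplateSpec, retry_limit: i32, ttl_seconds: i32) -> Job {
    Job {
        spec: Some(JobSpec {
            // Stop retrying after `retry_limit` failures
            // (Kubernetes defaults to 6 when this is unset).
            backoff_limit: Some(retry_limit),
            // Let Kubernetes garbage-collect the finished Job and its
            // pods after this many seconds.
            ttl_seconds_after_finished: Some(ttl_seconds),
            template,
            ..JobSpec::default()
        }),
        ..Job::default()
    }
}
```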

adwk67 commented May 30, 2022

Current implementation:

  • The job times out after 600s - do we want to make this configurable?
  • The job is retried 6 times on failure (the Kubernetes default for backoff_limit) - do we want to make this configurable, or change it to restarts (restartPolicy is currently set to Never)?
  • There is no value set for active_deadline_seconds: should this be changed?
  • The ConfigMaps for the job, driver and executors are deleted when the parent resource is deleted, but not before (such as on completion): should this be changed? (see the sketch after this list)
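
On the last point, a minimal sketch of the likely mechanism, assuming each ConfigMap carries an ownerReference to the parent custom resource (which would explain why cleanup happens only on parent deletion); `owned_config_map` and `parent_ref` are hypothetical names:

```rust
// Hypothetical sketch, assuming owner-reference-based garbage collection.
use k8s_openapi::api::core::v1::ConfigMap;
use k8s_openapi::apimachinery::pkg::apis::meta::v1::{ObjectMeta, OwnerReference};

/// Create a ConfigMap whose lifetime is tied to the parent resource:
/// Kubernetes garbage collection removes it when the owner is deleted,
/// but nothing removes it on job completion. `parent_ref` could be
/// built with kube's `controller_owner_ref`.
fn owned_config_map(name: &str, parent_ref: OwnerReference) -> ConfigMap {
    ConfigMap {
        metadata: ObjectMeta {
            name: Some(name.to_owned()),
            owner_references: Some(vec![parent_ref]),
            ..ObjectMeta::default()
        },
        ..ConfigMap::default()
    }
}
```

Deleting the ConfigMaps on completion instead would mean issuing explicit delete calls from the reconcile loop rather than relying on owner-based garbage collection.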
