Improve operator stability #23

Open · adwk67 opened this issue Mar 23, 2022 · 1 comment

adwk67 commented Mar 23, 2022

As a user of the spark-k8s-operator, I want to be able to rely on the operator to discontinue reconciliation attempts and to clean up pods from finished jobs after a defined TTL.

  • define a retry limit for the job (init-job, driver and executor)
  • configure a TTL for cleaning up job and driver pods? (see the sketch after this list)
  • the controller stops responding for indeterminate reasons (as opposed to simply not reconciling)
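
As a rough illustration of the first two points, here is a minimal sketch (not the operator's actual code) of how a retry limit and cleanup TTL could be expressed on the submitted Job, assuming the operator builds Jobs from the k8s-openapi types; `build_job`, `retry_limit` and `ttl_seconds` are hypothetical names that would come from the custom resource once configurable:

```rust
// Hypothetical sketch, assuming the operator constructs Jobs via k8s-openapi.
use k8s_openapi::api::batch::v1::{Job, JobSpec};
use k8s_openapi::api::core::v1::PodTemplateSpec;

/// Build a Job with a configurable retry limit and cleanup TTL.
/// `retry_limit` and `ttl_seconds` are hypothetical parameters that
/// would be read from the custom resource.
fn build_job(template: PodTemplateSpec, retry_limit: i32, ttl_seconds: i32) -> Job {
    Job {
        spec: Some(JobSpec {
            // Stop retrying after `retry_limit` failures
            // (Kubernetes defaults to 6 when this is unset).
            backoff_limit: Some(retry_limit),
            // Let Kubernetes garbage-collect the finished Job and its
            // pods after this many seconds.
            ttl_seconds_after_finished: Some(ttl_seconds),
            template,
            ..JobSpec::default()
        }),
        ..Job::default()
    }
}
```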

adwk67 commented May 30, 2022

Current implementation:

  • The job times out after 600s - do we want to make this configurable?
  • The job is retried 6 times on failure (the Kubernetes default for backoff_limit) - do we want to make this configurable, or change it to restarts (restartPolicy is currently set to Never)?
  • There is no value set for active_deadline_seconds: should this be changed?
  • The ConfigMaps for the job, driver and executors are deleted when the parent resource is deleted, but not before (such as on completion): should this be changed? (see the sketch after this list)
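
On the last point, a minimal sketch of the likely mechanism, assuming each ConfigMap carries an ownerReference to the parent custom resource (which would explain why cleanup happens only on parent deletion); `owned_config_map` and `parent_ref` are hypothetical names:

```rust
// Hypothetical sketch, assuming owner-reference-based garbage collection.
use k8s_openapi::api::core::v1::ConfigMap;
use k8s_openapi::apimachinery::pkg::apis::meta::v1::{ObjectMeta, OwnerReference};

/// Create a ConfigMap whose lifetime is tied to the parent resource:
/// Kubernetes garbage collection removes it when the owner is deleted,
/// but nothing removes it on job completion. `parent_ref` could be
/// built with kube's `controller_owner_ref`.
fn owned_config_map(name: &str, parent_ref: OwnerReference) -> ConfigMap {
    ConfigMap {
        metadata: ObjectMeta {
            name: Some(name.to_owned()),
            owner_references: Some(vec![parent_ref]),
            ..ObjectMeta::default()
        },
        ..ConfigMap::default()
    }
}
```

Deleting the ConfigMaps on completion instead would mean issuing explicit delete calls from the reconcile loop rather than relying on owner-based garbage collection.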
