Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial controller implementation #5

Merged
merged 5 commits into from
Apr 6, 2023

Conversation

danielvegamyhre
Copy link
Contributor

@danielvegamyhre danielvegamyhre commented Apr 5, 2023

Changes included in this PR:

  • Initial controller implementation for a simplified JobSet API, which only includes the Jobs list and network configuration.
  • controller checks status of active jobs owned by JobSet, and deletes up any old failed jobs (later we can add failure policy for restarting all jobs etc.)
  • controller creates any jobs which do not exist yet or have failed

Testing:

  • Tested manually with JobSet for distributed PyTorch training job
  • Unit tests for util functions
  • I have a simple integration test ready to add in the next PR.

Note: I looked into adding unit tests for the JobSetReconciler methods themselves, but it seems I would need to possibly fake/mock a Manager to use for creating a unit-testable JobSetReconciler, and I noticed in Kueue there are no reconciler specific unit tests (see here), and the reconciler testing is in the integration tests, so I am planning on doing the same. @ahg-g let me know if you have any thoughts on this.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 5, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielvegamyhre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 5, 2023
@k8s-ci-robot k8s-ci-robot requested a review from ahg-g April 5, 2023 02:45
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 5, 2023
@danielvegamyhre danielvegamyhre force-pushed the init-controller branch 2 times, most recently from 5b28bb0 to 4c71f03 Compare April 5, 2023 03:43
Copy link
Contributor

@ahg-g ahg-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't go through everything, but I think there is enough comments to address already :)

Makefile Show resolved Hide resolved
api/v1alpha1/jobset_types.go Outdated Show resolved Hide resolved
go.sum Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
api/v1alpha1/jobset_types.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Show resolved Hide resolved
@danielvegamyhre danielvegamyhre force-pushed the init-controller branch 2 times, most recently from cfbfd55 to 727c08d Compare April 5, 2023 21:07
api/v1alpha1/jobset_types.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
api/v1alpha1/jobset_types.go Outdated Show resolved Hide resolved
api/v1alpha1/jobset_types.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/util.go Outdated Show resolved Hide resolved
pkg/controllers/util.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
Copy link
Contributor

@ahg-g ahg-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, a couple of minor comments, I guess next you will send the integration test PR, right?

pkg/controllers/jobset_controller.go Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Show resolved Hide resolved
@danielvegamyhre
Copy link
Contributor Author

danielvegamyhre commented Apr 6, 2023

@ahg-g Thanks for the review, and yes I'll submit a PR for some basic integration tests next, then configure a presubmit to run the integration tests. Let me know if the PR looks good now and I'll squash the commits.

@ahg-g
Copy link
Contributor

ahg-g commented Apr 6, 2023

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Apr 6, 2023
@ahg-g
Copy link
Contributor

ahg-g commented Apr 6, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 6, 2023
@k8s-ci-robot k8s-ci-robot merged commit 08af1af into kubernetes-sigs:main Apr 6, 2023
@danielvegamyhre danielvegamyhre deleted the init-controller branch April 6, 2023 21:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants