
docs: Add a KEP for a Kubernetes native API for pipelines #11551

Open. mprahl wants to merge 2 commits into master from add-kube-api-design.

Conversation

@mprahl (Contributor) commented Jan 22, 2025:

Description of your changes:

This proposes a Kubernetes native API for pipelines and pipeline versions. It uses the KEP format in anticipation of #11535.

Here is a demo of a prototype with this design.

@mprahl (Contributor, Author) commented Jan 22, 2025:

/cc @chensun @zijianjoy

Could you please review this when you get a chance?

google-oss-prow bot requested a review from chensun on January 22, 2025.

> #### Caveats
>
> Migrations from storing the pipelines and pipeline versions in the database to the Kubernetes API will lead to the same
@mprahl (Contributor, Author):

Would it be worth adding an annotation on the pipeline and pipeline version to record the old UID so that the UI will still map to the proper pipeline version? My concern is that the UID wouldn't really be a unique identifier anymore.
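If that route were taken, a minimal sketch of such an annotation might look like this (the annotation key and UID value are hypothetical):

```yaml
apiVersion: pipelines.kubeflow.org/v2beta1
kind: PipelineVersion
metadata:
  name: my-pipeline-v1
  namespace: kubeflow
  annotations:
    # Hypothetical key recording the pre-migration database UID so the UI
    # could still resolve references created before the migration.
    pipelines.kubeflow.org/legacy-uid: "0b5cbf42-6f0b-4d0f-8f3a-1c2d3e4f5a6b"
```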


> ### Non-Goals
>
> 1. Kubeflow pipeline runs will not be included in the Kubernetes native API at this time. It is a desire to eventually
@andreyvelich (Member):

Do we see any tradeoffs to keeping a different approach for Pipelines and Runs?

@mprahl (Contributor, Author):

My initial thought on a Kubernetes native API for runs would be an ephemeral object (with a configurable TTL) that proxies the request to the REST API. The ephemeral object would have some status for convenience, along with a URL to the REST API or dashboard to see more details on the run. This would allow the convenience of starting a run from a Kubernetes API without cluttering Kubernetes' etcd with 100k+ objects.

So if we assume that approach, then runs would require a new controller, which the proposal for pipelines and pipeline versions here does not. So I think the tradeoff is a higher maintenance burden and a more complex architecture, but a better user experience for the pipelines and pipeline versions.
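To make the idea concrete, a hypothetical ephemeral run object might look like the following; the kind, field names, and URL are all invented for illustration and are not part of this proposal:

```yaml
apiVersion: pipelines.kubeflow.org/v2beta1
kind: Run                           # hypothetical kind
metadata:
  name: train-model-run-1
  namespace: user-namespace
spec:
  pipelineVersionRef:
    name: my-pipeline-v1            # hypothetical reference to a PipelineVersion
  ttlSecondsAfterCompletion: 3600   # configurable TTL before the object is cleaned up
status:
  phase: Running                    # convenience status proxied from the REST API
  detailsURL: https://kfp.example.com/#/runs/details/abc123   # full run details
```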

@andreyvelich (Member):

But when a user creates a run, don't we create a dedicated Argo Workflow, which still clutters etcd?

@mprahl (Contributor, Author):

@andreyvelich I agree that it's a problem. This issue is supposed to help solve that specific regression in v2: #10899

Kubernetes native API approach.
1. The Kubeflow Pipelines Python SDK is enhanced to allow the output format to be a Kubernetes native API manifest.

### Non-Goals
A Member suggested that experiments could also be a candidate for this approach.

@mprahl (Contributor, Author):

Good point. I think experiments would also be a good candidate.

Do you think it's best to include them in this proposal or in a future proposal that would also account for runs? From an implementation point of view, experiments fit in nicely with this proposal, but logically they group runs, so it may make sense to address them when the user experience of runs through a Kubernetes API is proposed.

I'd also like your opinion on this, @HumairAK.

Contributor:

Due to the similarities between the runs and experiments APIs, I'd leave experiments for the future proposal. Since the focus of this first proposal is not runs, we may not have enough awareness of the APIs and non-functional requirements, which could lead to an experiments API that is not compatible with or similar to the runs API.

Member:

It would be nice if we could design APIs that are future-proof, meaning that if we introduce CRDs for runs and experiments in the future, we won't need to make breaking changes to the Pipeline CRD.

@mprahl (Contributor, Author):

Agreed. I don't expect the pipeline CRDs proposed here to need changes, since they follow the existing database schema.

> 1. This will not support the v1 API of Kubeflow pipelines. To continue using the v1 API, the user must continue using
>    the default database solution.
>
> ## Proposal
Member:

@kubeflow/wg-pipeline-leads Do we have any docs/proposals on why KFP initially decided to go with a custom DB approach instead of using etcd and Kubernetes CRDs? Maybe we can learn something from previous proposals: https://docs.google.com/document/d/1fHU29oScMEKPttDA1Th1ibImAKsFVVt2Ynr4ZME05i0/edit?tab=t.0#heading=h.j4rlx69s50vy, or @Bobgy / @Ark-kun can point us to the initial docs for the KFP design.

> Kubernetes API.
>
> In multiuser mode, the existing RBAC for the REST API performs a subject access review to see if the user has access to
> the `pipelines` resource in the `pipelines.kubeflow.org` API group in the Kubernetes namespace. This will be the same
@andreyvelich (Member):

How can we deal with multi-tenancy when multiple users are using the same Kubernetes namespace?

@mprahl (Contributor, Author):

I don't think this proposal changes that, because the REST API already uses Kubernetes RBAC for authorization, which is namespace scoped. In other words, the user must have access to the `pipelines.pipelines.kubeflow.org` resource in the namespace. The difference today is that this resource does not actually exist and is just a convenient way to avoid reimplementing RBAC in KFP. One of the proposed CRDs would add this resource.

If I didn't answer your question, could you please rephrase it or add an example?
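For reference, the existing check described above amounts to a SubjectAccessReview like the following (the user and namespace are placeholders):

```yaml
apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  user: alice@example.com          # placeholder user
  resourceAttributes:
    group: pipelines.kubeflow.org
    resource: pipelines
    verb: list
    namespace: user-namespace      # placeholder namespace
```

The API server answers allowed/denied purely from the RBAC bindings in that namespace, whether or not a `pipelines` resource actually exists there.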

@andreyvelich (Member) commented Feb 1, 2025:

I think my question was about the use case where multiple users create Pipelines in the same Kubernetes namespace. Kubernetes RBAC is namespace-scoped, which means we can't grant more fine-grained ACLs over resources. For example, a user can't call the Kubernetes API server to list only his/her personal Pipelines. To solve this, one needs to build attribute-based access control (ABAC) on top of Kubernetes RBAC using various tools (e.g. Istio, OPA).

So my question was: with this new CRD-based model, how do we view multi-tenancy and user isolation for Kubernetes resources in the same namespace?

cc @shravan-achar @bigsur0 @juliusvonkohout

@mprahl (Contributor, Author):

You can have more fine-grained RBAC if the Kubernetes role specifies `resourceNames`, but I don't think it's practical to do except for specialized cases. In general, I think separate namespaces are best for separate tenants.
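As a sketch of what `resourceNames`-based RBAC would look like (all names are hypothetical): note that `resourceNames` does not restrict the `list` and `watch` verbs, which is part of why this is impractical in the general case:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: alice-pipelines                 # hypothetical Role name
  namespace: team-namespace             # hypothetical shared namespace
rules:
  - apiGroups: ["pipelines.kubeflow.org"]
    resources: ["pipelines"]
    resourceNames: ["alice-pipeline"]   # only this named object is covered
    verbs: ["get", "update", "delete"]  # resourceNames cannot scope list/watch
```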

This does bring up the question of shared pipelines in multiuser mode. Right now, the namespace is stored as empty for those pipelines, which doesn't work for this proposal.

I see two options for shared pipelines in multiuser mode:

  1. Configure a special namespace for shared pipelines (e.g. kubeflow-shared-pipelines) and allow the user to override which namespace this is with a CLI argument to the pipeline server.
  2. Define duplicate cluster-scoped pipeline (e.g. SharedPipeline) and pipeline version (e.g. SharedPipelineVersion) CRDs for these shared pipelines.

I personally prefer option 1, but would like others' opinions.
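Under option 1, a shared pipeline would be an ordinary Pipeline object that simply lives in the designated namespace; the namespace name below is the example from the list above:

```yaml
apiVersion: pipelines.kubeflow.org/v2beta1
kind: Pipeline
metadata:
  name: org-wide-training               # hypothetical shared pipeline
  namespace: kubeflow-shared-pipelines
```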

> Here is an example `Pipeline` manifest:
>
> ```yaml
> apiVersion: pipelines.kubeflow.org/v2beta1
> ```
Member:

Do we really need to keep `pipelines.` in the API group? I don't see any conflicts with other Kubeflow projects if we just use `kubeflow.org/v2beta1`.

@mprahl (Contributor, Author):

That's a fair point. Is there concern about conflicts with other resources? I see the existing `scheduledworkflows.kubeflow.org` CRD does not use the `pipelines.` prefix, so there is prior art for not including it.

My main intent was to match what is done today for the Kubernetes RBAC restrictions that guard the REST API endpoints (a SubjectAccessReview on `pipelines.pipelines.kubeflow.org`). We wouldn't need the user to have any additional permissions when migrating to the Kubernetes API approach.

Member:

It is a good point that if we just keep the group as `kubeflow.org`, we might hit naming conflicts between projects. I know that Kubernetes SIGs use the project name in the group as well:
https://github.com/kubernetes-sigs/kueue/blob/main/apis/kueue/v1beta1/groupversion_info.go#L29
https://github.com/kubernetes-sigs/jobset/blob/main/api/jobset/v1alpha2/groupversion_info.go#L26

@kubeflow/wg-training-leads @Electronic-Waste @franciscojavierarceo @astefanutti @deepanker13 @saileshd1402 @seanlaii @helenxie-bit @varshaprasad96 should we rename our group to `trainer.kubeflow.org` for Kubeflow Trainer V2?


> ```yaml
> apiVersion: pipelines.kubeflow.org/v2beta1
> kind: Pipeline
> ```
Member:

It would be nice to see the full Go schema for Pipeline and PipelineVersion CRDs.

@mprahl (Contributor, Author):

I added the Go types in the last push in a new section. I hope this helps!

> The Kubernetes generated `metadata.uid` field is the pipeline UID that other Kubeflow Pipeline artifacts will refer to.
>
> Here is an example `PipelineVersion` manifest:
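The manifest itself is not captured here; a sketch consistent with the `Pipeline` example above might look like the following, where the `spec` fields are assumptions rather than the KEP's actual schema:

```yaml
apiVersion: pipelines.kubeflow.org/v2beta1
kind: PipelineVersion
metadata:
  name: my-pipeline-v2
  namespace: kubeflow
spec:
  pipelineName: my-pipeline   # assumed field tying the version to its parent Pipeline
  pipelineSpec: {}            # assumed field embedding the compiled pipeline definition
```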
Member:

Do we really need to have a PipelineVersion CRD if we can do version control via rollouts, similar to Deployment? See https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-back-to-a-previous-revision

@mprahl (Contributor, Author):

It's an interesting idea, but my understanding is that Kubernetes abstracts this with revisions stored as ReplicaSet objects. So we'd need at least one CRD that is user-defined and one that is an implementation detail.

I think it's likely a simpler user experience to try to match the REST API's current model.
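For context, this is the Deployment model being referenced: the user manages a single object, and each revision is kept as a controller-owned ReplicaSet, bounded by `revisionHistoryLimit`:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  revisionHistoryLimit: 10   # old revisions are retained as ReplicaSets
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: app
          image: nginx:1.27
```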

@mprahl force-pushed the add-kube-api-design branch from 30ee468 to e85771b on January 29, 2025.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zijianjoy for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mprahl force-pushed the add-kube-api-design branch from e85771b to a5e609e on January 29, 2025.
@mprahl (Contributor, Author) commented Jan 29, 2025:

@andreyvelich thanks so much for the review and the good comments! I'm sorry, but when I rebased, your comments no longer show up inline; I'm still paying attention to them.

@juliusvonkohout (Member) commented Jan 31, 2025:

I really like the idea. We need to get the pipeline definitions out of MinIO and MySQL. That will also help with multi-tenancy. ScheduledWorkflows / recurring runs also do not properly support it; they are half Kubernetes resource, half in the database.

@terrytangyuan (Member) left a comment:

Maybe I missed this somewhere, but I am curious: why not use the Argo Workflows CRDs directly, which already work well with ArgoCD?

@mprahl (Contributor, Author) commented Feb 3, 2025:

> Maybe I missed this somewhere, but I am curious: why not use the Argo Workflows CRDs directly, which already work well with ArgoCD?

@terrytangyuan here are the main reasons, in my mind, why this wasn't the chosen path:

  • The Kubeflow Pipelines API is supposed to be workflow/pipeline-engine agnostic. The intermediate representation of the YAML currently works for Argo Workflows, Tekton (deprecated), and, I believe, Google Vertex.
  • The compiled Argo Workflow produced by Kubeflow Pipelines is an implementation detail, with several tasks that are Kubeflow Pipelines specific, such as the driver pods or wrapping the user code with the launcher. Keeping it an implementation detail allows compilation bugs to be fixed without users needing to modify their pipeline definitions.
  • The compilation step provides a lot of abstraction to make artifact/output passing between steps simple, without the user having to understand Kubernetes.
  • The compiled Argo Workflow is currently unreadable to the average Kubeflow Pipelines user, so it would be very difficult to maintain or to review.
