The Data Science Pipelines Operator (DSPO) is an OpenShift Operator that is used to deploy single namespace scoped Data Science Pipeline stacks onto individual OCP namespaces.
- Data Science Pipelines Operator
- Table of Contents
- Overview
- Quickstart
- Pre-requisites
- Deploy the Operator via ODH
- Deploy the standalone Operator
- Deploy DSP instance
- DataSciencePipelinesApplication Component Overview
- Deploying Optional Components
- Using a DataSciencePipelinesApplication
- Using the Graphical UI
- Using the API
- Cleanup
- Cleanup ODH Installation
- Cleanup Standalone Installation
- Run tests
- Metrics
- Configuring Log Levels for the Operator
- Deployment and Testing Guidelines for Developers
Data Science Pipelines (DSP) allows data scientists to track progress as they iterate over development of ML models. With DSP, a data scientist can create workflows for data preparation, model training, model validation, and more. They can create and track experiments to arrive at the best version of training data, model hyperparameters, model code, etc., and repeatably rerun these experiments.
Data Science Pipelines is based on the upstream Kubeflow Pipelines (KFP) project.
Data Scientists can use tools like the kfp SDK or Elyra to author their workflows, and interact with them in the ODH dashboard.
To get started you will first need to satisfy the following pre-requisites:
- An OpenShift cluster that is 4.11 or higher.
- You will need to be logged into this cluster as cluster admin via oc client.
- Based on which DSP version to install you will need to do the following:
- For DSPv1: The OpenShift Cluster must have OpenShift Pipelines 1.8 or higher installed. We recommend channel pipelines-1.8 on OCP 4.10 and pipelines-1.9 or pipelines-1.10 for OCP 4.11, 4.12 and 4.13. Instructions here.
- For DSPv2: The DSPO will install a namespace-scoped Argo Workflow Controller, so ensure the OpenShift Cluster does not have a cluster-scoped Argo Workflows installation already present.
- Based on installation type you will need one of the following:
- For Standalone method: You will need to have Kustomize version 4.5+ installed
- For ODH method: The Open Data Hub operator needs to be installed. You can install it via OperatorHub.
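The minimum versions listed above are easy to misjudge with a plain string comparison (for example, "4.9" sorts after "4.11" as text). The following is a minimal sketch of a numeric check; the helper names `parse_version` and `meets_minimum` are hypothetical and are not part of DSPO or the `oc` client.

```python
def parse_version(text):
    """Turn a version such as '4.11.3' or 'v4.5.7' into a tuple of ints."""
    core = text.strip().lstrip("v")
    # Drop pre-release or build suffixes such as '-rc.1' or '+build5'.
    core = core.split("-")[0].split("+")[0]
    return tuple(int(part) for part in core.split("."))


def meets_minimum(found, minimum):
    """Return True when `found` is numerically at least `minimum`."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so '4.11' and '4.11.0' compare equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    # Example inputs only; substitute the versions reported by your own tools.
    print(meets_minimum("4.12.3", "4.11"))  # OpenShift requirement
    print(meets_minimum("v4.5.7", "4.5"))   # Kustomize requirement
```

Comparing integer tuples rather than strings is what makes 4.9 rank below 4.11.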
Use this method to deploy DSPO using the OpenDataHub operator
On a cluster with ODH installed, create a namespace where you would like to install DSPO:
ODH_NS=opendatahub
oc new-project ${ODH_NS}
Then deploy the following DataScienceCluster into the namespace created above:
cat <<EOF | oc apply -f -
kind: DataScienceCluster
apiVersion: datasciencecluster.opendatahub.io/v1
metadata:
  name: data-science-cluster
  namespace: ${ODH_NS}
spec:
  components:
    dashboard:
      managementState: Managed
    datasciencepipelines:
      managementState: Managed
EOF
ℹ️ Note:
You can also deploy other ODH components using DataScienceCluster. See https://github.com/opendatahub-io/opendatahub-operator#example-datasciencecluster for more information.
Confirm the pods are successfully deployed and reach running state:
oc get pods -n ${ODH_NS}
Once all pods are ready, we can proceed to deploying the first Data Science Pipelines (DSP) instance. Instructions here.
You can use custom manifests from a branch or tag to run a different Data Science Pipelines Operator image. To do so, modify the IMAGES_DSPO config in config/base/params.env and push the changes to a branch or tag.
Create (or edit the existing) DataScienceCluster, adding devFlags.manifests with the URL of your branch or tag. For example, given the following repository and branch:
- Repository: https://github.com/a_user/data-science-pipelines-operator
- Branch: my_branch
The DataScienceCluster YAML should look like:
kind: DataScienceCluster
apiVersion: datasciencecluster.opendatahub.io/v1
metadata:
  name: data-science-cluster
  namespace: ${ODH_NS}
spec:
  components:
    dashboard:
      managementState: Managed
    datasciencepipelines:
      managementState: Managed
      devFlags:
        manifests:
          - uri: https://github.com/a_user/data-science-pipelines-operator/tarball/my_branch
            contextDir: config
            sourcePath: base
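The uri in the example above follows the pattern of repository URL, then /tarball/, then the branch or tag. As a minimal sketch, the devFlags fragment could be generated as shown here; `manifests_uri` and `dev_flags` are hypothetical helper names, and the contextDir and sourcePath values are simply copied from the example.

```python
def manifests_uri(repo_url, ref):
    """Build the tarball URI for a GitHub repository and a branch or tag."""
    base = repo_url.rstrip("/")
    if base.endswith(".git"):
        base = base[: -len(".git")]
    return f"{base}/tarball/{ref}"


def dev_flags(repo_url, ref):
    """Return the devFlags mapping in the shape used by the example YAML."""
    return {
        "manifests": [
            {
                "uri": manifests_uri(repo_url, ref),
                "contextDir": "config",  # copied from the example above
                "sourcePath": "base",    # copied from the example above
            }
        ]
    }


if __name__ == "__main__":
    repo = "https://github.com/a_user/data-science-pipelines-operator"
    print(manifests_uri(repo, "my_branch"))
```

Generating the URI from the two inputs avoids typos when switching between several development branches.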
Use this method to deploy DSPO without the Open Data Hub operator.
First clone this repository:
WORKING_DIR=$(mktemp -d)
git clone https://github.com/opendatahub-io/data-science-pipelines-operator.git ${WORKING_DIR}
If you already have the repository cloned, set WORKING_DIR to its absolute location instead.
DSPO can be installed in any namespace. We'll deploy it in the following namespace; you may update this environment variable accordingly.
ODH_NS=opendatahub
# Create the namespace if it doesn't already exist
oc new-project ${ODH_NS}
Now we will navigate to the DSPO manifests then build and deploy them to this namespace.
Run the following to deploy the full DSPO v2 stack.
cd ${WORKING_DIR}
make deploy OPERATOR_NS=${ODH_NS}
Confirm the pods are successfully deployed and reach running state:
oc get pods -n ${ODH_NS}
Once all pods are ready, we can proceed to deploying the first Data Science Pipelines (DSP) instance. Instructions here.
We'll deploy the first instance in the following namespace.
# You may update this value to the namespace you would like to use.
DSP_Namespace=test-ds-project-1
oc new-project ${DSP_Namespace}
DSPO introduces a new custom resource to your cluster called DataSciencePipelinesApplication. This resource is where you configure your DSP components, your DB and storage configurations, and so forth. For now, we'll use the provided sample; feel free to inspect this sample resource to see other configurable options.
cd ${WORKING_DIR}/config/samples/v2/dspa-simple
kustomize build . | oc -n ${DSP_Namespace} apply -f -
Note: the sample CR used here deploys a Minio instance so that DSP works out of the box. This is unsupported in production environments; we recommend providing your own object storage connection details via spec.objectStorage.externalStorage. See ${WORKING_DIR}/config/samples/v2/external-object-storage/dspa.yaml for an example.
Confirm all pods reach ready state by running:
oc get pods -n ${DSP_Namespace}
For instructions on how to use this DSP instance, see here.
You can use the DSPO to deploy multiple DataSciencePipelinesApplication instances in different OpenShift namespaces. For example, earlier we deployed a DataSciencePipelinesApplication resource named sample. We can use this again and deploy it to a different namespace:
DSP_Namespace_2=test-ds-project-2
oc new-project ${DSP_Namespace_2}
cd ${WORKING_DIR}/config/samples/v2/dspa-simple
kustomize build . | oc -n ${DSP_Namespace_2} apply -f -
Using DSPO you can specify custom credentials for Database and Object storage. If specifying external connections, this is required. You can also provide secrets for the built in MariaDB and Minio deployments. To see a sample configuration you can simply investigate and deploy the following path:
DSP_Namespace_3=test-ds-project-3
oc new-project ${DSP_Namespace_3}
cd ${WORKING_DIR}/config/samples/v2/custom-configs
kustomize build . | oc -n ${DSP_Namespace_3} apply -f -
Notice the introduction of 2 secrets, testdbsecret and teststoragesecret, and 2 configmaps, custom-ui-configmap and custom-artifact-script. The secrets allow you to provide your own credentials for the database and object storage connections.
These can be configured by the end user as needed.
To specify a custom object storage (for example, an AWS S3 bucket), you will need to provide DSPO with your S3 credentials in the form of a k8s Secret. See an example of such a secret in config/samples/v2/external-object-storage/storage-creds.yaml.
DSPO can deploy a DSPA instance and use this S3 bucket for storing its metadata and pipeline artifacts. A sample configuration for a DSPA that does this is found in config/samples/v2/external-object-storage. You can update this as needed and deploy this DSPA by running the following:
DSP_Namespace_4=test-ds-project-4
oc new-project ${DSP_Namespace_4}
cd ${WORKING_DIR}/config/samples/v2/external-object-storage
kustomize build . | oc -n ${DSP_Namespace_4} apply -f -
When a DataSciencePipelinesApplication is deployed, the following components are deployed in the target namespace:
- APIServer
- Persistence Agent
- Scheduled Workflow controller
If specified in the DataSciencePipelinesApplication resource, the following components may also be deployed:
- MariaDB
- Minio
- MLPipelines UI
- MLMD (ML Metadata)
To understand how these components interact with each other please refer to the upstream Kubeflow Pipelines Architectural Overview documentation.
To deploy a standalone MariaDB metadata database (rather than providing your own database connection details), simply add a mariaDB item under spec.database in your DSPA definition with a deploy key set to true. All other fields are defaultable/optional, see All Fields DSPA Example for full details. Note that this component is mutually exclusive with externally-provided databases (defined by spec.database.externalDB).
apiVersion: datasciencepipelinesapplications.opendatahub.io/v1
kind: DataSciencePipelinesApplication
metadata:
  name: sample
spec:
  ...
  database:
    mariaDB: # mutually exclusive with externalDB
      deploy: true
To deploy a Minio Object Storage component (rather than providing your own object storage connection details), simply add a minio item under spec.objectStorage in your DSPA definition with an image key set to a valid minio component container image. All other fields are defaultable/optional, see All Fields DSPA Example for full details. Note that this component is mutually exclusive with externally-provided object stores (defined by spec.objectStorage.externalStorage).
apiVersion: datasciencepipelinesapplications.opendatahub.io/v1
kind: DataSciencePipelinesApplication
metadata:
  name: sample
spec:
  ...
  objectStorage:
    minio: # mutually exclusive with externalStorage
      deploy: true
      # Image field is required
      image: 'quay.io/opendatahub/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance'
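The mutual-exclusivity rules described for the database and object storage sections can be checked on the client side before applying a DSPA. The following is a minimal sketch that assumes the spec has already been loaded into a Python dict; `check_dspa_spec` is a hypothetical helper and does not reproduce whatever validation the operator itself performs.

```python
def check_dspa_spec(spec):
    """Return a list of problems found in a DSPA spec given as a dict."""
    problems = []

    database = spec.get("database") or {}
    maria = database.get("mariaDB") or {}
    if maria.get("deploy") and "externalDB" in database:
        problems.append("spec.database: mariaDB and externalDB are mutually exclusive")

    storage = spec.get("objectStorage") or {}
    minio = storage.get("minio") or {}
    if minio.get("deploy") and "externalStorage" in storage:
        problems.append("spec.objectStorage: minio and externalStorage are mutually exclusive")
    # The text above states that the image field is required for Minio.
    if minio.get("deploy") and not minio.get("image"):
        problems.append("spec.objectStorage.minio.image is required")

    return problems


if __name__ == "__main__":
    spec = {
        "database": {"mariaDB": {"deploy": True}},
        "objectStorage": {"minio": {"deploy": True}},  # image left out on purpose
    }
    for problem in check_dspa_spec(spec):
        print(problem)
```

Returning a list rather than raising on the first problem lets all issues be reported in a single pass.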
To deploy the standalone DS Pipelines UI component, simply add a spec.mlpipelineUI item to your DSPA with an image key set to a valid UI component container image. All other fields are defaultable/optional, see All Fields DSPA Example for full details.
apiVersion: datasciencepipelinesapplications.opendatahub.io/v1
kind: DataSciencePipelinesApplication
metadata:
  name: sample
spec:
  ...
  mlpipelineUI:
    deploy: true
    # Image field is required
    image: 'quay.io/opendatahub/odh-ml-pipelines-frontend-container:beta-ui'
To deploy the ML Metadata artifact lineage/metadata component, simply add a spec.mlmd item to your DSPA with deploy set to true. All other fields are defaultable/optional, see All Fields DSPA Example for full details.
apiVersion: datasciencepipelinesapplications.opendatahub.io/v1
kind: DataSciencePipelinesApplication
metadata:
  name: sample
spec:
  ...
  mlmd:
    deploy: true
When a DataSciencePipelinesApplication is deployed, use the MLPipelines UI endpoint to interact with DSP, either via a GUI or via API calls.
You can retrieve this route by running the following in your terminal:
DSP_CR_NAME=sample
DSP_Namespace=test-ds-project-1
echo https://$(oc get routes -n ${DSP_Namespace} ds-pipeline-ui-${DSP_CR_NAME} --template={{.spec.host}})
Note: the UI presented below is the upstream Kubeflow Pipelines UI. It is not a supported feature of DSPO/ODH and will be replaced with the ODH Dashboard UI. Until then, this UI can be deployed via DSPO for experimentation/development purposes.
Navigate to the route retrieved in the last step. You will be presented with the MLPipelines UI. In this walkthrough we will upload a pipeline and start a run based off it.
To start, click the "Upload Pipeline" button.
Choose a file; you can use the flipcoin example. Download this example and select it for the first step. Then click "Create" for the second step.
Once done, you can use this Pipeline to create a Run by pressing "+ Create Run".
On this page, specify the Pipeline we just uploaded if it's not already auto-selected. Do the same for the version; with this example there will only be one version listed. Give this Run a name, or keep the default as is. Then click "Start".
Once you click "Start" you will be taken to the Runs page, where you can see runs that have been executed or are currently executing. You can click the most recently started Run to view its execution graph.
You should see something similar to the following once the Run completes.
Click the first "flip-coin" step. This step produces an output message, either "heads" or "tails". Confirm that you can see these logs after clicking this step and navigating to "Logs".
Note: By default we use kfp-tekton 1.5.x for this section, so you will need the kfp-tekton v1.5.x SDK installed in your environment.
In the previous step we submitted a generated Pipeline YAML via the GUI. We can also submit the Pipeline code directly, either locally or via a notebook.
You can find the Pipeline code example here. We can submit this to the DSP API Server by including this code in the following Python script:
cd ${WORKING_DIR}/docs/example_pipelines
touch execute_pipeline.py
We can utilize the flip coin example in this location and submit it directly to the API server by doing the following:
# Add this to execute_pipeline.py we created earlier
import os
import kfp_tekton
from condition import flipcoin_pipeline
token = os.getenv("OCP_AUTH_TOKEN")
route = os.getenv("DSP_ROUTE")
client = kfp_tekton.TektonClient(host=route, existing_token=token)
client.create_run_from_pipeline_func(pipeline_func=flipcoin_pipeline, arguments={})
Note: If you are in an unsecured cluster, you may encounter a CERTIFICATE_VERIFY_FAILED error. To work around this, you can pass the self-signed certs to the kfp-tekton client. For instance, if running inside a notebook or pod on the same cluster, you can do the following:
...
cert = "/run/secrets/kubernetes.io/serviceaccount/ca.crt"
client = kfp_tekton.TektonClient(host=route, existing_token=token, ssl_ca_cert=cert)
client.create_run_from_pipeline_func(pipeline_func=flipcoin_pipeline, arguments={})
Retrieve your token and DSP route:
# This is the namespace you deployed the DataSciencePipelinesApplication Custom Resource
DSP_Namespace=test-ds-project-1
# This is the metadata.name of that DataSciencePipelinesApplication Custom Resource
DSP_CR_NAME=sample
export DSP_ROUTE="https://$(oc get routes -n ${DSP_Namespace} ds-pipeline-ui-${DSP_CR_NAME} --template={{.spec.host}})"
export OCP_AUTH_TOKEN=$(oc whoami --show-token)
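The script reads both variables with os.getenv, which silently returns None when a variable is unset, and a failed oc lookup would leave DSP_ROUTE as a bare "https://". As a minimal sketch, a check like the following could run before the client is created; `load_dsp_settings` is a hypothetical helper and is not part of kfp-tekton.

```python
import os
from urllib.parse import urlparse


def load_dsp_settings(env=None):
    """Return (route, token) from the environment, or raise a clear error."""
    env = os.environ if env is None else env

    missing = [name for name in ("DSP_ROUTE", "OCP_AUTH_TOKEN") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))

    route = env["DSP_ROUTE"]
    parsed = urlparse(route)
    # An empty host usually means the route lookup printed nothing.
    if parsed.scheme != "https" or not parsed.netloc:
        raise RuntimeError(f"DSP_ROUTE does not look like an https URL: {route!r}")

    return route, env["OCP_AUTH_TOKEN"]


if __name__ == "__main__":
    try:
        route, _token = load_dsp_settings()
        print("Using DSP route", route)
    except RuntimeError as err:
        print("Configuration problem:", err)
```

The returned pair could then be passed as the host and existing_token arguments used in execute_pipeline.py.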
And finally execute this script and submit the flip coin example:
python execute_pipeline.py
You can navigate to the UI again and find your newly created run there, or you could amend the script above and list the runs via client.list_runs().
To remove a DataSciencePipelinesApplication from your cluster, run:
# Replace environment variables accordingly
oc delete DataSciencePipelinesApplication ${YOUR_DSPIPELINE_NAME} -n ${YOUR_DSPIPELINES_NAMESPACE}
The DSPO will clean up all manifests associated with each DataSciencePipelinesApplication instance.
Or you can remove all DataSciencePipelinesApplication instances in your whole cluster by running the following:
oc delete DataSciencePipelinesApplication --all -A
Depending on how you installed DSPO, follow the instructions below accordingly to remove the operator:
To uninstall DSPO via ODH run the following:
DSC_NAME=$(oc get DataScienceCluster -o jsonpath='{.items[0].metadata.name}')
ODH_NS=$(oc get DataScienceCluster -o jsonpath='{.items[0].metadata.namespace}')
oc delete datasciencecluster ${DSC_NAME} -n "${ODH_NS}"
To clean up standalone DSPO deployment:
# WORKING_DIR must be the root of this repository's clone
cd ${WORKING_DIR}
make undeploy OPERATOR_NS=${ODH_NS}
oc delete project ${ODH_NS}
See .github/scripts/tests/README.md (https://github.com/opendatahub-io/data-science-pipelines-operator/blob/main/.github/scripts/tests/README.md).
The Data Science Pipelines Operator exposes standard operator-sdk metrics for controller monitoring purposes.
In addition to these metrics, DSPO also exposes several custom metrics for monitoring the status of the DataSciencePipelinesApplications that it owns.
They are as follows:
- data_science_pipelines_application_apiserver_ready: Gauge that indicates if the DSPA's APIServer is in a Ready state (1 => Ready, 0 => Not Ready)
- data_science_pipelines_application_persistenceagent_ready: Gauge that indicates if the DSPA's PersistenceAgent is in a Ready state (1 => Ready, 0 => Not Ready)
- data_science_pipelines_application_scheduledworkflow_ready: Gauge that indicates if the DSPA's ScheduledWorkflow manager is in a Ready state (1 => Ready, 0 => Not Ready)
- data_science_pipelines_application_ready: Gauge that indicates if the DSPA is in a fully Ready state (1 => Ready, 0 => Not Ready)
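Because these gauges are exposed in the Prometheus text format, a scraped payload can be scanned for instances reporting 0. The following is a minimal sketch; the label names dspa_name and dspa_namespace in the sample are assumptions used for illustration, and `not_ready` is a hypothetical helper.

```python
import re

READY_METRIC = "data_science_pipelines_application_ready"

# One sample line looks like: metric_name{label="value",...} 1
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def not_ready(metrics_text, metric=READY_METRIC):
    """Return the label sets of every sample of `metric` whose value is 0."""
    found = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        match = SAMPLE_LINE.match(line)
        if not match or match.group("name") != metric:
            continue
        if float(match.group("value")) == 0:
            found.append(dict(LABEL_PAIR.findall(match.group("labels") or "")))
    return found


if __name__ == "__main__":
    # Hand-written sample payload, not real operator output.
    sample = "\n".join([
        "# TYPE data_science_pipelines_application_ready gauge",
        'data_science_pipelines_application_ready{dspa_name="sample",dspa_namespace="ns-a"} 1',
        'data_science_pipelines_application_ready{dspa_name="sample",dspa_namespace="ns-b"} 0',
    ])
    for labels in not_ready(sample):
        print("Not ready:", labels)
```

Passing one of the component metric names as `metric` narrows the check to the APIServer, PersistenceAgent, or ScheduledWorkflow gauge.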
By default, the operator's log messages are set to info severity.
If you wish to adjust the log verbosity, you can do so by modifying the ZAP_LOG_LEVEL parameter in the params.env file to your preferred severity level.
For a comprehensive list of available values, please consult the Zap documentation.
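As a minimal sketch of that edit, the following assumes params.env consists of plain KEY=VALUE lines; `set_param` is a hypothetical helper, and the image value in the sample text is a placeholder rather than the file's real content.

```python
def set_param(env_text, key, value):
    """Return env-file text with `key` set to `value`, appending it if absent."""
    lines = env_text.splitlines()
    for index, line in enumerate(lines):
        # Compare only the part before the first '=' so values may contain '='.
        if line.split("=", 1)[0].strip() == key:
            lines[index] = f"{key}={value}"
            break
    else:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder contents for illustration only.
    original = "IMAGES_DSPO=quay.io/example/dspo:tag\nZAP_LOG_LEVEL=info\n"
    print(set_param(original, "ZAP_LOG_LEVEL", "debug"), end="")
```

The helper only rewrites text; the updated params.env still has to be saved and the operator redeployed for the new level to take effect.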
To build the DSPO locally:
Log in with oc using the following command:
oc login
Install CRDs using the following command:
kubectl apply -k config/crd
Execute the following command:
go build main.go
To run the DSPO locally:
Execute the following command:
go run main.go --config=$config
Below is the sample config file (the tags for the images can be edited as required):
Images:
  ApiServer: quay.io/opendatahub/ds-pipelines-api-server:latest
  Artifact: quay.io/opendatahub/ds-pipelines-artifact-manager:latest
  OAuthProxy: registry.redhat.io/openshift4/ose-oauth-proxy:v4.12.0
  PersistenceAgent: quay.io/opendatahub/ds-pipelines-persistenceagent:latest
  ScheduledWorkflow: quay.io/opendatahub/ds-pipelines-scheduledworkflow:latest
  Cache: registry.access.redhat.com/ubi8/ubi-minimal
  MoveResultsImage: registry.access.redhat.com/ubi8/ubi-micro
  MariaDB: registry.redhat.io/rhel8/mariadb-103:1-188
  MlmdEnvoy: quay.io/opendatahub/ds-pipelines-metadata-envoy:latest
  MlmdGRPC: quay.io/opendatahub/ds-pipelines-metadata-grpc:latest
  MlmdWriter: quay.io/opendatahub/ds-pipelines-metadata-writer:latest
To build your own images:
All the component images are available here, and the third-party images here. Build these images from the repository root as shown in the example below:
podman build . -f backend/Dockerfile -t quay.io/your_repo/dsp-apiserver:sometag
To run the tests:
Execute make test, make unittest, or make functest, depending on the level of testing described below:
- make unittest runs only the unit tests, which verify that each unit of code (e.g., a function or method) behaves as expected. It is suggested to run this command often during the development process.
- make functest runs the functional tests, which assess the overall functionality of the software by testing its features and user interactions. They help ensure that the software works as a whole and that its features function correctly from a user's perspective. It is suggested to run make functest for every commit.
- make test can run unit tests, functional tests, and more. It checks that the software behaves correctly and meets the desired quality standards, and helps identify and fix issues early in the development process. It is suggested to run make test before creating a PR.
To deploy DSPO as a developer:
Follow the instructions from here to deploy the standalone operator.
Follow the instructions from here to deploy the operator via ODH.
How to deploy with a custom image:
Run the following command using the custom image:
make deploy IMG=my-registry/my-operator:v1
How to regenerate manifests:
After updating the Kubebuilder annotations in your code, run the following command to regenerate code and manifests:
make generate manifests
How to regenerate CRDs on API changes:
After making your API changes, run the following command to regenerate code and CRDs based on your updated API definitions:
make generate
Refer to kubebuilder docs here for more info.
How to run pre-commit tests:
Install pre-commit following the instructions here. Before creating a PR, developers should run the following command which will auto fix any simple errors:
pre-commit run --all-files
How to disable health checks when dev testing:
To disable the health checks, set the disableHealthCheck values to true in the DSPA YAML file you apply. Refer to this sample file here.
In certain scenarios, it may be necessary to disable health checks. When the DSPO runs locally or on a different cluster, the health checks can't reach the database and object store endpoints. They therefore keep failing, which prevents the DSPA from deploying essential pipeline infrastructure components. The disableHealthCheck mechanism was introduced to address this.
How to enable the KFP UI and Minio:
Refer to this sample YAML file for enabling the upstream Kubeflow Pipelines UI and Minio.
Refer to this repo to see examples of different pipelines for dev testing.
Refer to RELEASE.md for details on the release process