This document describes how to run the MNIST example on Kubeflow Pipelines on a Google Cloud Platform and on premise cluster.
This pipeline requires a Google Cloud Storage bucket to hold your trained model. You can create one with the following command
BUCKET_NAME=kubeflow-pipeline-demo-$(date +%s)
gsutil mb gs://$BUCKET_NAME/
Follow the Getting Started Guide to deploy a Kubeflow cluster to GKE
If you set up your cluster with IAP enabled as described in the GKE Getting Started guide,
you can now access the Kubeflow Pipelines UI at https://<deployment_name>.endpoints.<project>.cloud.goog/pipeline
If you opted to skip IAP, you can open a connection to the UI using kubectl port-forward and browsing to http://localhost:8085/pipeline
kubectl port-forward -n kubeflow $(kubectl get pods -n kubeflow --selector=service=ambassador \
-o jsonpath='{.items[0].metadata.name}') 8085:80
For on premise cluster, beside of Kubeflow deployment, you need to create a Persistent Volume (PV) and Persistent Volume Claims(PVC) to store trained result.
Note that the accessModes
of the PVC should be ReadWriteMany
so that the PVC can be mounted by containers of multiple steps in parallel.
Set up a virtual environment for your Kubeflow Pipelines work:
python3 -m venv $(pwd)/venv
source ./venv/bin/activate
Install the Kubeflow Pipelines sdk, along with other Python dependencies in the requirements.txt file
pip install -r requirements.txt --upgrade
Pipelines are written in Python, but they must be compiled into a domain-specific language (DSL) before they can be used.
For on premise cluster, update the platform
to onprem
in mnist_pipeline.py
.
sed -i.sedbak s"/platform = 'GCP'/platform = 'onprem'/" mnist_pipeline.py
Most pipelines are designed so that simply running the script will preform the compilation steps:
python3 mnist_pipeline.py
Running this command should produce a compiled * mnist_pipeline.py.tar.gz* file:
Additionally, you can compile manually using the dsl-compile script
python venv/bin/dsl-compile --py mnist_pipeline.py --output mnist_pipeline.py.tar.gz
Now that you have the compiled pipelines file, you can upload it through the Kubeflow Pipelines UI. Simply select the "Upload pipeline" button
Upload your file and give it a name
After clicking on the newly created pipeline, you should be presented with an overview of the pipeline graph. When you're ready, select the "Create Run" button to launch the pipeline
Fill out the information required for the run, and press "Start" when you are ready.
- GCP: Fill out the GCP
$BUCKET_ID
you created earlier, and ignore the optionpvc_name
. - On premise cluster: Fill out the
pvc_name
as name of the PVC you created earlier, and the PVC is mounted to '/mnt', so themodel-export-dir
can be/mnt/export
.
After clicking on the newly created Run, you should see the pipeline run through the 'train', 'serve', and 'web-ui' components. Click on any component to see its logs. When the pipeline is complete, look at the logs for the web-ui component to find the IP address created for the MNIST web interface
Now that we've run a pipeline, lets break down how it works
@dsl.pipeline(
name='MNIST',
description='A pipeline to train and serve the MNIST example.'
)
Pipelines are expected to include a @dsl.pipeline
decorator to provide metadata about the pipeline
def mnist_pipeline(model_export_dir='gs://your-bucket/export',
train_steps='200',
learning_rate='0.01',
batch_size='100'
pvc_name=''):
The pipeline is defined in the mnist_pipeline function. It includes a number of arguments, which are exposed in the Kubeflow Pipelines UI when creating a new Run.
Although passed as strings, these arguments are of type kfp.dsl.PipelineParam
train = dsl.ContainerOp(
name='train',
image='gcr.io/kubeflow-examples/mnist/model:v20190304-v0.2-176-g15d997b',
arguments=[
"/opt/model.py",
"--tf-export-dir", model_export_dir,
"--tf-train-steps", train_steps,
"--tf-batch-size", batch_size,
"--tf-learning-rate", learning_rate
]
)
This block defines the 'train' component. A component is made up of a kfp.dsl.ContainerOp
object with the container path and a name specified. The container image used is defined in the Dockerfile.model in the MNIST example
After defining the train component, we also set a number of environment variables for the training script
serve = dsl.ContainerOp(
name='serve',
image='gcr.io/ml-pipeline/ml-pipeline-kubeflow-deployer:\
7775692adf28d6f79098e76e839986c9ee55dd61',
arguments=[
'--model-export-path', model_export_dir,
'--server-name', "mnist-service"
]
)
The 'serve' component is slightly different than 'train'. While 'train' runs a single container and then exits, 'serve' runs a container that launches long-living
resources in the cluster. The ContainerOP takes two arguments: the path we exported our trained model to, and a server name. Using these, this pipeline component
creates a Kubeflow tf-serving
service within the cluster. This service lives after the
pipeline is complete, and can be seen using kubectl get all -n kubeflow
. The Dockerfile used to build this container can be found here.
The serve.after(train)
line specifies that this component is to run sequentially after 'train' is complete
web_ui = dsl.ContainerOp(
name='web-ui',
image='gcr.io/kubeflow-examples/mnist/deploy-service:latest',
arguments=[
'--image', 'gcr.io/kubeflow-examples/mnist/web-ui:\
v20190304-v0.2-176-g15d997b-pipelines',
'--name', 'web-ui',
'--container-port', '5000',
'--service-port', '80',
'--service-type', "LoadBalancer"
]
)
web_ui.after(serve)
Like 'serve', the web-ui component launches a service that exists after the pipeline is complete. Instead of launching a Kubeflow resource, the web-ui launches a standard Kubernetes Deployment/Service pair. The Dockerfile that builds the deployment image can be found here. This image is used to deploy the web UI, which was built from the Dockerfile found in the MNIST example
After this component is run, a new LoadBalancer is provisioned that gives external access to a 'web-ui' deployment launched in the cluster.
steps = [train, serve, web_ui]
for step in steps:
if platform == 'GCP':
step.apply(gcp.use_gcp_secret('user-gcp-sa'))
else:
step.apply(onprem.mount_pvc(pvc_name, 'local-storage', '/mnt'))
if __name__ == '__main__':
import kfp.compiler as compiler
compiler.Compiler().compile(mnist_pipeline, __file__ + '.tar.gz')
For each step, if run under GCP, it is run with access to 'user-gcp-sa' secret, which gives read/write access to GCS resources (during training) and access to the 'kubectl' command within the container (during serving).
If run on premise, it is run with access to pvc_name that is passed in as pipeline argument.
At the bottom of the script is a main function. This is used to compile the pipeline when the script is run