Pipeline example

This is an end-to-end example of a pipeline that uses Kubeflow Pipelines and PrivateKube to train an LSTM on a subset of the Amazon Reviews dataset.

Overview

We will:

deploy PrivateKube and initialize some privacy blocks
compile the DP preprocessing, training and evaluation components
build a DP pipeline with PrivateKube wrappers to allocate and consume some privacy budget
run the pipeline on Kubeflow

Requirements

You will need a Kubernetes cluster with Kubeflow 1.3 deployed on Google Cloud Platform. This guide from the Kubeflow documentation explains how to obtain such a setup.

More precisely, we will use Google Kubernetes Engine and Google Compute Engine (for the cluster), Google Cloud Storage (to store the artifacts) and Google Container Registry (to store the component images).

Step-by-step guide

Deploy PrivateKube

Once your Kubeflow cluster is ready and configured, follow the instructions to deploy PrivateKube.

Authenticate

Create a configuration file for your cluster and container registry:

cp client_secrets.example.yaml client_secrets.yaml
vim client_secrets.yaml

The Kubeflow client id and secret can be retrieved with these instructions.

Similarly, add a cluster role binding to authorize Kubeflow to manipulate the privacy resources:

cp user-role_secrets.example.yaml user-role_secrets.yaml
vim user-role_secrets.yaml
kubectl apply -f user-role_secrets.yaml

Create some blocks

Now that the cluster is configured, you can deploy some private data blocks. We prepared a public dataset with some preprocessed Amazon Reviews. Each block contains the data for 1 day for 1/100th of the users, in HDF5 format.

The following script will create some blocks automatically on your cluster:

python demo_blocks.py create --days 10 --users 1
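To make the block layout concrete, here is a minimal sketch of how `--days 10 --users 1` could map to a set of blocks, one per (day, user bucket) pair. The naming scheme `block-<day>-<bucket>`, the helper `enumerate_blocks` and the reading of `--users` as a number of user buckets are assumptions made for illustration; `demo_blocks.py` may name and build its blocks differently.

```python
from itertools import product
from typing import List


def enumerate_blocks(days: int, user_buckets: int) -> List[str]:
    """Return one hypothetical block name per (day, user bucket) pair."""
    if days < 0 or user_buckets < 0:
        raise ValueError("days and user_buckets must be non-negative")
    return [
        f"block-{day}-{bucket}"
        for day, bucket in product(range(days), range(user_buckets))
    ]


# Ten days and a single user bucket give ten blocks.
blocks = enumerate_blocks(days=10, user_buckets=1)
print(len(blocks), blocks[:3])
```

With this layout, the number of blocks grows as days times user buckets, so a small demo stays cheap to create and to clean up.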

Build the components

This step is optional, since we prebuilt the components used in the pipeline and published them on Docker Hub.

If you want to create your own pipeline, you should modify the Kubeflow components such as dp_evaluate or dp_train. These components are Docker images containing some machine learning functionality (e.g., preprocessing or neural network training with PyTorch), along with a YAML interface specifying the inputs and outputs for Kubeflow Pipelines.

You can build these components automatically with:

python build_components.py

You can also adapt the allocate and consume components to your needs (for instance, if you want to request blocks by user ID). The source for these components is here. Once you are done with the modifications, you can build a container image by hand or reuse our script:

python build_components.py ~/PrivateKube/privatekube/privatekube/kfp/components_src/claim

Write and run the pipeline

The run_pipeline.py script uses the Kubeflow Pipelines DSL to build a pipeline from the components.

Here is the corresponding graph:

The structure of the pipeline is similar to the one described in the OSDI paper:

We request some private data blocks
We download the corresponding data
We process this data and use it to train a DP machine learning model
We consume the privacy budget that we used
We publish the result of the computation
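The steps above can be sketched as a toy, in-memory simulation of the budget accounting. This is not the PrivateKube API or the code in run_pipeline.py: the `Block` class and the `allocate`, `consume` and `run_pipeline` helpers are hypothetical names, and the download, training and publishing steps are reduced to log entries.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Block:
    """Hypothetical model of a private data block and its privacy budget."""
    available: float        # budget that can still be reserved
    reserved: float = 0.0   # budget held by a pending claim
    consumed: float = 0.0   # budget spent for good


def allocate(blocks: Dict[str, Block], names: List[str], epsilon: float) -> bool:
    """Reserve epsilon on every requested block, or on none of them."""
    if any(blocks[name].available < epsilon for name in names):
        return False
    for name in names:
        blocks[name].available -= epsilon
        blocks[name].reserved += epsilon
    return True


def consume(blocks: Dict[str, Block], names: List[str], epsilon: float) -> None:
    """Turn the reserved budget into consumed budget."""
    for name in names:
        blocks[name].reserved -= epsilon
        blocks[name].consumed += epsilon


def run_pipeline(blocks: Dict[str, Block], names: List[str], epsilon: float) -> List[str]:
    """Walk through the five steps and return the ones that ran."""
    if not allocate(blocks, names, epsilon):
        return ["allocate: denied"]
    steps = ["allocate", "download", "train"]  # data steps are only logged here
    consume(blocks, names, epsilon)
    steps += ["consume", "publish"]
    return steps


blocks = {"block-0": Block(available=1.0), "block-1": Block(available=1.0)}
print(run_pipeline(blocks, ["block-0", "block-1"], epsilon=0.5))
print(run_pipeline(blocks, ["block-0", "block-1"], epsilon=0.5))
print(run_pipeline(blocks, ["block-0", "block-1"], epsilon=0.5))  # budget exhausted
```

The point of the ordering is that nothing touches the data before the allocation succeeds, and the result is published only after the budget has been consumed.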

Once the pipeline is ready, you can fill in the runtime arguments in arguments.yaml.

Then, the following command will compile the pipeline and send it to your cluster to be processed:

python run_pipeline.py

The first time you run this script, your terminal might prompt you to authenticate your connection to Kubeflow with your Google account.

Examine the results

Finally, you can head to your Kubeflow dashboard to check the result of the pipeline. The upload components write their outputs to the Google Cloud Storage bucket you specified.

You can also examine the scheduler logs and the privacy resources as explained in the main README.

Troubleshooting

Cleaning up

To delete all the blocks and all the claims:

kubectl delete pbc --all -n privacy-example

kubectl delete pb --all -n privacy-example

Adding GPU support

You can add some options to the components of the Kubeflow pipeline. For instance, to retry the training once on failure and run on one GPU with at most 6 CPUs, you can add:

    # Parentheses let the method chain span several lines.
    (
        dp_train_task.set_gpu_limit(1)
        .set_cpu_limit("6")
        .set_retry(1)
    )

Scheduler

To restart the scheduler, you can delete the pod and let the deployment spawn a fresh instance:

kubectl delete pod "$(kubectl get pods -n privatekube | grep scheduler | awk -F ' ' '{print $1}')" -n privatekube
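If you prefer to script the restart, the same idea can be sketched in Python. The helpers `scheduler_pods` and `restart_scheduler` are hypothetical names, and the pod names in the sample are made up. Unlike the `grep` above, this sketch matches "scheduler" in the pod name column only.

```python
import subprocess
from typing import List


def scheduler_pods(get_pods_output: str) -> List[str]:
    """Return pod names containing 'scheduler' from `kubectl get pods` output.

    The pod name is taken from the first whitespace-separated column,
    like the awk call in the shell one-liner.
    """
    names = []
    for line in get_pods_output.splitlines():
        fields = line.split()
        if fields and "scheduler" in fields[0]:
            names.append(fields[0])
    return names


def restart_scheduler(namespace: str = "privatekube") -> None:
    """Delete the scheduler pod(s) so the deployment spawns a fresh instance."""
    listing = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace],
        check=True, capture_output=True, text=True,
    ).stdout
    for pod in scheduler_pods(listing):
        subprocess.run(["kubectl", "delete", "pod", pod, "-n", namespace], check=True)


# Parsing check on a made-up listing; no cluster is needed for this part.
sample = """NAME                         READY   STATUS    RESTARTS   AGE
example-scheduler-abc12      1/1     Running   0          3d
example-other-pod-xyz89      1/1     Running   0          3d
"""
print(scheduler_pods(sample))
```

Call `restart_scheduler()` only on a machine where `kubectl` is configured for your cluster.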

To start the scheduler with new parameters (e.g., a lower value of N so that pipelines are allocated without a long wait), you can edit the deployment file and apply your changes with kubectl apply -f scheduler.yaml.

Caching

Kubeflow offers a cache thanks to ML Metadata. It is extremely useful to speed up similar components and to debug pipelines. However, this cache can interfere with PrivateKube's privacy claims: if the component and the inputs are identical, Kubeflow can reuse the same private data blocks without running the allocation component again, thus overspending privacy budget. To avoid this, we currently deactivate caching for privacy claims.
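The hazard can be illustrated with a toy simulation, which is not Kubeflow or PrivateKube code. It assumes that the allocation step is served from the cache on repeated runs while the training step runs again (for instance because one of its own inputs changed). The function `simulate` and the block name are hypothetical.

```python
from typing import Optional, Tuple


def simulate(runs: int, epsilon: float, use_cache: bool) -> Tuple[float, float]:
    """Return (budget accounted for, privacy loss incurred) on a single block."""
    accounted = 0.0
    incurred = 0.0
    cached_claim: Optional[str] = None
    for _ in range(runs):
        if use_cache and cached_claim is not None:
            claim = cached_claim      # cache hit: the allocation step is skipped
        else:
            claim = "block-0"         # the allocation step runs...
            accounted += epsilon      # ...and charges the block
            cached_claim = claim
        # Training reads the claimed block again and releases a new DP result.
        assert claim == "block-0"
        incurred += epsilon
    return accounted, incurred


print(simulate(2, 0.5, use_cache=True))   # (0.5, 1.0): loss exceeds the accounting
print(simulate(2, 0.5, use_cache=False))  # (1.0, 1.0): every run is accounted for
```

In the KFP v1 SDK, one common way to disable caching for a single task is to set `execution_options.caching_strategy.max_cache_staleness = "P0D"` on it; whether this example does it exactly that way is not stated here.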
