This repository showcases edge, a tool for deploying models to Vertex on Google Cloud Platform. We've also provided a reference example that shows how to train and deploy a simple model to GCP, and we show how to get up-and-running with everything you need to do MLOps right (in our opinion).
You can use this repo as a template for your projects. To get started, fork this repo and follow the instructions in this README to configure the tools for your GCP environment. By the end, you'll have an MLOps-ready environment with a simple model trained and deployed.
This is a new project and we're keen to get feedback from the community to help us improve it. Please do raise and discuss issues, send us pull requests, and don't forget to star and fork.
With this project we set out to address the following questions:
- How do we version data?
- How would two data scientists work collaboratively on a model?
- How do we track experiments?
- How do we set up a training pipeline in the cloud?
- How do we test the model?
- How do we serve the model?
- How do other software components interact with the model?
- How do we monitor the model on an ongoing basis?
This README covers:
- Concepts - the underlying MLOps concepts in this example.
- Installing on your GCP environment
- Training your first model in GCP
- Setting up CI/CD with CircleCI
Any productionised machine learning project will consist not only of models, but also of the other software components that are necessary to make those models useful. We will typically be building models alongside other pieces of software. Both need to be tracked, deployed, and monitored, but the approach taken for models differs somewhat from other kinds of software.
A machine learning model passes through a few stages of life. Let's look at those stages.
Imagine a team of data scientists starting a project from scratch. At this stage there are numerous unknowns, but we can still introduce some tools that will make life easier.
The data may not be well-understood, and it may be incomplete. It's important to have data version control from the very start, because:
- It's easier for a team to share data while ensuring that everybody is working with the same version of that data.
- It allows us to track changes over time.
- We can link every experiment and deployed model to a specific data version.
We use DVC to do data versioning. DVC has a number of other features, including pipelines, which we'll discuss next.
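To give a feel for what this looks like day-to-day, here's a minimal sketch of versioning a dataset with DVC. The paths are hypothetical, and in this repo the equivalent steps are wrapped up in scripts that we'll meet later:

```
# Track a dataset with DVC: this writes a small .dvc pointer file and
# adds the data itself to .gitignore.
dvc add data/fashion-mnist

# Commit the pointer file, not the data, to Git.
git add data/fashion-mnist.dvc data/.gitignore
git commit -m "Track Fashion MNIST dataset with DVC"

# Upload the data itself to the configured DVC remote (e.g. a GCS bucket).
dvc push
```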
Ultimately we want to train a model, so we'll need to write some code as well. Code versioning is just as important as data versioning, for exactly the same reasons as stated above.
We're using Git to track code versions. It's worth noting that DVC interoperates with Git, so this single code repository is enough to get somebody up-and-running with everything they will need in order to train the model.
Training a model involves a few steps. At the very least, we must prepare data and then run a training script. We use DVC to specify a training pipeline. Something to keep in mind: we're going to be talking about two different kinds of pipeline:
- Model training pipeline - a DVC pipeline which first prepares the training data, and then trains a model.
- CI/CD pipeline - using CircleCI we can combine training and deployment into a single pipeline.
This admittedly gets a little bit confusing, because 'pipeline' means two different things depending on the context. The CI/CD pipeline itself runs the model training pipeline.
Every run of the model training pipeline gets logged to a central location. Any time we run this pipeline, we call that an experiment. In any experiment, we record:
- When it ran, who ran it, and where it ran.
- The Git commit associated with the experiment.
- The data version associated with the experiment.
- The hyperparameters in use.
- The performance of the model.
This way, anybody on the team is able to review past experiments and reproduce them consistently.
We use Sacred with Omniboard for experiment tracking (MongoDB is used as the backing database). The vertex:edge tool will install the experiment tracker into your GCP environment so that you can log and review experiments.
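As a rough illustration of what that logging looks like in code, here's a minimal Sacred experiment with a MongoDB observer. The experiment name, MongoDB URL, and metric below are placeholders, not taken from this repo's training code:

```python
# Minimal sketch of experiment tracking with Sacred + MongoDB.
# Sacred automatically records when/who/where the experiment ran,
# the Git commit, and the config (hyperparameters) alongside anything we log.
from sacred import Experiment
from sacred.observers import MongoObserver

ex = Experiment("fashion-mnist")  # hypothetical experiment name
ex.observers.append(MongoObserver(url="mongodb://localhost:27017", db_name="sacred"))

@ex.config
def config():
    epochs = 5            # hyperparameters, captured automatically
    learning_rate = 1e-3

@ex.automain
def run(epochs, learning_rate):
    # ... train the model here ...
    accuracy = 0.9  # placeholder for the real evaluation metric
    ex.log_scalar("test.accuracy", accuracy)
    return accuracy
```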
When it comes to training our model, we want to use cloud-based resources. This gives us more computational power, but it also centralises training and prepares us for cloud-based deployment, which will come later.
Vertex can already train models in the cloud - that's what it does best! - but we also want it to work seamlessly with data versioning and experiment tracking.
The DVC model training pipeline has two steps, sketched below:
- Data preparation: generate a training and testing dataset.
- Train: execute a training script on Vertex and capture the resulting model.
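For illustration, a two-stage pipeline like this is described in a `dvc.yaml` file along the following lines. Stage names, scripts, and paths here are hypothetical rather than copied from `models/fashion/dvc.yaml`:

```yaml
# Illustrative dvc.yaml: DVC re-runs a stage only when its
# dependencies change, and tracks the declared outputs.
stages:
  prepare:
    cmd: python prepare.py      # generate training and testing datasets
    deps:
      - prepare.py
      - data/fashion-mnist
    outs:
      - data/prepared
  train:
    cmd: python train.py        # kicks off the training job on Vertex
    deps:
      - train.py
      - data/prepared
    outs:
      - model/model.pkl
```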
Once a model has been deployed, we'd ideally like to monitor it. The purpose of monitoring a model is to make us aware of changes in its behaviour, which can arise, for example, through shifts in the input data. Additionally, we want to know about potential biases in the training data.
In response to monitoring, we can make informed decisions. For instance, we might decide to re-train the model with new data if we feel that the model no longer reflects reality in some way.
We haven't implemented any monitoring so far in this project, but this is something we'd like to add in the future.
Finally, we want to deploy a model. We introduce CI/CD, using CircleCI, for this. The CircleCI pipeline invokes the model training pipeline, which in turn starts a training job on Vertex. It also pushes an experiment to experiment tracking, and a trained model to the Vertex model registry.
The model is deployed along with an endpoint, which exposes the model for online inference.
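To sketch what online inference against that endpoint can look like from client code (using the `google-cloud-aiplatform` library; the project, region, endpoint ID, and instance format below are placeholder assumptions):

```python
# Query a deployed Vertex AI endpoint for an online prediction.
from google.cloud import aiplatform

# Placeholders: substitute your own project and region.
aiplatform.init(project="my-gcp-project", location="europe-west2")

# The numeric endpoint ID is visible in the Vertex AI console.
endpoint = aiplatform.Endpoint("1234567890")

# The instance schema depends on the model's serving container;
# a flattened 28x28 greyscale image is just an example.
response = endpoint.predict(instances=[[0.0] * 784])
print(response.predictions)
```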
Here's a brief guide to how this project is organised:
- data - data used for our example model. This data comes from the Fashion MNIST dataset. We don't commit the data to Git; DVC manages the data.
- edge - code for the vertex:edge tool.
- models - each model has its own sub-directory under `models`, and within each model directory we have training code and the training pipeline.
- services - models by themselves aren't useful without things that interact with them. `services` contains deployable web services that interact with models.
Finally, we come to the vertex:edge tool (`edge.py`) itself, whose purpose is to simplify setting up a machine learning project on Google Cloud Platform from scratch.
It can:
- Run a configuration wizard and save the resulting config for future use.
- Initialise DVC in the repository.
- Set up all the necessary resources in GCP, namely:
  - Enable required Google Cloud APIs.
  - Create a Storage bucket for dataset and model storage.
  - Set up a Vertex AI Endpoint for model deployment.
  - Create a Kubernetes cluster and set up Sacred/Omniboard on it for experiment tracking.
- Build and push Docker images for a web app, and for model serving.
- Deploy a web app to Cloud Run.
- Deploy a trained model to Vertex AI.
Next we'll look at how to use this script to set up an MLOps-ready project in GCP.
We recommend forking this repository at this point. You'll be changing configuration files, and so you'll want your own repo so you can push the changes.
- Python 3
- pyenv
- gcloud command line tool
- helm - only needed if you're running the installation outside of Docker
- kubectl - only needed if you're running the installation outside of Docker
To make collaboration go smoothly, we really want to make sure that every developer can reproduce the same development environment, which means everybody uses the same versions of Python, and the same Python dependencies.
First, to manage Python versions, we'll use pyenv. Follow the installation instructions for your operating system; once installed, pyenv will download and make available the appropriate version of Python for you.
The Python version for this project is kept in `.python-version`. We can install and activate this version of Python by running:
```
pyenv install
```
Now if you run `python --version`, it will match what's in `.python-version`.
With the correct version of Python set up, we'll use Python venv to provide an isolated Python environment, and pip to install and manage Python dependencies.
```
python -m venv env/
source env/bin/activate
pip install -r requirements.txt
```
Now you'll need a GCP account, so sign up for one if you haven't already done so. Within your GCP account, create a new project, or you can use an existing project if you prefer.
If you haven't got the `gcloud` command line tool, install it now. Then authenticate by running:
```
gcloud auth login
gcloud auth application-default login
```
The configuration file `edge.yaml` contains all the information needed to set up tools and models in GCP. The `edge.yaml` that we've provided is only an example; you'll need to create your own.
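To give a flavour of the kind of information it holds, a configuration might look something like the sketch below. These keys are purely hypothetical; the wizard's output is the source of truth:

```yaml
# Hypothetical edge.yaml sketch; real keys are generated by the wizard.
gcp:
  project_id: my-gcp-project
  region: europe-west2
storage:
  bucket_name: my-gcp-project-vertex-edge
```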
To start the configuration wizard, run:
```
./edge.py config
```
This will ask you a series of questions and then overwrite `edge.yaml` with your new configuration.
If you're happy with the configuration, you're now ready to install all the things:
The easiest way to do this is using Docker. First, build the image:
```
docker build -t edge .
```
Then run:
```
./edge_docker.sh install
```
Alternatively, if you don't want to run it in Docker, simply use:
```
./edge.py install
```
If for any reason you need to undo the installation, you can uninstall with:
```
./edge.py uninstall
```
So, you've forked this repository, you've configured and installed all of the tools on GCP, and finally you're ready to train a model - yay!
The model that we're going to train is based on the Fashion MNIST dataset. Some important things to remember:
- We don't store the datasets in Git, so before you can train the model, you'll need to download the dataset and initialise DVC (data version control).
- All training is done on Vertex. We currently don't support local training, i.e. running the training script on your own computer.
We need to download the original dataset, which is simple enough. But we also want to set up data version control, and we want to ensure that the data is backed up to a central location.
Having run the installation script, you already have a Cloud Storage bucket which will serve as the central location. We'll run two commands:
```
./seed_data.sh
dvc push
```
The `seed_data.sh` script downloads the dataset and registers it with DVC. Then, `dvc push` will push that data up to the Cloud Storage bucket (similar to `git push`).
Start by making sure you are using the most recent data version:
```
dvc pull
```
We need to build a custom Docker image, which will be used for serving the model. The `edge` script takes care of building and pushing it to GCP:
```
./edge.py vertex build-docker
```
Earlier we mentioned that we use DVC to provide the model training pipeline. This pipeline will execute the model training step on Vertex:
```
dvc repro models/fashion/dvc.yaml
```
Once this has run, a custom training job will appear under https://console.cloud.google.com/vertex-ai/training/custom-jobs.
Each run of the training pipeline gets logged to the experiment tracker. To view experiments, you'll first need to get the dashboard URL:
```
./edge.py omniboard
```
If you visit this URL in a browser you will see the history of all experiments.
You can now deploy the trained model to Vertex:
```
./edge.py vertex deploy
```
Once this has run, the model will be available under https://console.cloud.google.com/vertex-ai/models
The demo web app is a simple web application that is intended to work with the fashion model that we trained before. There are two ways to run it: you can run it in Docker locally, or deploy it to GCP Cloud Run.
To run it locally in Docker:
```
./edge.py webapp run
```
To deploy it to Cloud Run, build and push the Docker image, then deploy:
```
./edge.py webapp build-docker
./edge.py webapp deploy
```
To set up CI/CD with CircleCI, you'll need to create a GCP service account for CircleCI to use.
This service account must have the following roles (an example of granting them follows the list):
- Vertex AI user
- Service Account User
- Cloud Run Admin
- Secret Manager Secret Accessor
- Storage Admin
- GKE admin
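For reference, roles like these can be granted with `gcloud`. Assuming a hypothetical service account `circleci@my-gcp-project.iam.gserviceaccount.com`, one such binding looks like this (repeat with the other role IDs, e.g. `roles/run.admin`, `roles/storage.admin`):

```
# Grant the Vertex AI User role to the CI service account.
gcloud projects add-iam-policy-binding my-gcp-project \
  --member="serviceAccount:circleci@my-gcp-project.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"
```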