This is the project description for the Machine Learning Operations course (summer semester 2024) at Ludwig Maximilian University. The overall goal of the project is to classify hate speech tweets. We will be working with the Transformers framework, a BERT model, and the TweetEval dataset. The focus of the project is to demonstrate how machine learning operations (MLOps) tools can be incorporated into such a workflow.
As we are confronted with a natural language processing (NLP) task, we decided to use the Transformers framework. It provides a multitude of pretrained models for tasks such as text classification. The framework is also backed by PyTorch, our deep learning library of choice. We plan to use the BERT (Bidirectional Encoder Representations from Transformers) model provided by Hugging Face.
The data we have chosen to work with is the TweetEval dataset, which contains several tweet datasets for different NLP tasks. We will be working with its Hate Speech Detection dataset, which consists of tweets and binary labels (0: not-hate, 1: hate). The paper that introduces this dataset also includes benchmarks for the best-performing models available at the time of publication.
Large language models currently achieve state-of-the-art results on most NLP tasks. In this project, we will use BERT, which is known for its strong performance on various NLP tasks, including text classification. BERT uses a transformer-based architecture and has been pretrained on a large corpus of text. We will fine-tune BERT for the hate speech detection task using the TweetEval dataset.
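To make the label scheme concrete, here is a minimal sketch that pairs tweets with readable label names. Only the 0/1 mapping comes from the dataset description above. The helper name `pair_tweets_with_labels` and the sample strings are hypothetical and are not part of this repository.

```python
# Label mapping as described for the TweetEval hate speech dataset.
ID2LABEL = {0: "not-hate", 1: "hate"}


def pair_tweets_with_labels(tweets, label_ids):
    """Pair each tweet with its readable label name (hypothetical helper)."""
    if len(tweets) != len(label_ids):
        raise ValueError("tweets and label_ids must have the same length")
    # int() lets the helper accept labels read from a text file as strings.
    return [(tweet, ID2LABEL[int(label_id)]) for tweet, label_id in zip(tweets, label_ids)]


# Made-up placeholder examples, not taken from the dataset.
examples = pair_tweets_with_labels(
    ["first placeholder tweet", "second placeholder tweet"],
    ["0", "1"],
)
```

Keeping the mapping in one dictionary means training, prediction, and API code can share the same label names.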
.
├── LICENSE
├── Makefile
├── README.md
├── api <- Has scripts to create a FastAPI for inference
│ ├── __init__.py
│ └── main.py
├── cloudbuild <- Directory for continuous integration with GCP
├── data <- Contains data and dvc file
│ ├── raw
│ └── raw.dvc
├── dockerfiles <- Contains dockerfiles for training, prediction, api
│ ├── hatespeech_base.dockerfile
│ ├── inference_api.dockerfile
│ ├── predict_model.dockerfile
│ └── train_model.dockerfile
├── docs
│ ├── README.md
│ ├── mkdocs.yaml
│ └── source
├── mlops_project <- Source code directory
│ ├── __init__.py
│ ├── data
│ │ └── make_dataset.py <- To get the data from the original source
│ ├── hate_speech_model.py <- Model
│ ├── checkpoints <- Contains the trained model weights (tracked with dvc)
│ ├── checkpoints.dvc
│ ├── predict_model.py <- Script for prediction with trained model weights
│ └── train_model.py <- Script for training
├── outputs
│ └── predictions <- Contains outputs from the predict_model.py script
├── pyproject.toml <- File for building environment
├── reports <- Contains answers to LMU MLOps lecture questions
│ ├── README.md
│ ├── figures
│ └── report.py
├── requirements.txt <- requirements for inference
├── environment.yaml <- file to recreate the conda env
├── requirements_dev.txt <- requirements for development
├── tests <- Contains unit tests and api load tests
│ ├── __init__.py
│ ├── api_performance_locustfile.py
│ ├── test_api.py
│ ├── test_data.py
│ ├── test_hate_speech_model.py
│ ├── test_predict_model.py
│ └── test_utils.py
└── utils <- Contains utility functions to be used in other scripts
  ├── __init__.py
  └── utils_functions.py
To create a conda environment with the requirements of this repository, use
make conda_environment
To get the dataset and trained model weights, use
dvc pull
Note: You need GCP bucket permissions to run this command.
Predictions from the predict_model.py script are saved to the outputs/predictions directory. To make a prediction, use
python mlops_project/predict_model.py \
--model_path=/your/model/path.pth \
--dataset_path=/your/data/path.txt
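The two flags above could be parsed as in the sketch below. This README does not show the argument handling of `predict_model.py`, so the parser here is an assumption for illustration, not the script's actual implementation. It also shows that `argparse` accepts both the `--flag=value` form used here and the `--flag value` form used in the Docker example further down.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the two flags shown in this README."""
    parser = argparse.ArgumentParser(description="Predict hate speech labels for tweets")
    parser.add_argument("--model_path", required=True, help="Path to trained model weights (.pth)")
    parser.add_argument("--dataset_path", required=True, help="Path to a text file with tweets")
    return parser


# Both spellings produce the same parsed arguments.
args_eq = build_parser().parse_args(
    ["--model_path=/your/model/path.pth", "--dataset_path=/your/data/path.txt"]
)
args_sp = build_parser().parse_args(
    ["--model_path", "/your/model/path.pth", "--dataset_path", "/your/data/path.txt"]
)
```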
To run the inference API locally, use
uvicorn --port 8000 api.main:app
To use the API served on Google Cloud Platform, use the following links
Welcome endpoint
https://hate-speech-detection-cloudrun-api-sjx4y77sda-ey.a.run.app
Prediction endpoint for a single tweet
https://hate-speech-detection-cloudrun-api-sjx4y77sda-ey.a.run.app/predict_labels_one_tweet?tweet=this is my twwetttt
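Tweets usually contain spaces and special characters, so the `tweet` query parameter should be percent-encoded when the endpoint is called from code. The sketch below builds the URL with the standard library. It assumes the endpoint accepts GET requests, as the link above suggests. The helper names are hypothetical, and the response is returned as raw text because its format is not documented here.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "https://hate-speech-detection-cloudrun-api-sjx4y77sda-ey.a.run.app"


def build_predict_url(tweet, base_url=BASE_URL):
    """Return the single-tweet prediction URL with the tweet percent-encoded."""
    # quote_via=quote encodes spaces as %20 instead of "+".
    query = urlencode({"tweet": tweet}, quote_via=quote)
    return f"{base_url}/predict_labels_one_tweet?{query}"


def predict_one_tweet(tweet, timeout=10.0):
    """Send a GET request (assumed method) and return the raw response body."""
    with urlopen(build_predict_url(tweet), timeout=timeout) as response:
        return response.read().decode("utf-8")


url = build_predict_url("this is my tweet")
```

Calling `predict_one_tweet("some text")` requires network access and a running service. For a local server started with uvicorn, pass `base_url="http://localhost:8000"` to `build_predict_url`.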
To train the model, specify a hyperparameter YAML file and use
python mlops_project/train_model.py --config=mlops_project/config/config-defaults-sweep.yaml
Build the base Docker image first, before building the training, prediction, or inference API images
docker build -f dockerfiles/hatespeech_base.dockerfile . -t hatespeech-base:latest
To build the Docker image for the inference API, use
docker build -f dockerfiles/inference_api.dockerfile . -t inference_api:latest
To build the docker image for prediction, use
docker build -f dockerfiles/predict_model.dockerfile . -t predict_model:latest
To build the docker image for training, use
docker build -f dockerfiles/train_model.dockerfile . -t train_model:latest
To run the Docker image for the inference API, use
docker run -p 8080:8080 -e PORT=8080 inference_api:latest
You can also run the predict_model Docker image by mounting your model weights, dataset, and output directory from your machine into the container
docker run -v /to/your/model/weight/path/best-checkpoint.pth:/container/models/best-checkpoint.pth \
-v /to/your/test_path/test_text.txt:/container/data/test_text.txt \
-v /to/your/outputs/predictions:/lmu-mlops-project/outputs/predictions \
predict_model:latest \
--model_path /container/models/best-checkpoint.pth \
--dataset_path /container/data/test_text.txt
To run the training Docker container, use
docker run -e WANDB_API_KEY=your_wandb_api_key \
train_model:latest --config=mlops_project/config/config-defaults.yaml
Unit tests for this repo can be found in the tests/ directory.
To run the API load test with Locust, use
locust -f tests/api_performance_locustfile.py