Project: Boilerplate ML Pipeline

Boilerplate ML Pipeline is a machine learning pipeline system designed for scalability, reproducibility, and flexibility across projects. This documentation uses BERT and the Rotten Tomatoes dataset as its example, but the design is modular, so developers can integrate other models or datasets into the pipeline with little effort.

Features

Modularity: Easily plug in different preprocessing steps, models, or datasets.
Scalability: Designed with scalability in mind, so you can switch from local experiments to cloud deployments.
Reproducibility: MLflow integration tracks every experiment, so any run can be reproduced later.
End-to-End Workflow: Data fetching, cleaning, preparation, training, and testing all live in one place.

Installation

Clone the repository:

git clone https://github.com/rmarquet21/boilerplate-ml-pipeline.git

Navigate to the project directory:

cd boilerplate-ml-pipeline

Install the required packages:

pip install -r requirements.txt

Quick Start

Set up MLflow tracking:

alfred run:server

Run the pipeline:

alfred run:pipeline

Visit http://localhost:5000 in your browser to view MLflow's UI and monitor the progress.
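
If the page does not load, a quick check from Python can tell you whether the tracking server is answering. The sketch below assumes the server started by `alfred run:server` is a standard MLflow tracking server, which exposes a `/health` endpoint. The function name `mlflow_server_is_up` is illustrative and not part of this repository.

```python
import urllib.error
import urllib.request


def mlflow_server_is_up(base_url: str = "http://localhost:5000", timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: treat as "not up".
        return False
```

Calling `mlflow_server_is_up()` before `alfred run:pipeline` is one way to fail early when the server has not been started.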

Directory Structure

.
├── alfred
│   └── run.py
├── poetry.lock
├── pyproject.toml
├── training_pipeline
│   ├── __init__.py
│   ├── run.py
│   ├── pipeline_context.py
│   └── steps
│       ├── __init__.py
│       ├── base_step.py
│       ├── clean_data_step.py
│       ├── fetch_data_step.py
│       ├── prepare_data_step.py
│       └── train_data_step.py

Integrating New Models or Datasets

Datasets: To integrate a new dataset, extend the FetchDataStep in fetch_data_step.py. Use the load_dataset method or any other preferred method to fetch your data.
Models: To work with a different model, extend the PrepareDataStep for data tokenization/preparation and the TrainDataStep for training the model.

Remember, the pipeline is built with modularity in mind. Each step works as an independent module, ensuring flexibility and scalability.
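
As a rough illustration of the step-based design described above, the sketch below chains two steps through a shared context. Everything in it is an assumption made for illustration: the real `BaseStep`, `PipelineContext`, and `FetchDataStep` in this repository may have different names, methods, and signatures, and the `run` method, the `data` dictionary, and the two example steps are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class PipelineContext:
    """Hypothetical shared state passed from one step to the next."""
    data: Dict[str, Any] = field(default_factory=dict)


class BaseStep(ABC):
    """Hypothetical step interface; the real base_step.py may differ."""

    @abstractmethod
    def run(self, context: PipelineContext) -> PipelineContext:
        ...


class InMemoryFetchStep(BaseStep):
    """Illustrative dataset step: supplies (text, label) rows held in memory."""

    def __init__(self, rows: List[Tuple[str, int]]) -> None:
        self.rows = rows

    def run(self, context: PipelineContext) -> PipelineContext:
        context.data["raw"] = list(self.rows)
        return context


class LowercaseCleanStep(BaseStep):
    """Illustrative cleaning step: trims and lowercases text, drops empty rows."""

    def run(self, context: PipelineContext) -> PipelineContext:
        context.data["clean"] = [
            (text.strip().lower(), label)
            for text, label in context.data["raw"]
            if text.strip()
        ]
        return context


def run_pipeline(steps: List[BaseStep], context: Optional[PipelineContext] = None) -> PipelineContext:
    """Run each step in order, handing the same context along."""
    context = context or PipelineContext()
    for step in steps:
        context = step.run(context)
    return context
```

Because each step only reads from and writes to the context, replacing `InMemoryFetchStep` with a step that loads a different dataset would not require changes to the steps that follow it.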

Logging and Monitoring with MLflow

The repository comes integrated with MLflow for experiment tracking. You can log metrics and parameters, and save model checkpoints as artifacts. The FetchDataStep demonstrates basic MLflow usage; extend it by logging more parameters, metrics, or artifacts.
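
A minimal sketch of logging extra parameters and metrics from a step might look like the following. It uses the standard MLflow calls `set_tracking_uri`, `start_run`, `log_params`, and `log_metrics`. The helper names `flatten_params` and `log_step_run`, the example config keys, and the default tracking URI are assumptions for illustration and are not taken from this repository's code.

```python
from typing import Any, Dict, Mapping


def flatten_params(config: Mapping[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten a nested config into dotted keys, since MLflow params are flat key/value pairs."""
    flat: Dict[str, Any] = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, Mapping):
            flat.update(flatten_params(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def log_step_run(
    run_name: str,
    config: Mapping[str, Any],
    metrics: Mapping[str, float],
    tracking_uri: str = "http://localhost:5000",  # assumed to match the Quick Start server
) -> None:
    """Log one run's parameters and metrics to the MLflow tracking server."""
    import mlflow  # imported here so flatten_params works without MLflow installed

    mlflow.set_tracking_uri(tracking_uri)
    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(flatten_params(config))
        mlflow.log_metrics(dict(metrics))
```

With the tracking server running, a call such as `log_step_run("fetch", {"dataset": {"name": "rotten_tomatoes", "split": "train"}}, {"num_rows": 100.0})` would record `dataset.name` and `dataset.split` as parameters; the values shown here are placeholders.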

Contributions

Contributions are always welcome. If you want to contribute, please:

Fork the project.
Create a new branch.
Commit your changes.
Push to the branch.
Open a pull request.

License

MIT License. See LICENSE for more information.
