ML Pipeline is a machine learning pipeline boilerplate focused on scalability, reproducibility, and flexibility across projects. This documentation uses BERT and the Rotten Tomatoes dataset as an example, but the boilerplate's design is modular, so developers can integrate other models or datasets into the pipeline.
- Modularity: Easily plug in different preprocessing, models, or datasets.
- Scalability: Designed to move from local experiments to cloud deployments.
- Reproducibility: MLflow integration tracks every experiment, so runs can be reproduced later.
- End-to-End Workflow: Data fetching, cleaning, preparation, training, and testing all live in one place.
Clone the repository:
git clone https://github.com/rmarquet21/boilerplate-ml-pipeline.git
Navigate to the project directory:
cd boilerplate-ml-pipeline
Install the required packages:
pip install -r requirements.txt
- Set up MLflow tracking:
alfred run:server
- Run the pipeline:
alfred run:pipeline
- Visit http://localhost:5000 in your browser to view MLflow's UI and monitor the progress.
.
├── alfred
│   └── run.py
├── poetry.lock
├── pyproject.toml
├── training_pipeline
│   ├── __init__.py
│   ├── run.py
│   ├── pipeline_context.py
│   └── steps
│       ├── __init__.py
│       ├── base_step.py
│       ├── clean_data_step.py
│       ├── fetch_data_step.py
│       ├── prepare_data_step.py
│       └── train_data_step.py
- Datasets: To integrate a new dataset, extend the FetchDataStep in fetch_data_step.py. Use the load_dataset method or any other preferred method to fetch your data.
- Models: To work with a different model, extend the PrepareDataStep for data tokenization/preparation and the TrainDataStep for training the model.
Remember, the pipeline is built with modularity in mind. Each step works as an independent module, ensuring flexibility and scalability.
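As a rough illustration of swapping in a different data source, here is a minimal sketch of a custom fetch step. The `BaseStep` and `PipelineContext` classes below are hypothetical stand-ins written for this example; the real classes in `training_pipeline/steps/base_step.py` and `training_pipeline/pipeline_context.py` may use different method names and attributes, so adapt the signatures to what the repository actually defines. `CsvFetchDataStep`, `StripWhitespaceStep`, and `run_pipeline` are likewise illustrative names, not part of the project.

```python
import csv
import io
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Optional, TextIO


@dataclass
class PipelineContext:
    """Hypothetical stand-in for the object shared between steps."""
    data: dict[str, Any] = field(default_factory=dict)


class BaseStep(ABC):
    """Hypothetical stand-in for the base class in steps/base_step.py."""

    @abstractmethod
    def run(self, context: PipelineContext) -> PipelineContext:
        ...


class CsvFetchDataStep(BaseStep):
    """Illustrative fetch step: reads text/label rows from a CSV source."""

    def __init__(self, source: TextIO) -> None:
        self.source = source

    def run(self, context: PipelineContext) -> PipelineContext:
        rows = csv.DictReader(self.source)
        context.data["raw"] = [
            {"text": row["text"], "label": int(row["label"])} for row in rows
        ]
        return context


class StripWhitespaceStep(BaseStep):
    """Illustrative cleaning step: trims text and drops empty rows."""

    def run(self, context: PipelineContext) -> PipelineContext:
        context.data["clean"] = [
            {**row, "text": row["text"].strip()}
            for row in context.data["raw"]
            if row["text"].strip()
        ]
        return context


def run_pipeline(
    steps: list[BaseStep], context: Optional[PipelineContext] = None
) -> PipelineContext:
    """Run each step in order, passing the shared context along."""
    if context is None:
        context = PipelineContext()
    for step in steps:
        context = step.run(context)
    return context


# Small in-memory sample so the sketch runs without any external files.
SAMPLE_CSV = "text,label\n  A moving film. ,1\nFlat and dull.,0\n   ,0\n"

if __name__ == "__main__":
    steps = [CsvFetchDataStep(io.StringIO(SAMPLE_CSV)), StripWhitespaceStep()]
    result = run_pipeline(steps)
    print(result.data["clean"])
```

Because each step only reads from and writes to the shared context, replacing the data source means replacing one class while the remaining steps stay unchanged.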
The repository comes integrated with MLflow for experiment tracking. Log metrics, parameters, and even save model checkpoints. The FetchDataStep demonstrates the basic usage of MLflow. Extend it by logging more parameters, metrics, or artifacts.
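One way to log additional dataset information is sketched below. The helper names `summarize_dataset` and `log_summary`, the experiment name, and the run name are assumptions made for this example and do not come from the repository. The MLflow calls used (`set_tracking_uri`, `set_experiment`, `start_run`, `log_params`, `log_metrics`) are part of MLflow's public tracking API, and the tracking URI assumes the server started by `alfred run:server` is listening on `http://localhost:5000`.

```python
from collections import Counter
from typing import Any


def summarize_dataset(rows: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Compute a few dataset-level parameters and metrics worth tracking."""
    total = len(rows)
    label_counts = Counter(row["label"] for row in rows)
    params = {"num_rows": total, "num_labels": len(label_counts)}
    metrics = {
        "avg_text_length": sum(len(row["text"]) for row in rows) / total
        if total
        else 0.0
    }
    # One metric per label so class imbalance is visible in the MLflow UI.
    for label, count in sorted(label_counts.items()):
        metrics[f"label_{label}_fraction"] = count / total
    return {"params": params, "metrics": metrics}


def log_summary(
    summary: dict[str, dict[str, Any]],
    tracking_uri: str = "http://localhost:5000",  # assumed local server
    experiment: str = "fetch-data-demo",  # hypothetical experiment name
) -> None:
    """Send the summary to MLflow inside a single run."""
    # Imported here so the summary helper stays usable without MLflow installed.
    import mlflow

    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(experiment)
    with mlflow.start_run(run_name="fetch_data"):
        mlflow.log_params(summary["params"])
        mlflow.log_metrics(summary["metrics"])


if __name__ == "__main__":
    sample = [
        {"text": "A moving film.", "label": 1},
        {"text": "Flat and dull.", "label": 0},
    ]
    log_summary(summarize_dataset(sample))
```

Keeping the summary computation separate from the logging call makes it easy to unit test the numbers without a running tracking server.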
Contributions are always welcome. If you want to contribute, please:
- Fork the project.
- Create a new branch.
- Commit your changes.
- Push to the branch.
- Open a pull request.
MIT License. See LICENSE for more information.