Pipeline Breakdown:
- TPC-H Sample Data
- DuckDB and Pandas
- MinIO bucket
- Metabase Dashboard
TPC-H sample data is extracted from MySQL, transformed, and loaded into a MinIO bucket that serves as the data lakehouse, using Python, DuckDB, and Pandas. A simple dashboard is built with Metabase. Dagster manages the data pipeline tasks, ensuring consistent data processing. MinIO stores and manages the data as a self-hosted, S3-compatible object storage solution. The data pipeline components are containerized using Docker.
- Data is extracted from MySQL using SQLAlchemy.
- The extracted data is transformed and loaded into a MinIO bucket using DuckDB.
- The data pipeline tasks are managed using Dagster.
- Data is stored and managed using MinIO.
- The data pipeline components are containerized using Docker.
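The extract and load steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `MinioConfig` class, its default credentials, the bucket name `lakehouse`, and the helper names `s3_settings`, `parquet_uri`, `extract`, and `load` are all assumptions made for the example. It assumes `pandas`, `sqlalchemy`, a MySQL driver, and `duckdb` (with its `httpfs` extension) are available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinioConfig:
    """Hypothetical MinIO connection settings; replace with your own."""
    endpoint: str = "localhost:9000"
    access_key: str = "minioadmin"
    secret_key: str = "minioadmin"
    bucket: str = "lakehouse"
    use_ssl: bool = False


def s3_settings(cfg: MinioConfig) -> list:
    """Return DuckDB httpfs settings that point S3 access at a MinIO server."""
    return [
        f"SET s3_endpoint='{cfg.endpoint}'",
        f"SET s3_access_key_id='{cfg.access_key}'",
        f"SET s3_secret_access_key='{cfg.secret_key}'",
        f"SET s3_use_ssl={'true' if cfg.use_ssl else 'false'}",
        "SET s3_url_style='path'",  # MinIO is typically addressed path-style
    ]


def parquet_uri(cfg: MinioConfig, table: str) -> str:
    """Build the target object path for one table (assumed layout)."""
    return f"s3://{cfg.bucket}/{table}.parquet"


def extract(table: str, mysql_url: str):
    """Read one MySQL table into a Pandas DataFrame via SQLAlchemy."""
    import pandas as pd
    from sqlalchemy import create_engine, text

    engine = create_engine(mysql_url)
    with engine.connect() as conn:
        return pd.read_sql(text(f"SELECT * FROM {table}"), conn)


def load(df, table: str, cfg: MinioConfig) -> str:
    """Write a DataFrame to the MinIO bucket as Parquet using DuckDB."""
    import duckdb

    con = duckdb.connect()  # in-memory DuckDB database
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    for stmt in s3_settings(cfg):
        con.execute(stmt)
    con.register("src", df)  # expose the DataFrame to SQL as a view
    uri = parquet_uri(cfg, table)
    # Any transformation can go in this SELECT before the data is written.
    con.execute(f"COPY (SELECT * FROM src) TO '{uri}' (FORMAT PARQUET)")
    con.close()
    return uri
```

With a running MySQL and MinIO instance, a call such as `load(extract("orders", "mysql+pymysql://user:password@localhost:3306/tpch"), "orders", MinioConfig())` would copy the TPC-H `orders` table into the bucket. The connection URL here is a placeholder.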
1.) Why use DuckDB for ELT instead of Pandas or Polars? DuckDB has a SQL interface and is a full OLAP database with client APIs for multiple languages. This makes it a beginner-friendly choice for my first project using the data lakehouse architecture.
2.) Why use MinIO for data storage? MinIO provides a self-hosted, S3-compatible object storage solution, making it an ideal choice for managing and storing data locally.
This project was inspired by "Build a poor man’s data lake from scratch with DuckDB".