TPC-H Data Pipeline

Architecture

Pipeline Breakdown:

TPC-H Sample Data
DuckDB and Pandas
MinIO bucket
Metabase Dashboard

Overview

TPC-H sample data is extracted from MySQL, transformed with Python, DuckDB, and Pandas, and loaded into a MinIO bucket that serves as a data lakehouse. A simple dashboard is built on top with Metabase. Dagster orchestrates the pipeline tasks so that data is processed consistently. MinIO provides self-hosted, S3-compatible object storage, and all pipeline components are containerized with Docker.

ELT Flow

Data is extracted from MySQL using SQLAlchemy.
The extracted data is transformed and loaded into a MinIO bucket using DuckDB.
The data pipeline tasks are managed using Dagster.
Data is stored and managed using MinIO.
The data pipeline components are containerized using Docker.
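
The extract and load steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `pymysql` driver, the helper names, the `extracted` view name, and the one-Parquet-file-per-table layout are all assumptions, and the DuckDB connection passed in is assumed to be already configured to reach the MinIO endpoint.

```python
from urllib.parse import quote_plus


def mysql_url(user: str, password: str, host: str, port: int, database: str) -> str:
    """Build a SQLAlchemy URL for MySQL (the pymysql driver is an assumption)."""
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


def copy_to_bucket_sql(view_name: str, bucket: str, table: str) -> str:
    """Build a DuckDB COPY statement that writes a view to Parquet in a bucket."""
    return (
        f"COPY (SELECT * FROM {view_name}) "
        f"TO 's3://{bucket}/{table}.parquet' (FORMAT PARQUET)"
    )


def extract_and_load(url: str, con, table: str, bucket: str) -> None:
    """Extract one MySQL table and load it into the bucket through DuckDB."""
    # Imported lazily so the pure helpers above work without these packages.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine(url)
    df = pd.read_sql(f"SELECT * FROM {table}", engine)  # extract from MySQL
    con.register("extracted", df)  # expose the DataFrame to DuckDB as a view
    con.execute(copy_to_bucket_sql("extracted", bucket, table))  # write Parquet
```

In an orchestrated setup, a function like the hypothetical `extract_and_load` would typically be called once per TPC-H table from a pipeline task.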

Reflect

1.) Why use DuckDB for ELT instead of Pandas or Polars? DuckDB's main advantages are its SQL interface and the fact that it is a full OLAP database with client APIs for several languages. That made it a beginner-friendly choice for my first project built on a data lakehouse architecture.

2.) Why use MinIO for data storage? MinIO is a self-hosted, S3-compatible object store, which makes it a good fit for storing and managing the data locally while still working with S3-style tooling.
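
To illustrate both points together, here is a hedged sketch of DuckDB querying Parquet data in a MinIO bucket with plain SQL. It relies on DuckDB's `httpfs` extension and its `s3_*` settings; the function names, the bucket layout (`lineitem.parquet` at the bucket root), and the example query are assumptions for illustration only. The column names follow the TPC-H `lineitem` table.

```python
def minio_settings_sql(
    endpoint: str, access_key: str, secret_key: str, use_ssl: bool = False
) -> list[str]:
    """Statements that point DuckDB's S3 support at a MinIO endpoint."""
    return [
        "INSTALL httpfs",
        "LOAD httpfs",
        f"SET s3_endpoint='{endpoint}'",
        f"SET s3_access_key_id='{access_key}'",
        f"SET s3_secret_access_key='{secret_key}'",
        f"SET s3_use_ssl={'true' if use_ssl else 'false'}",
        # MinIO is usually addressed path-style rather than by bucket subdomain.
        "SET s3_url_style='path'",
    ]


def revenue_by_flag_sql(bucket: str) -> str:
    """A TPC-H-style aggregation read straight from Parquet in the bucket."""
    return (
        "SELECT l_returnflag, "
        "SUM(l_extendedprice * (1 - l_discount)) AS revenue "
        f"FROM read_parquet('s3://{bucket}/lineitem.parquet') "
        "GROUP BY l_returnflag ORDER BY l_returnflag"
    )


def query_lakehouse(endpoint: str, access_key: str, secret_key: str, bucket: str):
    """Run the aggregation against MinIO and return the result rows."""
    import duckdb  # imported lazily; the helpers above need no third-party packages

    con = duckdb.connect()
    for statement in minio_settings_sql(endpoint, access_key, secret_key):
        con.execute(statement)
    return con.execute(revenue_by_flag_sql(bucket)).fetchall()
```

Because the object store speaks the S3 protocol, the same SQL would work unchanged against another S3-compatible endpoint; only the settings would differ.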

Acknowledgment

This project was inspired by "Build a poor man’s data lake from scratch with DuckDB".


dedyoc/TPCH-Data-Pipeline
