This repository provides a comprehensive environment to learn and practice modern data engineering concepts using Docker Compose. With this setup, you'll gain hands-on experience working with some of the most popular tools in the data engineering ecosystem. Whether you’re a beginner or looking to expand your skills, this tech stack offers everything you need to work with ETL/ELT pipelines, data lakes, querying, and visualization.
- Apache Spark: Learn distributed data processing with one of the most powerful engines for large-scale data transformations.
- Apache Iceberg: Understand how to work with Iceberg tables for managing large datasets in data lakes with schema evolution and time travel.
- Project Nessie: Explore version control for your data lakehouse, allowing you to track changes to datasets just like Git for code.
- Apache Airflow: Master workflow orchestration and scheduling for complex ETL/ELT pipelines.
- Trino: Query your data from multiple sources (MinIO, PostgreSQL, Iceberg) with a fast federated SQL engine.
- Apache Superset: Create interactive dashboards and visualizations to analyze and present your data.
- MinIO: Learn about object storage and how it integrates with modern data pipelines, serving as your S3-compatible storage layer.
- PostgreSQL: Use a relational database for metadata management and storing structured data.
- Overview
- What You Will Learn
- Services
- How to Run
- Learning Objectives for Each Tool
- Environment Variables
- Troubleshooting
- License
- Purpose: Acts as an S3-compatible object storage layer for raw and processed data.
- What You’ll Learn:
- Uploading and managing files via the MinIO Console.
- Using MinIO as a source and destination for ETL pipelines.
- Console URL: http://localhost:9001
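To get a quick feel for the MinIO service described above, here is a minimal sketch of uploading a file through its S3-compatible API with boto3. The S3 endpoint port (9000), the bucket name, and the minioadmin credentials are assumptions, not values from this repo; use whatever you put in .env/minio.env.

```python
# Hedged sketch: upload a local file to MinIO over its S3-compatible API.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",    # S3 API port (the console runs on 9001)
    aws_access_key_id="minioadmin",          # placeholder, see .env/minio.env
    aws_secret_access_key="minioadmin",      # placeholder, see .env/minio.env
)

s3.create_bucket(Bucket="raw-data")                          # create a landing bucket
s3.upload_file("sales.csv", "raw-data", "sales/sales.csv")   # push a local file
print(s3.list_objects_v2(Bucket="raw-data")["KeyCount"])     # confirm the upload
```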
- Purpose: Workflow orchestration and ETL pipeline management.
- What You’ll Learn:
- Building and scheduling DAGs (Directed Acyclic Graphs) to automate workflows.
- Managing and monitoring pipelines through the Airflow web interface.
- Web Interface URL: http://localhost:8081
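As a starting point for Airflow, below is a minimal DAG sketch; drop a file like this into the dags/ folder that the Airflow container mounts. The dag_id, schedule, and task bodies are illustrative assumptions, and on Airflow versions older than 2.4 the scheduling argument is spelled schedule_interval.

```python
# Minimal illustrative DAG: two Python tasks chained extract -> transform.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull raw files from MinIO here")


def transform():
    print("run Spark or pandas transformations here")


with DAG(
    dag_id="example_etl",            # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # use schedule_interval on Airflow < 2.4
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task
```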
- Purpose: Run Spark and custom Python scripts for data ingestion and transformation.
- What You’ll Learn:
- Writing Spark jobs to process large-scale datasets.
- Using Python scripts to ingest and transform data into MinIO or PostgreSQL.
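A sketch of the kind of PySpark job this service runs: read raw CSVs from MinIO over s3a, apply a transformation, and write Parquet back. The bucket names, column names, endpoint, and credentials are assumptions; in this stack they would normally come from spark-defaults.conf or the container environment rather than being hard-coded.

```python
# Hedged PySpark sketch: CSV in MinIO -> cleaned Parquet in MinIO.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("ingest_sales")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")  # assumed service name
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")       # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")       # placeholder
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

raw = spark.read.option("header", True).csv("s3a://raw-data/sales/")
cleaned = (
    raw.dropna(subset=["order_id"])                          # hypothetical column
    .withColumn("amount", F.col("amount").cast("double"))    # hypothetical column
)
cleaned.write.mode("overwrite").parquet("s3a://processed-data/sales/")
```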
- Purpose: A relational database for storing metadata, structured data, and managing transactions.
- What You’ll Learn:
- Querying relational datasets using SQL.
- Storing and retrieving structured data for analysis or visualization.
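A small sketch of talking to the PostgreSQL service from Python with psycopg2. The database name, user, and password below are placeholders; take the real values from .env/postgres.env.

```python
# Hedged sketch: create a table, insert a row, and read it back.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="postgres",      # placeholder, see .env/postgres.env
    user="postgres",        # placeholder
    password="postgres",    # placeholder
)
with conn, conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS customers (id serial PRIMARY KEY, name text)")
    cur.execute("INSERT INTO customers (name) VALUES (%s)", ("Ada",))
    cur.execute("SELECT count(*) FROM customers")
    print(cur.fetchone()[0])
conn.close()
```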
- Purpose: Version control for data lakes and Iceberg tables.
- What You’ll Learn:
- Creating branches and commits for data changes.
- Rolling back or time-traveling to previous versions of datasets.
- API URL: http://localhost:19120
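To see what Nessie exposes, here is a hedged sketch that lists branches and tags over the REST API. The /api/v2/trees path applies to recent Nessie versions; older images expose /api/v1/trees instead.

```python
# Hedged sketch: list Nessie references (branches and tags) via its REST API.
import requests

resp = requests.get("http://localhost:19120/api/v2/trees")  # /api/v1/trees on older Nessie
resp.raise_for_status()
for ref in resp.json().get("references", []):
    print(ref.get("type"), ref.get("name"), ref.get("hash"))
```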
- Purpose: A federated query engine for SQL-based exploration of multiple data sources.
- What You’ll Learn:
- Querying data stored in MinIO, PostgreSQL, and Iceberg tables.
- Using SQL to join data from different sources.
- Web Interface URL: http://localhost:8080
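A sketch of querying Trino from Python with the trino client package. The user name is arbitrary (Trino accepts any user when no authentication is configured), and the catalogs you see depend on how this stack's Trino catalogs are set up.

```python
# Hedged sketch: connect to Trino and list the catalogs it can federate.
import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="admin")  # user is arbitrary
cur = conn.cursor()
cur.execute("SHOW CATALOGS")
for (catalog,) in cur.fetchall():
    print(catalog)
```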
- Purpose: Visualization and dashboarding tool for creating insights from data.
- What You’ll Learn:
- Building interactive dashboards connected to Trino and PostgreSQL.
- Analyzing data through visualizations.
- Dashboard URL: http://localhost:8088
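Superset connects to databases through SQLAlchemy URIs. The sketch below checks such a URI from Python before you paste it into Superset's database connection form. The catalog name iceberg and the user admin are assumptions, the trino package must be installed with its SQLAlchemy extra, and inside the Docker network Superset would use the Trino service name rather than localhost.

```python
# Hedged sketch: verify a Trino SQLAlchemy URI before configuring it in Superset.
from sqlalchemy import create_engine, text

SQLALCHEMY_URI = "trino://admin@localhost:8080/iceberg"  # assumed catalog and user

engine = create_engine(SQLALCHEMY_URI)
with engine.connect() as connection:
    print(connection.execute(text("SELECT 1")).scalar())  # should print 1
```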
- Clone the Repository
git clone https://github.com/username/Iceberg-trino-superset.git
cd Iceberg-trino-superset
- Set Environment Variables
- Fill in your configurations in .env/airflow.env, .env/minio.env, .env/postgres.env, etc.
- Build & Start Services
docker-compose up -d --build
- Access Services
- MinIO Console: http://localhost:9001
- Airflow Web UI: http://localhost:8081
- PostgreSQL: localhost:5432
- Nessie: http://localhost:19120
- Trino: http://localhost:8080
- Superset: http://localhost:8088
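As a convenience, here is a small sketch that pings the web endpoints listed above after `docker-compose up`. Some UIs redirect to a login page, so any HTTP response (including 3xx or 401) means the container is reachable.

```python
# Hedged sketch: check that the stack's web endpoints respond.
import requests

endpoints = {
    "MinIO Console": "http://localhost:9001",
    "Airflow": "http://localhost:8081",
    "Nessie": "http://localhost:19120",
    "Trino": "http://localhost:8080",
    "Superset": "http://localhost:8088",
}

for name, url in endpoints.items():
    try:
        status = requests.get(url, timeout=5).status_code
        print(f"{name:15s} {url:30s} HTTP {status}")
    except requests.exceptions.ConnectionError:
        print(f"{name:15s} {url:30s} not reachable yet")
```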
- Understand how distributed computing works: Learn how Apache Spark processes large datasets in parallel across multiple nodes, making it an essential tool for handling big data efficiently.
- Write transformations for ETL pipelines on large-scale datasets: Build and execute transformations that extract, clean, and load data into storage or analytics layers.
- Work with Iceberg tables for schema evolution and time travel: Gain experience managing dataset changes over time without breaking downstream dependencies.
- Manage partitions for optimized querying: Use Iceberg's built-in partitioning to improve query performance on large datasets.
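To make the Iceberg objectives concrete, here is a hedged Spark SQL sketch of a partitioned Iceberg table plus snapshot inspection. The catalog name nessie, the namespace, the table, and the warehouse path are all assumptions; in a stack like this the catalog wiring normally lives in spark-defaults.conf, and the Iceberg and Nessie runtime jars must be on Spark's classpath.

```python
# Hedged sketch: partitioned Iceberg table, snapshot inspection, and time travel.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg_demo")
    # Catalog wiring shown only to keep the sketch self-contained; names and URIs are assumptions.
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v1")  # path varies by version
    .config("spark.sql.catalog.nessie.ref", "main")
    .config("spark.sql.catalog.nessie.warehouse", "s3a://warehouse/")
    .getOrCreate()
)

# Hidden partitioning: queries filtering on event_ts prune partitions automatically.
spark.sql("""
    CREATE TABLE IF NOT EXISTS nessie.analytics.events (
        event_id BIGINT,
        event_ts TIMESTAMP,
        payload  STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Every write produces a snapshot; list them via the metadata table.
spark.sql("SELECT snapshot_id, committed_at FROM nessie.analytics.events.snapshots").show()

# Time travel: query the table as of an earlier snapshot id taken from the list above.
# spark.sql("SELECT * FROM nessie.analytics.events VERSION AS OF <snapshot_id>").show()
```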
- Learn to implement Git-like workflows for datasets: Track changes, create branches, and roll back changes to maintain consistency in your data pipelines.
- Manage branches, tags, and commits for Iceberg tables: Use Nessie to version-control datasets and simplify collaboration across teams.
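Building on the sketch above, the Nessie Spark SQL extensions add Git-like statements you can issue from the same Spark session. Treat this as a hedged outline only: it assumes the session also enables NessieSparkSessionExtensions via spark.sql.extensions, and the exact statement forms vary a little between Nessie versions.

```python
# Hedged sketch: Git-style branching for Iceberg tables through Nessie SQL extensions.
# Reuses the `spark` session from the previous sketch, with the Nessie extensions enabled.
spark.sql("CREATE BRANCH IF NOT EXISTS etl_dev IN nessie FROM main")  # branch off main
spark.sql("USE REFERENCE etl_dev IN nessie")                          # work on the branch
spark.sql("INSERT INTO nessie.analytics.events VALUES (1, current_timestamp(), 'test')")
spark.sql("MERGE BRANCH etl_dev INTO main IN nessie")                 # promote the changes
spark.sql("LIST REFERENCES IN nessie").show()                         # inspect branches and tags
```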
- Build workflows that orchestrate Spark, MinIO, and PostgreSQL: Automate complex data workflows involving multiple tools and dependencies.
- Monitor and troubleshoot DAG executions: Learn to manage and debug Directed Acyclic Graphs (DAGs) for efficient task scheduling.
- Query heterogeneous data sources (e.g., MinIO, PostgreSQL) in a unified SQL layer: Use Trino to query structured and semi-structured data seamlessly across multiple backends.
- Analyze Iceberg tables and large datasets efficiently: Leverage Trino's SQL engine to perform ad hoc analysis or integrate with BI tools.
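The federated part of these objectives looks roughly like the sketch below: a single SQL statement joining a PostgreSQL table with an Iceberg table through Trino. The catalog, schema, and table names are assumptions that depend on your Trino catalog configuration.

```python
# Hedged sketch: one SQL statement joining data from two different catalogs via Trino.
import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="admin")
cur = conn.cursor()
cur.execute("""
    SELECT c.name, count(*) AS order_count
    FROM postgresql.public.customers AS c          -- assumed catalog/schema/table
    JOIN iceberg.analytics.orders AS o             -- assumed catalog/schema/table
      ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY order_count DESC
""")
for row in cur.fetchall():
    print(row)
```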
- Build visualizations and interactive dashboards: Create engaging charts and dashboards to derive insights from your data.
- Connect Superset to Trino and PostgreSQL for real-time insights: Visualize data in near real-time to support decision-making.
- Store and retrieve files in an S3-compatible system: Manage object storage for raw and processed data in a local or cloud environment.
- Integrate MinIO with Spark and Airflow pipelines: Use MinIO as a central hub for ingesting, transforming, and exporting data in your workflows.