
Foundation Workspace for Airflow, Spark, Hive, and Azure Data Lake Gen2 via Docker

Welcome to the Foundation Workspace repository! This project aims to provide a comprehensive workspace environment for data engineering tasks involving Airflow, Spark, Hive, and Azure Data Lake Gen2. By leveraging Docker, users can easily set up a consistent environment with all necessary dependencies for their ETL (Extract, Transform, Load) workflows.

The workspace contains the following dependencies:

| Tool | Version | Description |
| --- | --- | --- |
| Docker | 24.0.7 | See Mac installation instructions. |
| Java 17 SDK | openjdk-17-jre-headless | Installed in the Dockerfile via `RUN apt-get install -y openjdk-17-jre-headless`. |
| Airflow | apache/airflow:2.8.4-python3.10 | Base image. See release history here. |
| Spark | 3.5.1 (bitnami/spark:latest) | See release history here. |
| Hive | apache/hive:4.0.0-alpha-2 | See release history here. |
| Azure Data Lake Gen2 | hadoop-azure-3.3.1.jar | The JAR must be configured during spark-submit; a sketch follows this table. |
| Python | 3.10 | Installed via the apache/airflow:2.8.4-python3.10 image. |
| PySpark | 3.5.1 | Must match the Spark version. |
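
The hadoop-azure JAR is what lets Spark read and write Azure Data Lake Gen2 paths over the ABFS driver. As a rough illustration of what that wiring looks like from a PySpark script, a session can be configured along these lines; the JAR path, storage account, container, and key below are placeholders rather than values taken from this repository, and hadoop-azure's transitive dependencies must also be available on the classpath:

```python
from pyspark.sql import SparkSession

# Minimal sketch, assuming the hadoop-azure JAR has been copied into the image.
# The JAR path, storage account, container, and key are placeholders -- adjust
# them to match your own environment and this workspace's actual layout.
spark = (
    SparkSession.builder
    .appName("adls-gen2-etl-example")
    .config("spark.jars", "/opt/airflow/src/jars/hadoop-azure-3.3.1.jar")  # assumed path
    .config(
        "spark.hadoop.fs.azure.account.key.<storage-account>.dfs.core.windows.net",
        "<storage-account-key>",
    )
    .getOrCreate()
)

# Read a source file from the Data Lake over the ABFS driver (path is illustrative).
df = spark.read.csv(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/raw/input.csv",
    header=True,
)
df.show()
```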

Features

  • Dockerized Environment: The project offers Docker containers configured with Airflow, Spark, Hive, and Azure Data Lake Gen2 dependencies, ensuring seamless setup across different platforms.

  • Complete ETL Examples: Explore two comprehensive ETL examples included in the repository:

    • Azure Data Lake Gen2: Connect and perform ETL operations using PySpark, demonstrating integration with Azure Data Lake Gen2.
    • Local Metastore: Work with a local metastore and perform ETL tasks using PySpark and Hive, showcasing flexibility in different data storage setups.
  • Diverse DAGs: Various types of Directed Acyclic Graphs (DAGs) are provided, incorporating Python operators and Bash operators to demonstrate different workflow configurations and task executions (a minimal sketch follows this list).

  • Configuration Files: Essential configuration files such as Dockerfile, Java layout, and Docker Compose files are included, simplifying setup and customization of the workspace environment.
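
To give a feel for the kind of DAGs included under /dags, here is a minimal sketch combining a PythonOperator and a BashOperator; the DAG id, schedule, and task bodies are illustrative rather than copied from the repository:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def print_hello():
    # Placeholder callable; a real DAG in this workspace would trigger PySpark ETL logic.
    print("hello from the Python task")


with DAG(
    dag_id="example_python_and_bash",  # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule=None,                     # trigger manually from the Airflow UI
    catchup=False,
) as dag:
    python_task = PythonOperator(task_id="python_task", python_callable=print_hello)
    bash_task = BashOperator(task_id="bash_task", bash_command="echo 'hello from bash'")

    # Run the Python task first, then the Bash task.
    python_task >> bash_task
```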

Getting Started

  • Clone Repo

Clone the repository to your local machine:

git clone https://github.com/aaliashraf/airflow-spark-hive-azure-docker-workflow.git

Navigate to the repository directory:

cd airflow-spark-hive-azure-docker-workflow

  • Build Docker Image

docker compose build

Run the following command to generate the .env file containing the required Airflow UID (on Linux hosts this should typically match your own user ID, i.e. the output of id -u):

echo AIRFLOW_UID=1000 > .env

  • Bring Up Container Services

docker compose up

Accessing Services

After starting the containers, you can access the services through their web UIs; the exact URLs depend on the port mappings defined in docker-compose.yaml.

  • Airflow (Username: airflow, Password: airflow)
  • Spark
  • Hive

Contents

  • /dags: Contains Airflow DAGs and workflows for ETL tasks.
  • /logs: Airflow logs.
  • /plugins: Airflow plugins.
  • /src: Utility scripts, PySpark code, and JARs.
  • /metastore: Contains the Hive database and tables locally (see the sketch after this list).
  • Dockerfile: Dockerfile for building the custom Docker image.
  • docker-compose.yaml: Docker Compose file for orchestrating containers.
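
For the local-metastore flavour of the ETL examples, the PySpark side boils down to enabling Hive support and writing a managed table. Below is a minimal sketch assuming a local warehouse directory; the warehouse path, database, and table names are placeholders, and the repository's actual scripts live under /src:

```python
from pyspark.sql import SparkSession

# Minimal sketch of PySpark + Hive against a local metastore. The warehouse
# directory, database, and table names are placeholders rather than this
# repository's real configuration.
spark = (
    SparkSession.builder
    .appName("local-metastore-etl-example")
    .config("spark.sql.warehouse.dir", "/opt/airflow/metastore")  # assumed location
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")  # illustrative database name

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Persist as a Hive-managed table so downstream tasks (or Hive itself) can query it.
df.write.mode("overwrite").saveAsTable("demo_db.users")
```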

have fun! 🚀🚀🚀