Project 3 - ETL with Apache Airflow and DBT

1. Context

The main challenge is to ingest data from a CSV file and an API using Apache Airflow, create a star schema with DBT, and display three graphs in a dashboard.

2. Project

Our solution includes the usage of five main technologies:

  • Jupyter: as the interactive platform to run Python
  • Jupyter Kernel Gateway: as the API to execute Jupyter notebooks remotely
  • Postgres: as the database to store the raw data and the star-schema models
  • Seaborn: as the data visualization tool
  • Apache Airflow: as the scheduling and orchestration tool

(Architecture diagram)

In Apache Airflow, a DAG called star_schema.py was implemented to perform the ETL (a minimal sketch of it follows the list below). It has the following operators:

  • run_jupyter_notebook: A PythonOperator that executes a function which makes a request to the jupyter_api service in order to execute the oo_etl_workflow.ipynb notebook (check more details regarding this notebook here). The jupyter_api service is an implementation of the Jupyter Kernel Gateway, a web server that provides headless access to Jupyter kernels, making it possible to execute notebooks through REST calls.

  • dbt_run: A simple BashOperator that runs the dbt run command, which creates the models in the Analytics database. Check the details of the star-schema model below. (Star-schema model diagram)

  • dbt_docs_generate: A simple BashOperator that runs the dbt docs generate command, which generates the artifact (catalog.json) that provides the DBT documentation. This artifact is used by the dbt_docs web service, which parses it and provides a user-friendly interface to its content.

  • data_visualization: The notebook that uses Seaborn to create the dashboard with the three graphs requested for this challenge. It serves both the procedural and the object-oriented notebooks.

    • Relation between the total of services provided by a bank and the number of complaints/issues.

    • Top banks with the most complaints/issues.

    • Top banks with free services (no fee).
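
As promised above, here is a minimal sketch of what a star_schema.py DAG along these lines could look like. This is not the repository's actual code: the gateway routes, the dbt project directory, and the schedule are assumptions.

```python
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

# Base URL of the jupyter_api (Jupyter Kernel Gateway) service; the host name
# and routes below are assumptions, not taken from the actual docker-compose file.
JUPYTER_API = "http://jupyter_api:8888"


def run_jupyter_notebook():
    # Ask the gateway to execute the ETL notebook (hypothetical route).
    requests.post(f"{JUPYTER_API}/oo_etl_workflow").raise_for_status()


def run_data_visualization():
    # Same pattern for the dashboard notebook (hypothetical route).
    requests.post(f"{JUPYTER_API}/data_visualization").raise_for_status()


with DAG(
    dag_id="star_schema",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # triggered manually
    catchup=False,
) as dag:
    etl = PythonOperator(
        task_id="run_jupyter_notebook",
        python_callable=run_jupyter_notebook,
    )
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /dbt",  # project dir is an assumption
    )
    dbt_docs_generate = BashOperator(
        task_id="dbt_docs_generate",
        bash_command="dbt docs generate --project-dir /dbt",  # writes catalog.json
    )
    data_visualization = PythonOperator(
        task_id="data_visualization",
        python_callable=run_data_visualization,
    )

    etl >> dbt_run >> dbt_docs_generate >> data_visualization
```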
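
Likewise, this is roughly the kind of Seaborn code the data_visualization notebook could use for the "top banks with the most complaints" chart. The connection string, table, and column names are hypothetical, not taken from the project.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

# Connection string is an assumption based on the Postgres service above.
engine = create_engine("postgresql://postgres:postgres@localhost:5432/analytics")

# fact_complaints, bank_name and total_complaints are hypothetical names.
df = pd.read_sql(
    "SELECT bank_name, total_complaints FROM fact_complaints "
    "ORDER BY total_complaints DESC LIMIT 10",
    engine,
)

sns.barplot(data=df, x="total_complaints", y="bank_name")
plt.title("Top banks with the most complaints")
plt.tight_layout()
plt.show()
```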

3. How to Run

3.1 Jupyter + Postgres

3.1.1 Requirements

  • Docker
  • Docker Compose

3.1.2 Executing the project

On your terminal, execute the following command:

$ docker-compose up -d --build

3.1.3 Accessing the services:

  1. Jupyter
url: http://localhost:8888/

Object Oriented Workflow ETL: http://localhost:8888/lab/tree/oo_etl_workflow.ipynb
Dashboard using Seaborn: http://localhost:8888/lab/tree/data_visualization.ipynb
  2. Airflow
url: http://localhost:8080/

user: admin
pwd: admin
  3. Dbt docs
url: http://localhost:5000/

3.2 Diagrams

3.2.1 Requirements

  • Python 3

3.2.2 Generating the Diagram

On your terminal, execute the following command:

$ python architecture_diagram/architecture.py
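
The diagram script itself is not shown in this README; below is a minimal sketch of what architecture_diagram/architecture.py could look like, assuming it uses the diagrams package (pip install diagrams; Graphviz must also be installed). The node classes and layout are assumptions based on the stack listed in section 2.

```python
from diagrams import Diagram
from diagrams.onprem.database import PostgreSQL
from diagrams.onprem.workflow import Airflow
from diagrams.programming.language import Python

with Diagram("ETL with Apache Airflow and DBT", show=False):
    airflow = Airflow("Apache Airflow")
    gateway = Python("Jupyter Kernel Gateway")  # no dedicated Jupyter node; Python stands in
    postgres = PostgreSQL("Postgres")

    # Airflow triggers the notebooks through the gateway, which writes to Postgres.
    airflow >> gateway >> postgres
```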