The main challenge is to ingest data from a CSV file and an API using Apache Airflow, use DBT to create a star schema, and display 3 graphs in a dashboard.
Our solution uses five main technologies:
- Jupyter: as the interactive platform to run Python
- Jupyter Kernel Gateway: as the API to execute Jupyter notebooks remotely
- Postgres: as the database to store the raw data and the star schema modeling
- Seaborn: as the data visualization tool
- Apache Airflow: as the scheduling and orchestration tool
In Apache Airflow, a DAG called star_schema.py was implemented to perform the ETL. It has the following operators:
- run_jupyter_notebook: A PythonOperator that executes a function which sends a request to the jupyter_api service in order to execute the oo_etl_workflow.ipynb notebook (check more details regarding this notebook here). The jupyter_api service is an implementation of the Jupyter Kernel Gateway, a web server that provides headless access to Jupyter kernels, making it possible to execute them through REST calls.
- dbt_run: A simple BashOperator that runs the dbt run command, which creates the models in the Analytics database. Check the details of the Star Schema Model below.
- dbt_docs_generate: A simple BashOperator that runs the dbt docs generate command, which generates the artifact (catalog.json) that provides the DBT documentation. This artifact is used by the dbt_docs web service, which parses it and provides a user-friendly interface to its content.
- data_visualization: The notebook that uses Seaborn to create the dashboard with the 3 graphs requested for this challenge. It serves both procedural and object-oriented notebooks.
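As a rough illustration of the run_jupyter_notebook step, the function called by the PythonOperator could look like the minimal sketch below. It is not the project's actual code. It assumes that the gateway is reachable from the Airflow container under the host name jupyter_api on port 8888, and that the notebook exposes an HTTP endpoint. The path /run_etl and the empty JSON payload are hypothetical placeholders.

```python
import json
import urllib.request

# Assumptions for illustration only: the host name, port and endpoint path
# are placeholders and must match the real jupyter_api service.
JUPYTER_API_URL = "http://jupyter_api:8888"
ETL_ENDPOINT = "/run_etl"  # hypothetical endpoint exposed by the notebook


def build_notebook_request(base_url=JUPYTER_API_URL, endpoint=ETL_ENDPOINT, payload=None):
    """Build (but do not send) the POST request that triggers the notebook."""
    body = json.dumps(payload or {}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_jupyter_notebook(timeout=600):
    """Send the request and return the response body as text.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses, so a failed
    notebook run would also fail the calling Airflow task.
    """
    request = build_notebook_request()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

In a DAG, a function like run_jupyter_notebook would be passed to the PythonOperator through its python_callable argument. Keeping the request construction in a separate function makes it possible to check the URL and payload without a running gateway.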
In your terminal, run the following command:
$ docker-compose up -d --build
- Jupyter
  - url: http://localhost:8888/
  - Object Oriented Workflow ETL: http://localhost:8888/lab/tree/oo_etl_workflow.ipynb
  - Dashboard using Seaborn: http://localhost:8888/lab/tree/data_visualization.ipynb
- Airflow
  - url: http://localhost:8888/
  - user: admin
  - pwd: admin
- Dbt docs
  - url: http://localhost:5000/
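Once the containers are running, a small script can confirm that the services answer on the URLs listed above. The following is only a sketch. The SERVICES mapping copies the Jupyter and Dbt docs URLs from the list, the helper names is_up and check_services are hypothetical, and any further service can be added with the port it is actually published on.

```python
import urllib.error
import urllib.request

# URLs copied from the list above; adjust them if the compose file
# publishes the services on different ports.
SERVICES = {
    "jupyter": "http://localhost:8888/",
    "dbt_docs": "http://localhost:5000/",
}


def is_up(url, opener=urllib.request.urlopen, timeout=5):
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as err:
        # A 4xx answer (for example a login page) still means the service is running.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar errors.
        return False


def check_services(services=SERVICES, opener=urllib.request.urlopen):
    """Map each service name to True (reachable) or False (not reachable)."""
    return {name: is_up(url, opener=opener) for name, url in services.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'up' if ok else 'down'}")
```

The opener parameter exists only so that the check can be exercised without network access. In normal use the defaults are enough.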
In your terminal, run the following command:
$ python architecture_diagram/architecture.py