
# Airflow Pipeline

An ETL pipeline built with Airflow. The project loads data from S3 buckets and writes it into staging, fact, and dimension tables. Using the `PostgresOperator` and `PythonOperator`, it creates the fact and dimension tables of a star schema.
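
For illustration, creating one of the star-schema tables with a `PostgresOperator` could look like the sketch below; the connection ID, table name, and columns are placeholders, and the import path assumes Airflow 1.x.

```python
from airflow.operators.postgres_operator import PostgresOperator  # Airflow 1.x import path

# Create a (hypothetical) fact table before loading it from the staging tables.
create_songplays_fact = PostgresOperator(
    task_id='create_songplays_fact_table',
    postgres_conn_id='redshift',   # placeholder connection ID
    sql="""
        CREATE TABLE IF NOT EXISTS songplays (
            songplay_id INT PRIMARY KEY,
            start_time  TIMESTAMP,
            user_id     INT,
            song_id     VARCHAR,
            artist_id   VARCHAR
        );
    """,
    dag=dag,
)
```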

The project also runs data quality checks on the dimension and fact tables, checking for null values and empty tables.
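
A minimal sketch of such a check using a `PythonOperator` and a `PostgresHook` (the table, key column, and connection ID are assumptions):

```python
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.python_operator import PythonOperator

def check_table_quality(table, key_column, conn_id='redshift', **kwargs):
    """Fail the task if the table is empty or the key column contains NULLs."""
    hook = PostgresHook(postgres_conn_id=conn_id)
    count = hook.get_records(f"SELECT COUNT(*) FROM {table}")[0][0]
    if count == 0:
        raise ValueError(f"Data quality check failed: {table} is empty")
    nulls = hook.get_records(f"SELECT COUNT(*) FROM {table} WHERE {key_column} IS NULL")[0][0]
    if nulls > 0:
        raise ValueError(f"Data quality check failed: {table}.{key_column} has {nulls} NULL values")

data_quality_users = PythonOperator(
    task_id='data_quality_check_users',
    python_callable=check_table_quality,
    op_kwargs={'table': 'users', 'key_column': 'user_id'},  # placeholder names
    dag=dag,
)
```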

## Graph View

### Main DAG

*(Screenshot: graph view of the main DAG)*

## Schedule

```python
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 11, 1),
    'end_date': datetime(2018, 11, 30),
    'email_on_retry': False,
    'email_on_failure': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'depends_on_past': False
}
```

Besides the default parameters above, the DAG runs hourly with at most one active run at a time.
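
A sketch of the corresponding DAG definition (the DAG ID and description are placeholders):

```python
from airflow import DAG

dag = DAG(
    'etl_pipeline',                  # placeholder DAG ID
    default_args=default_args,
    description='Load and transform data from S3 into a star schema',
    schedule_interval='@hourly',     # run every hour
    max_active_runs=1,               # at most one active DAG run at a time
)
```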

The `start_date` context is used to load the data (CSV files) from S3. All hooks are created with flexibility in mind: you can choose whether to delete existing rows or simply append new data to the dimension and fact tables, and you can use a different connection ID as well. Just make sure the hook fits your purpose.
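
A minimal sketch of that idea as a `PostgresHook`-backed helper (the function name, defaults, and connection ID are illustrative, not the project's exact interface):

```python
from airflow.hooks.postgres_hook import PostgresHook

def load_table(table, insert_sql, conn_id='redshift', append=False):
    """Load a fact or dimension table; truncate it first unless append mode is requested."""
    hook = PostgresHook(postgres_conn_id=conn_id)   # any configured connection ID works
    if not append:
        hook.run(f"TRUNCATE TABLE {table}")         # delete-and-load mode
    hook.run(f"INSERT INTO {table} {insert_sql}")   # insert_sql selects from the staging tables
```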

## Subdag

*(Screenshot: graph view of the dimension-tables subdag)*

This subdag writes all of the dimension tables. The entire subdag uses the LocalExecutor so that all of its tasks run at the same time.
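
For reference, wiring the subdag into the main DAG might look like this sketch (the subdag factory function and IDs are placeholders, and the imports assume Airflow 1.x, where `SubDagOperator` accepts an `executor` argument):

```python
from airflow.executors.local_executor import LocalExecutor
from airflow.operators.subdag_operator import SubDagOperator  # Airflow 1.x import path

load_dimensions = SubDagOperator(
    task_id='load_dimension_tables',
    subdag=load_dimensions_subdag(            # hypothetical factory returning the subdag
        parent_dag_name='etl_pipeline',
        task_id='load_dimension_tables',
        default_args=default_args,
    ),
    executor=LocalExecutor(),                 # run the subdag's tasks in parallel
    dag=dag,
)
```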

## Development

Want to contribute? Great! Please feel free to open issues and submit pull requests.