Week 2: Data Ingestion

Data Lake (GCS)

  • What is a Data Lake
  • ELT vs. ETL
  • Alternatives to components (S3/HDFS, Redshift, Snowflake etc.)

🎥 Video

Slides

Introduction to Workflow orchestration

  • What is an Orchestration Pipeline?
  • What is a DAG? (a minimal example follows after this list)
  • Video
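
To make the DAG concept concrete, here is a minimal sketch of an Airflow DAG with two dependent tasks. The DAG id, schedule, and bash commands are illustrative placeholders, not anything taken from the course materials.

```python
# Minimal Airflow DAG sketch: two tasks with an explicit dependency.
# The dag_id, schedule and task logic are placeholders for illustration only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_dag",            # hypothetical name
    schedule_interval="@daily",      # run once per day
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:

    download = BashOperator(
        task_id="download",
        bash_command="echo 'download the dataset here'",
    )

    process = BashOperator(
        task_id="process",
        bash_command="echo 'process the dataset here'",
    )

    # The >> operator defines an edge of the graph: download runs before process.
    download >> process
```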

Setting up Airflow locally

If you want to run a lighter version of Airflow with fewer services, check this video. It's optional.

Ingesting data to GCP with Airflow

  • Extraction: download and unpack the data
  • Pre-processing: convert the raw data to parquet
  • Upload the parquet files to GCS
  • Create an external table in BigQuery (these steps are sketched as a DAG after this list)
  • Video
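
Below is a rough sketch of how the four steps above could be wired together as one Airflow DAG, assuming the Google provider package, pyarrow, and the GCS client library are installed. The source URL, local paths, bucket, project, dataset, and table names are all placeholders; the DAG built in the video may organise its tasks differently.

```python
# Sketch of the GCS + BigQuery ingestion DAG following the four steps above.
# URL, paths, bucket, project, dataset and table names are placeholders.
import os
from datetime import datetime

import pyarrow.csv as pv
import pyarrow.parquet as pq
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateExternalTableOperator,
)
from google.cloud import storage

URL = "https://example.com/yellow_tripdata_2021-01.csv"  # placeholder source URL
LOCAL_CSV = "/opt/airflow/yellow_tripdata_2021-01.csv"   # placeholder local path
LOCAL_PARQUET = LOCAL_CSV.replace(".csv", ".parquet")
BUCKET = os.environ.get("GCP_GCS_BUCKET", "my-data-lake-bucket")  # placeholder
PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "my-project")       # placeholder
DATASET = "trips_data_all"                                        # placeholder


def format_to_parquet(src_file: str) -> None:
    """Convert the downloaded CSV to parquet with pyarrow."""
    table = pv.read_csv(src_file)
    pq.write_table(table, src_file.replace(".csv", ".parquet"))


def upload_to_gcs(bucket: str, object_name: str, local_file: str) -> None:
    """Upload a local file to the GCS bucket (the data lake)."""
    client = storage.Client()
    client.bucket(bucket).blob(object_name).upload_from_filename(local_file)


with DAG(
    dag_id="data_ingestion_gcs_dag",
    schedule_interval="@daily",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:

    download_dataset = BashOperator(
        task_id="download_dataset",
        bash_command=f"curl -sSLf {URL} > {LOCAL_CSV}",
    )

    format_to_parquet_task = PythonOperator(
        task_id="format_to_parquet",
        python_callable=format_to_parquet,
        op_kwargs={"src_file": LOCAL_CSV},
    )

    upload_to_gcs_task = PythonOperator(
        task_id="upload_to_gcs",
        python_callable=upload_to_gcs,
        op_kwargs={
            "bucket": BUCKET,
            "object_name": f"raw/{os.path.basename(LOCAL_PARQUET)}",
            "local_file": LOCAL_PARQUET,
        },
    )

    create_external_table = BigQueryCreateExternalTableOperator(
        task_id="create_external_table",
        table_resource={
            "tableReference": {
                "projectId": PROJECT_ID,
                "datasetId": DATASET,
                "tableId": "external_yellow_tripdata",
            },
            "externalDataConfiguration": {
                "sourceFormat": "PARQUET",
                "sourceUris": [f"gs://{BUCKET}/raw/*.parquet"],
            },
        },
    )

    download_dataset >> format_to_parquet_task >> upload_to_gcs_task >> create_external_table
```

The external table points BigQuery at the parquet files sitting in GCS, so they can be queried without first loading them into BigQuery storage.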

Ingesting data to Local Postgres with Airflow

  • Converting the script that ingests data into Postgres into an Airflow DAG (see the sketch after this list)
  • Video
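
Below is a condensed sketch of what the converted DAG could look like: a download task followed by a PythonOperator that loads the CSV into Postgres in chunks with pandas and SQLAlchemy. The URL, file path, table name, and connection parameters are placeholders; in practice they would come from your Airflow environment rather than being hard-coded.

```python
# Sketch of ingesting the data into a local Postgres instance from Airflow.
# URL, paths, table name and connection details are placeholders.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from sqlalchemy import create_engine

CSV_FILE = "/opt/airflow/output.csv"   # placeholder local path
TABLE_NAME = "yellow_taxi_data"        # placeholder table name


def ingest_to_postgres(user, password, host, port, db, table_name, csv_file):
    """Load the CSV into Postgres in chunks so memory usage stays bounded."""
    engine = create_engine(f"postgresql://{user}:{password}@{host}:{port}/{db}")
    with pd.read_csv(csv_file, iterator=True, chunksize=100_000) as reader:
        for i, chunk in enumerate(reader):
            # The first chunk (re)creates the table, later chunks append to it.
            chunk.to_sql(table_name, engine,
                         if_exists="replace" if i == 0 else "append", index=False)


with DAG(
    dag_id="local_ingestion_dag",
    schedule_interval="@monthly",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:

    download = BashOperator(
        task_id="download",
        bash_command=f"curl -sSLf https://example.com/data.csv > {CSV_FILE}",  # placeholder URL
    )

    ingest = PythonOperator(
        task_id="ingest",
        python_callable=ingest_to_postgres,
        op_kwargs=dict(
            user="root", password="root", host="pgdatabase", port=5432,  # placeholders
            db="ny_taxi", table_name=TABLE_NAME, csv_file=CSV_FILE,
        ),
    )

    download >> ingest
```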

Transfer service (AWS -> GCP)

Moving files from AWS to GCP.

You will need an AWS account for this. This section is optional.
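
The video walks through setting up GCP's Transfer Service; for reference, a one-time S3 to GCS transfer job can also be created programmatically with the google-cloud-storage-transfer client library. Everything below (project, buckets, credentials) is a placeholder, and this sketch is not the exact setup shown in the video.

```python
# Sketch of a one-time S3 -> GCS transfer job with the Storage Transfer Service
# Python client (pip install google-cloud-storage-transfer).
# Project, bucket names and credentials are placeholders; never commit real keys.
from datetime import datetime

from google.cloud import storage_transfer

PROJECT_ID = "my-gcp-project"          # placeholder
SOURCE_S3_BUCKET = "my-aws-bucket"     # placeholder
SINK_GCS_BUCKET = "my-gcs-data-lake"   # placeholder
AWS_ACCESS_KEY_ID = "..."              # placeholder: read from a secret store
AWS_SECRET_ACCESS_KEY = "..."          # placeholder: read from a secret store

client = storage_transfer.StorageTransferServiceClient()

today = datetime.utcnow()
one_time_schedule = {"day": today.day, "month": today.month, "year": today.year}

request = storage_transfer.CreateTransferJobRequest(
    {
        "transfer_job": {
            "project_id": PROJECT_ID,
            "description": "One-time transfer from S3 to GCS",
            "status": storage_transfer.TransferJob.Status.ENABLED,
            # Same start and end date -> the job runs exactly once.
            "schedule": {
                "schedule_start_date": one_time_schedule,
                "schedule_end_date": one_time_schedule,
            },
            "transfer_spec": {
                "aws_s3_data_source": {
                    "bucket_name": SOURCE_S3_BUCKET,
                    "aws_access_key": {
                        "access_key_id": AWS_ACCESS_KEY_ID,
                        "secret_access_key": AWS_SECRET_ACCESS_KEY,
                    },
                },
                "gcs_data_sink": {"bucket_name": SINK_GCS_BUCKET},
            },
        }
    }
)

transfer_job = client.create_transfer_job(request)
print(f"Created transfer job: {transfer_job.name}")
```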

Homework

In the homework, you'll create a few DAGs for processing the NY Taxi data for 2019-2021.

More information here
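
One way to approach the scheduling part of the homework: give the DAG a monthly schedule with catchup enabled, and template the download URL with the run's logical date so each run picks up its own month. The URL pattern, paths, and DAG id below are placeholders, not the homework solution.

```python
# Sketch of scheduling monthly runs over 2019-2021 with templating and catchup.
# The URL pattern, output path and dag_id are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

URL_TEMPLATE = (
    "https://example.com/yellow_tripdata_"
    "{{ execution_date.strftime('%Y-%m') }}.csv"   # placeholder URL pattern
)
OUTPUT_TEMPLATE = "/opt/airflow/output_{{ execution_date.strftime('%Y-%m') }}.csv"

with DAG(
    dag_id="homework_yellow_taxi_dag",     # hypothetical name
    schedule_interval="0 6 2 * *",         # 06:00 on the 2nd of every month
    start_date=datetime(2019, 1, 1),
    end_date=datetime(2021, 12, 31),
    catchup=True,                          # backfill every month since start_date
) as dag:

    download = BashOperator(
        task_id="download",
        bash_command=f"curl -sSLf {URL_TEMPLATE} > {OUTPUT_TEMPLATE}",
    )
```

With catchup=True, Airflow backfills one run per month between start_date and end_date, so a single DAG definition covers all of 2019-2021.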

Community notes

Did you take notes? You can share them here.