https://github.com/DataTalksClub/data-engineering-zoomcamp
- Video
- Slides
- Overview of Architecture, Technologies & Pre-Requisites
- Introduction to Docker
- Why do we need Docker
- Creating a simple "data pipeline" in Docker
- Ingesting NY Taxi Data to Postgres
- Running Postgres locally with Docker
- Using `pgcli` for connecting to the database
- Exploring the NY Taxi dataset
- Ingesting the data into the database
- Note: if you have problems with `pgcli`, check this video for an alternative way to connect to your database
- Connecting pgAdmin and Postgres
- The pgAdmin tool
- Docker networks
- Putting the ingestion script into Docker
- Converting the Jupyter notebook to a Python script
- Parametrizing the script with argparse
- Dockerizing the ingestion script
- Running Postgres and pgAdmin with Docker-Compose
- Why do we need Docker-compose
- Docker-compose YAML file
- Running multiple containers with `docker-compose up`
- SQL refresher
- Adding the Zones table
- Inner joins
- Basic data quality checks
- Left, Right and Outer joins
- Group by
- Optional: If you have problems with Docker networking, check Port Mapping and Networks in Docker
- Docker networks
- Port forwarding to the host environment
- Communicating between containers in the network
- `.dockerignore` file
- Optional: If you want to follow the steps from "Ingesting NY Taxi Data to Postgres" through "Running Postgres and pgAdmin with Docker-Compose" on Windows Subsystem for Linux (WSL), check Docker Module Walk-Through on WSL
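
As a rough illustration of the "Parametrizing the script with argparse" step listed above, here is a minimal sketch of how an ingestion script might accept its connection settings from the command line. The argument names (`--user`, `--table-name`, `--url`, and so on), the defaults, and the `connection_url` helper are assumptions made for this example and are not taken from the course materials.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser for a hypothetical CSV-to-Postgres ingestion script."""
    parser = argparse.ArgumentParser(
        description="Ingest CSV data into Postgres (illustrative sketch)"
    )
    # All argument names below are hypothetical, chosen for illustration only.
    parser.add_argument("--user", required=True, help="Postgres user name")
    parser.add_argument("--password", required=True, help="Postgres password")
    parser.add_argument("--host", default="localhost", help="Postgres host")
    parser.add_argument("--port", type=int, default=5432, help="Postgres port")
    parser.add_argument("--db", required=True, help="database name")
    parser.add_argument("--table-name", required=True, help="destination table")
    parser.add_argument("--url", required=True, help="URL of the CSV file to ingest")
    return parser


def connection_url(args: argparse.Namespace) -> str:
    """Assemble a Postgres connection URL from the parsed arguments."""
    return (
        f"postgresql://{args.user}:{args.password}"
        f"@{args.host}:{args.port}/{args.db}"
    )


# Parse an explicit argument list here instead of sys.argv, so the sketch
# runs without any command-line input. The values are placeholders.
example_args = build_parser().parse_args(
    [
        "--user", "root",
        "--password", "root",
        "--db", "ny_taxi",
        "--table-name", "trips",
        "--url", "https://example.com/data.csv",
    ]
)
print(connection_url(example_args))
```

In a real script, `parse_args()` would be called without arguments so the values come from the command line (or from the `docker run` arguments once the script is dockerized).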