A serverless ETL pipeline powered by AWS services for automated data extraction, transformation, and loading. Built with S3, Lambda, Glue, EventBridge, SNS, SQS, Terraform, and Python.
- Goals & MVP
- Tech Stack
- How To Use
- Design Goals
- Project Features
- Additions & Improvements
- Learning Highlights
- Known Issues
- Challenges
The goal of this project is to create a scalable, automated ETL pipeline that handles raw data ingestion, metadata extraction, transformation, and loading of processed data into structured storage. Automating this workflow with AWS Glue, Lambda, and EventBridge keeps the process efficient, fault-tolerant, and easy to monitor.
- Amazon S3
- AWS Lambda
- AWS Glue
- AWS Glue Data Catalog
- Amazon EventBridge
- Amazon SNS
- Amazon SQS
- Terraform
- Python (Boto3)
- Upload raw data files to the S3 Raw Zone bucket to trigger the ETL pipeline automatically.
- The AWS Lambda function starts the Glue Crawler to catalog metadata and then launches the Glue ETL job to process the data (see the handler sketch after this list).
- Processed data lands in the S3 Processed Zone bucket, and an SNS notification alerts subscribers.
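
A minimal sketch of what the trigger function might look like, assuming it receives a standard S3 `ObjectCreated` event and that the crawler and job names (`raw-zone-crawler`, `raw-to-processed-etl`) are placeholders injected as environment variables; in the actual pipeline the event may arrive via EventBridge or SQS instead.

```python
# lambda_function.py -- hedged sketch of the trigger handler.
# Assumes an S3 ObjectCreated event payload; the crawler/job names below
# are placeholders that would normally be supplied by Terraform.
import os
import urllib.parse

import boto3

glue = boto3.client("glue")

CRAWLER_NAME = os.environ.get("GLUE_CRAWLER_NAME", "raw-zone-crawler")
ETL_JOB_NAME = os.environ.get("GLUE_JOB_NAME", "raw-to-processed-etl")


def lambda_handler(event, context):
    # Pull the uploaded object's location out of the S3 event record.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # 1) Refresh the Data Catalog so the new object's schema is discovered.
    glue.start_crawler(Name=CRAWLER_NAME)

    # 2) Kick off the Glue ETL job, passing the new object as job arguments.
    run = glue.start_job_run(
        JobName=ETL_JOB_NAME,
        Arguments={"--source_bucket": bucket, "--source_key": key},
    )

    return {"job_run_id": run["JobRunId"], "source": f"s3://{bucket}/{key}"}
```

Starting both the crawler and the job from a single handler keeps the flow simple; the Known Issues section below notes the concurrency caveat this introduces.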
- Scalability: Design the ETL pipeline to scale as data volumes increase.
- Fault Tolerance: Use SQS and event-driven architecture to handle failures gracefully.
- Cost-Efficiency: Leverage serverless resources so the pipeline incurs minimal cost while idle.
- Automation: Build a fully automated pipeline that requires minimal manual intervention.
- Automated ingestion of raw data and schema discovery with Glue Crawler
- Serverless data transformation with Glue ETL
- Notifications sent to subscribers via SNS upon job completion (a publish sketch follows this list)
- Decoupled messaging using Amazon SQS for reliability
- Infrastructure provisioned using Terraform for easy setup and teardown
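
The completion notification can be wired in a few ways (for example, an EventBridge rule on Glue job state changes targeting the topic). The hedged sketch below shows the equivalent `boto3` publish a completion-handler Lambda could make; the topic ARN and message shape are assumptions, not values taken from this repo.

```python
# notify_completion.py -- hedged sketch of publishing a job-completion
# message to SNS. SQS queues subscribed to the topic receive a copy,
# decoupling downstream consumers. The topic ARN is a placeholder.
import json
import os

import boto3

sns = boto3.client("sns")
TOPIC_ARN = os.environ["ETL_NOTIFICATIONS_TOPIC_ARN"]


def notify(job_name: str, job_run_id: str, state: str) -> None:
    """Publish a completion message so subscribers (email, SQS) are alerted."""
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject=f"Glue job {job_name} finished with state {state}",
        Message=json.dumps({"job": job_name, "run_id": job_run_id, "state": state}),
    )
```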
- Implement data quality checks within the ETL process.
- Add support for data versioning in the S3 Processed Zone.
- Integrate CloudWatch for enhanced logging and monitoring of ETL jobs.
- Add support for multiple data formats (e.g., CSV, Parquet) and dynamic schema handling.
- Building complex cloud infrastructure from small, modular steps
- Setting up and configuring AWS Glue for ETL tasks
- Using Terraform to provision AWS resources and manage infrastructure as code
- Implementing event-driven architectures with EventBridge and SQS
- Managing IAM roles and permissions for AWS resources
- Potential race conditions when multiple files are uploaded at once, since a Glue Crawler can only run one crawl at a time (see the retry sketch after this list).
- Lambda functions may time out on very large files, requiring higher timeout and memory settings.
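
One way to soften the crawler race condition is to treat "crawler already running" as a retryable state rather than an error: a crawl already in progress will usually pick up newly arrived objects, and SQS redelivery can retry otherwise. This is a sketch under those assumptions, not the code currently in the repository.

```python
# Hedged sketch: tolerate concurrent uploads by catching the "already
# running" error boto3 raises when a second start_crawler call races the first.
import boto3

glue = boto3.client("glue")


def start_crawler_safely(crawler_name: str) -> bool:
    """Return True if a new crawl was started, False if one is already in
    progress (safe to skip or requeue, since the running crawl will catalog
    the newly uploaded objects)."""
    try:
        glue.start_crawler(Name=crawler_name)
        return True
    except glue.exceptions.CrawlerRunningException:
        return False
```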
- Writing glue_etl_script.py to implement the desired transformations (a minimal sketch follows this list)
- Careful IAM permission provisioning
- Defining process-driven Lambda functions
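
A minimal sketch of what glue_etl_script.py might contain: read the table the crawler registered in the Data Catalog, apply a simple transformation, and write Parquet to the processed zone. The database, table, column, and bucket names here are placeholders, not the repository's actual values.

```python
# glue_etl_script.py -- minimal sketch; all names are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "processed_bucket"])

glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table the crawler registered in the Glue Data Catalog.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_zone_db",   # placeholder database name
    table_name="raw_data",    # placeholder table name
)

# Example transformation: keep only records with a non-null "id" field
# ("id" is a placeholder column).
cleaned = raw.filter(lambda record: record["id"] is not None)

# Write the processed output as Parquet to the processed-zone bucket.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": f"s3://{args['processed_bucket']}/output/"},
    format="parquet",
)

job.commit()
```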
- Visit my LinkedIn for more details.
- Check out my GitHub for more projects.
- Or send me an email at [email protected]
Thanks for your interest in this project. Feel free to reach out with any thoughts or questions.
Oliver Jenkins © 2024