A serverless ETL pipeline powered by AWS services for automated data extraction, transformation, and loading. Built with S3, Lambda, Glue, EventBridge, SNS, SQS, Terraform, and Python.
- Goals & MVP
- Tech Stack
- How To Use
- Design Goals
- Project Features
- Additions & Improvements
- Learning Highlights
- Known Issues
- Challenges
The goal of this project is to create a scalable, automated ETL pipeline that handles raw data ingestion, metadata extraction, transformation, and loading of processed data into structured storage. Automating this workflow with AWS Glue, Lambda, and EventBridge keeps the process efficient, fault-tolerant, and easy to monitor.
- Amazon S3
- AWS Lambda
- AWS Glue
- AWS Glue Data Catalog
- Amazon EventBridge
- Amazon SNS
- Amazon SQS
- Terraform
- Python (Boto3)
- Upload raw data files to the S3 Raw Zone bucket to trigger the ETL pipeline automatically.
- The AWS Lambda function starts the Glue Crawler to catalog metadata and then launches the Glue ETL job to process the data (see the handler sketch after this list).
- Processed data lands in the S3 Processed Zone bucket, and an SNS notification alerts subscribers.
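
A minimal sketch of what the trigger function might look like, assuming it receives a standard S3 `ObjectCreated` event and that the crawler and job names (`raw-zone-crawler`, `raw-to-processed-etl`) are placeholders injected as environment variables; in the actual pipeline the event may arrive via EventBridge or SQS instead.

```python
# lambda_function.py -- hedged sketch of the trigger handler.
# Assumes an S3 ObjectCreated event payload; the crawler/job names below
# are placeholders that would normally be supplied by Terraform.
import os
import urllib.parse

import boto3

glue = boto3.client("glue")

CRAWLER_NAME = os.environ.get("GLUE_CRAWLER_NAME", "raw-zone-crawler")
ETL_JOB_NAME = os.environ.get("GLUE_JOB_NAME", "raw-to-processed-etl")


def lambda_handler(event, context):
    # Pull the uploaded object's location out of the S3 event record.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # 1) Refresh the Data Catalog so the new object's schema is discovered.
    glue.start_crawler(Name=CRAWLER_NAME)

    # 2) Kick off the Glue ETL job, passing the new object as job arguments.
    run = glue.start_job_run(
        JobName=ETL_JOB_NAME,
        Arguments={"--source_bucket": bucket, "--source_key": key},
    )

    return {"job_run_id": run["JobRunId"], "source": f"s3://{bucket}/{key}"}
```

Starting both the crawler and the job from a single handler keeps the flow simple; the Known Issues section below notes the concurrency caveat this introduces.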
- Scalability: Design the ETL pipeline to scale as data volumes increase.
- Fault Tolerance: Use SQS and event-driven architecture to handle failures gracefully.
- Cost-Efficiency: Leverage serverless resources so the pipeline incurs minimal cost while idle.
- Automation: Build a fully automated pipeline that requires minimal manual intervention.
- Automated ingestion of raw data and schema discovery with Glue Crawler
- Serverless data transformation with Glue ETL
- Notifications sent to subscribers via SNS upon job completion (a publish sketch follows this list)
- Decoupled messaging using Amazon SQS for reliability
- Infrastructure provisioned using Terraform for easy setup and teardown
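
The completion notification can be wired in a few ways (for example, an EventBridge rule on Glue job state changes targeting the topic). The hedged sketch below shows the equivalent `boto3` publish a completion-handler Lambda could make; the topic ARN and message shape are assumptions, not values taken from this repo.

```python
# notify_completion.py -- hedged sketch of publishing a job-completion
# message to SNS. SQS queues subscribed to the topic receive a copy,
# decoupling downstream consumers. The topic ARN is a placeholder.
import json
import os

import boto3

sns = boto3.client("sns")
TOPIC_ARN = os.environ["ETL_NOTIFICATIONS_TOPIC_ARN"]


def notify(job_name: str, job_run_id: str, state: str) -> None:
    """Publish a completion message so subscribers (email, SQS) are alerted."""
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject=f"Glue job {job_name} finished with state {state}",
        Message=json.dumps({"job": job_name, "run_id": job_run_id, "state": state}),
    )
```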
- Implement data quality checks within the ETL process.
- Add support for data versioning in the S3 Processed Zone.
- Integrate CloudWatch for enhanced logging and monitoring of ETL jobs.
- Add support for multiple data formats (e.g., CSV, Parquet) and dynamic schema handling.
- Building complex cloud infrastructure from small, modular steps
- Setting up and configuring AWS Glue for ETL tasks
- Using Terraform to provision AWS resources and manage infrastructure as code
- Implementing event-driven architectures with EventBridge and SQS
- Managing IAM roles and permissions for AWS resources
- Potential race conditions when multiple files are uploaded at once, since a Glue Crawler can only run one crawl at a time (see the retry sketch after this list).
- Lambda functions may time out on very large files, requiring higher timeout and memory settings.
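
One way to soften the crawler race condition is to treat "crawler already running" as a retryable state rather than an error: a crawl already in progress will usually pick up newly arrived objects, and SQS redelivery can retry otherwise. This is a sketch under those assumptions, not the code currently in the repository.

```python
# Hedged sketch: tolerate concurrent uploads by catching the "already
# running" error boto3 raises when a second start_crawler call races the first.
import boto3

glue = boto3.client("glue")


def start_crawler_safely(crawler_name: str) -> bool:
    """Return True if a new crawl was started, False if one is already in
    progress (safe to skip or requeue, since the running crawl will catalog
    the newly uploaded objects)."""
    try:
        glue.start_crawler(Name=crawler_name)
        return True
    except glue.exceptions.CrawlerRunningException:
        return False
```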
- Writing glue_etl_script.py to implement the desired transformations (a minimal sketch follows this list)
- Careful IAM permission provisioning
- Defining process-driven Lambda functions
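
A minimal sketch of what glue_etl_script.py might contain: read the table the crawler registered in the Data Catalog, apply a simple transformation, and write Parquet to the processed zone. The database, table, column, and bucket names here are placeholders, not the repository's actual values.

```python
# glue_etl_script.py -- minimal sketch; all names are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "processed_bucket"])

glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table the crawler registered in the Glue Data Catalog.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_zone_db",   # placeholder database name
    table_name="raw_data",    # placeholder table name
)

# Example transformation: keep only records with a non-null "id" field
# ("id" is a placeholder column).
cleaned = raw.filter(lambda record: record["id"] is not None)

# Write the processed output as Parquet to the processed-zone bucket.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": f"s3://{args['processed_bucket']}/output/"},
    format="parquet",
)

job.commit()
```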
- Visit my LinkedIn for more details.
- Check out my GitHub for more projects.
- Or send me an email at [email protected]
Thanks for your interest in this project. Feel free to reach out with any thoughts or questions.
Oliver Jenkins © 2024