
Data engineering project

Using AWS S3, SQL, Python, and PySpark, this project builds data pipelines to provide the analytics department with the required formatted data.

Objective

This project takes a look at data samples gathered from different sources and aims to provide a pipeline that facilitates the analysis of said data.

The Capstone project template explains in detail how the project was formulated and describes the tools, steps, and other resources used.

The data analysis folder contains visuals and a report built on the final version of the data using Power BI.

The etl.py script automates the ETL process for this project and stores the resulting data in AWS S3.
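As a rough illustration of the load step, here is a minimal sketch that pushes local Parquet output to S3 with boto3; the bucket name, paths, and helper function are hypothetical, not taken from etl.py:

```python
# Minimal sketch of the load step: upload a local Parquet output directory
# to S3. The bucket, prefix, and paths below are hypothetical placeholders.
import os
import boto3

def upload_parquet_dir(local_dir: str, bucket: str, prefix: str) -> None:
    """Upload every file under a local Parquet directory to an S3 prefix."""
    s3 = boto3.client("s3")
    for root, _, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            key = f"{prefix}/{os.path.relpath(path, local_dir)}"
            s3.upload_file(path, bucket, key)

# Hypothetical usage:
# upload_parquet_dir("output/fact_events.parquet", "analytics-bucket", "tables/fact_events")
```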

Summary

  • This project makes use of several Big Data processing technologies, including:
    • Apache Spark, for its ability to process massive amounts of data through a unified analytics engine with convenient APIs (see the sketch after this list)
    • Pandas, for its convenient dataframe manipulation functions
    • Matplotlib, to plot the data and gain further insights
    • AWS S3, chosen for its storage capacity and accessibility; it hosts the final formatted tables, which track the data monthly
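A minimal sketch of how these libraries fit together; the file path and column names are placeholders, not the project's actual datasets:

```python
# Minimal sketch tying the three libraries together; the path and column
# names are placeholders, not the project's actual datasets.
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt

spark = SparkSession.builder.appName("data-engineering-project").getOrCreate()

# Spark handles the heavy lifting on the raw files...
df = spark.read.csv("data/sample.csv", header=True, inferSchema=True)

# ...while small aggregates are pulled into Pandas for quick plotting.
counts = df.groupBy("category").count().toPandas()
counts.plot.bar(x="category", y="count")
plt.show()
```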

Datasets:

Data model

A star schema was chosen to host the data needed from the stated datasets. The following figure describes the schema:

[Figure: star schema diagram]
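Purely as an illustration of the idea, the sketch below registers cleaned data as a staging view and selects one hypothetical dimension out of it; the actual tables and columns are the ones shown in the figure:

```python
# Hypothetical star-schema illustration: register cleaned data as a staging
# view, then select the distinct attributes of one dimension out of it.
# All table, column, and path names here are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-engineering-project").getOrCreate()
cleaned = spark.read.parquet("output/cleaned_events.parquet")  # placeholder input

cleaned.createOrReplaceTempView("staging_events")
dim_users = spark.sql("""
    SELECT DISTINCT user_id, first_name, last_name
    FROM staging_events
    WHERE user_id IS NOT NULL
""")
```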

  • The steps for this project are as follows:
    • Extract the data from the datasets and perform preprocessing tasks on it
    • Create temporary staging tables to host the cleaned data
    • Use the staging tables to populate the star schema
    • Store the fact and dimension tables in Parquet form and upload them to S3

The following diagram shows these steps in visual form; a code sketch follows it:

[Figure: ETL pipeline steps]
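Continuing the hypothetical names from the schema sketch above, steps 3 and 4 could look roughly like this; the actual logic lives in etl.py:

```python
# Hypothetical sketch of steps 3-4: populate the fact table from the staging
# view, then persist it as Parquet for the S3 upload shown earlier.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-engineering-project").getOrCreate()
spark.read.parquet("output/cleaned_events.parquet") \
     .createOrReplaceTempView("staging_events")

fact_events = spark.sql("""
    SELECT event_id, user_id, timestamp, amount
    FROM staging_events
""")

fact_events.write.mode("overwrite").parquet("output/fact_events.parquet")
```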

Analysis:

The data analysis markdown file contains visuals and information that answer several analytical questions about the data. These visuals were created with Power BI by connecting the S3 bucket to Power BI web:

[Figure: Power BI analysis visuals]

Note:

The etl.py script automates the ETL process of the project so the data can be updated regularly.
