
# Data engineering project

## Objective

This project examines data samples gathered from several sources and provides a pipeline that facilitates the analysis of that data.

The Capstone project template explains in detail how the project was formulated and describes the tools, steps, and other resources used for this project.

The data analysis folder contains visuals and a report built on the final version of the data using Power BI.

The etl.py script automates the ETL process for this project and stores the data in AWS S3.
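The README does not show the internals of etl.py, so the following is only an illustrative sketch of the final storage step. The bucket name, key layout, and helper name are assumptions, not taken from the project's code:

```python
def upload_monthly_snapshot(body: bytes, bucket: str, table: str,
                            month: str, s3_client=None) -> str:
    """Upload one serialised table to S3 under a month-keyed prefix.

    Hypothetical helper: the real etl.py may name and organise this
    differently. `body` is the already-serialised Parquet payload.
    """
    if s3_client is None:
        import boto3  # real AWS client only when none is injected
        s3_client = boto3.client("s3")
    key = f"tables/{month}/{table}.parquet"  # assumed key layout
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return key
```

Accepting an injected client keeps the step testable without AWS credentials, since a stub with a `put_object` method can stand in for the real `boto3` client.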

## Summary

- This project makes use of various Big Data processing technologies, including:
  - Apache Spark, for its ability to process massive amounts of data, its unified analytics engine, and its convenient APIs
  - Pandas, for its convenient dataframe manipulation functions
  - Matplotlib, to plot data and gain further insights
  - AWS S3, chosen for its storage capacity and accessibility; it hosts the final formulated tables so the data can be tracked monthly
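As a flavour of the Pandas side of that stack, here is a minimal, hypothetical preprocessing snippet; the column names and cleaning rules are illustrative only and do not come from the project's datasets:

```python
import pandas as pd

# Hypothetical raw sample; columns are illustrative only.
raw = pd.DataFrame({
    "city": ["Cairo", "cairo", None, "Alexandria"],
    "temp_c": ["31.5", "31.5", "28.0", "n/a"],
})

clean = (
    raw
    .assign(temp_c=pd.to_numeric(raw["temp_c"], errors="coerce"))  # bad readings -> NaN
    .dropna()                                      # drop rows missing a city or a reading
    .assign(city=lambda d: d["city"].str.title())  # normalise casing
    .drop_duplicates()                             # "Cairo"/"cairo" collapse to one row
    .reset_index(drop=True)
)
```

Chaining the steps with `assign`/`pipe`-style calls keeps each cleaning rule visible and easy to reorder.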

## Datasets

## Data model

A star schema was chosen to host the data needed from the stated datasets; the following figure describes that schema:

*(figure: star schema diagram)*
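In miniature, a star schema splits a flat staging table into dimension tables with surrogate keys and a fact table of measures plus foreign keys. The table and column names below are hypothetical, since the README does not list the actual schema:

```python
import pandas as pd

# Hypothetical staging table; the real fact/dimension columns
# come from the project's datasets and are not shown here.
staging = pd.DataFrame({
    "city": ["Cairo", "Alexandria", "Cairo"],
    "month": ["2024-01", "2024-01", "2024-02"],
    "avg_temp_c": [19.0, 17.5, 21.0],
})

# Dimension: one row per city, with a surrogate key.
dim_city = (staging[["city"]].drop_duplicates()
            .reset_index(drop=True)
            .rename_axis("city_id").reset_index())

# Fact: measures plus a foreign key into the dimension.
fact_weather = (staging.merge(dim_city, on="city")
                [["city_id", "month", "avg_temp_c"]])
```

Keeping descriptive attributes in the dimensions and only keys and measures in the fact table is what makes the schema cheap to join and aggregate.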

- The steps for this project are as follows:
  - Extract the data from the datasets and perform data preprocessing tasks on it
  - Create the temporary staging tables to host the cleaned data
  - Use the staging tables to populate the star schema
  - Store the fact and dimension tables in Parquet form and upload them to S3

The following diagram shows the steps in visual form:

*(figure: ETL pipeline steps diagram)*
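The steps listed above can be condensed into a single sketch. This is not the project's etl.py; the function, record layout, and Parquet file names are all assumptions for illustration:

```python
import pandas as pd

def run_etl(raw_records, out_dir=None):
    """Hypothetical condensed version of the pipeline steps."""
    # 1. Extract and preprocess into a staging table.
    staging = pd.DataFrame(raw_records).dropna()
    # 2-3. Staging -> star schema; factorize assigns surrogate keys.
    keys, cities = pd.factorize(staging["city"])
    dim_city = pd.DataFrame({"city_id": range(len(cities)), "city": cities})
    fact = staging.assign(city_id=keys).drop(columns=["city"])
    tables = {"dim_city": dim_city, "fact": fact}
    # 4. Persist as Parquet (requires pyarrow); skipped when out_dir is None.
    if out_dir is not None:
        for name, df in tables.items():
            df.to_parquet(f"{out_dir}/{name}.parquet", index=False)
    return tables
```

Returning the tables and making the Parquet write optional keeps the transformation logic testable without touching disk or S3.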

## Analysis

The data analysis markdown file contains visuals and information answering some analytical questions about the data. These visuals were created with Power BI by connecting the S3 bucket to Power BI web:

*(figures: Power BI report visuals)*

## Note

The etl.py script automates the ETL process of the project, in case the data needs to be updated regularly.