Data Warehouse with AWS Redshift

Summary

A music streaming company called Sparkify wants to analyze user behavior in order to provide music recommendations and improve its services. As the company grows, its data volume is growing from gigabytes to terabytes, so it needs a robust solution to persist, query, and analyze that data.

The original data are JSON files stored in S3 buckets, and querying these files directly is cumbersome, so a better solution is needed.

To meet these requirements, I implemented a star schema on an Amazon Redshift cluster, optimized for song play analysis.

Figure: star schema of the Redshift data warehouse (star_schema_redshift)

Original Data

The original dataset is composed of two different kinds of files:

  • Song Dataset
  • Log Dataset

Song Dataset

The song dataset is in JSON format and contains metadata about a song and the artist of that song.

Below is an example of what a song file looks like:

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Log Dataset

The second dataset consists of log files in JSON format containing activity logs from the music streaming app.

Data Model

Fact Table

songplays : records in log data associated with song plays

songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables

users : users in the app

user_id, first_name, last_name, gender, level

songs : songs in music database

song_id, title, artist_id, year, duration

artists : artists in music database

artist_id, name, location, latitude, longitude

time : timestamps of records in songplays broken down into specific units

start_time, hour, day, week, month, year, weekday
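
As a concrete sketch of the fact table, a Redshift DDL string along the lines below could live in a query module. The column types, identity column, and distribution/sort keys here are illustrative assumptions, not necessarily the repo's exact DDL:

# Illustrative DDL only; the repo's actual types and keys may differ.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0,1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
)
DISTKEY (song_id)
SORTKEY (start_time);
"""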

Code

Command Line Application

Create/drop tables: create_tables.py

$ python create_tables.py
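
A script like this typically connects to the cluster, drops any existing tables, and recreates them. A minimal sketch, assuming a dwh.cfg config file holding the cluster credentials (the config keys and query lists are assumptions, not the repo's exact code):

import configparser
import psycopg2

drop_table_queries = ["DROP TABLE IF EXISTS songplays;"]  # ...one per table
create_table_queries = []  # DDL strings like the one shown above

def main():
    config = configparser.ConfigParser()
    config.read("dwh.cfg")  # assumed config file with cluster credentials

    conn = psycopg2.connect(
        host=config.get("CLUSTER", "HOST"),
        dbname=config.get("CLUSTER", "DB_NAME"),
        user=config.get("CLUSTER", "DB_USER"),
        password=config.get("CLUSTER", "DB_PASSWORD"),
        port=config.get("CLUSTER", "DB_PORT"),
    )
    cur = conn.cursor()

    # Drop first so the script is rerunnable, then create fresh tables.
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)
        conn.commit()

    conn.close()

if __name__ == "__main__":
    main()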

ETL - read files from S3 -> insert into staging tables -> query the staging tables -> transform -> insert into the final star schema tables

$ python etl.py
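
The two halves of the pipeline map onto two kinds of SQL: a COPY that bulk-loads the raw JSON from S3 into the staging tables, and INSERT ... SELECT statements that transform staged rows into the star schema. A hedged sketch follows; the bucket paths, IAM role ARN, region, and staging column names are placeholders, not the repo's values:

# Step 1: bulk-load raw JSON from S3 into a staging table.
staging_songs_copy = """
COPY staging_songs
FROM 's3://your-bucket/song_data'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-role'
FORMAT AS JSON 'auto'
REGION 'us-west-2';
"""

# Step 2: transform staged rows and insert into a final star schema table.
songplay_table_insert = """
INSERT INTO songplays (start_time, user_id, level, song_id,
                       artist_id, session_id, location, user_agent)
SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
       e.user_id, e.level, s.song_id, s.artist_id,
       e.session_id, e.location, e.user_agent
FROM staging_events e
JOIN staging_songs s
  ON e.song = s.title AND e.artist = s.artist_name
WHERE e.page = 'NextSong';
"""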

Prerequisites

The code is written in Python 3.6.3. If you don't have Python installed, you can download it from python.org. Make sure you also have an up-to-date version of pip.

To install pip run in the command Line:

$ python -m ensurepip --default-pip

To upgrade pip:

$ python -m pip install --upgrade pip setuptools wheel

Note that pip cannot upgrade Python itself; to move to a newer Python version, download and install it from python.org.

Additional package: psycopg2. You can install it using pip:

$ pip install psycopg2
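
Once psycopg2 is installed, you can verify connectivity to your cluster with a short smoke test (the host, database, and credentials below are placeholders):

import psycopg2

# All connection values are placeholders; use your own cluster endpoint.
conn = psycopg2.connect(
    host="your-cluster.abc123.us-west-2.redshift.amazonaws.com",
    dbname="dev",
    user="awsuser",
    password="your-password",
    port=5439,
)
cur = conn.cursor()
cur.execute("SELECT 1;")
print(cur.fetchone())  # prints (1,) if the cluster is reachable
conn.close()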

You also need a Redshift cluster up and running. You can read the Amazon Redshift documentation and configure your own cluster.
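
If you would rather create the cluster programmatically than through the console, a minimal boto3 sketch looks roughly like this (every identifier, credential, and ARN below is a placeholder):

import boto3

redshift = boto3.client("redshift", region_name="us-west-2")

# Placeholder values throughout; pick your own identifiers and credentials.
redshift.create_cluster(
    ClusterIdentifier="sparkify-dwh",
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="dev",
    MasterUsername="awsuser",
    MasterUserPassword="ChooseAStrongPassword1",
    IamRoles=["arn:aws:iam::123456789012:role/your-redshift-role"],
)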
