Data Engineering ND #1: Data Modeling With Postgres

Overview

Data Modeling With Postgres is the first project of the Udacity Data Engineering Nanodegree. It requires creating a Postgres database for a music streaming app and writing an ETL pipeline that transfers data from files in two local directories into this database using Python and SQL.

The project has the following goals:

  • Model a relational database with Postgres using a star schema
  • Define fact and dimension tables
  • Insert data into tables
  • Create ETL pipeline with Python

Repo Contents

The project includes a data folder with the song and log datasets, as well as the following files:

  • test.ipynb: displays the first few rows of each table to check the database.
  • create_tables.py: drops and creates tables.
  • etl.ipynb: reads and processes a single file from data/song_data and data/log_data and loads the data into tables. This notebook contains detailed instructions on the ETL process for each of the tables.
  • etl.py: reads and processes files from data/song_data and data/log_data and loads them into tables.
  • sql_queries.py: contains all SQL queries.

Database

Database Purpose

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their music streaming app. The sparkifydb database is designed to optimize queries on song play analysis.

Data Description

The data resides in two local directories: one with JSON logs of user activity on the app, and one with JSON metadata about the songs in the app.

Song Dataset data/song_data

This dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. Below is an example of what a single song file data/song_data/A/A/B/TRAABJV128F1460C49.json looks like.

{
  "num_songs": 1, 
  "artist_id": "ARIK43K1187B9AE54C", 
  "artist_latitude": null, 
  "artist_longitude": null, 
  "artist_location": "Beverly Hills, CA", 
  "artist_name": "Lionel Richie", 
  "song_id": "SOBONFF12A6D4F84D8", 
  "title": "Tonight Will Be Alright", 
  "duration": 307.3824, 
  "year": 1986
}
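
For illustration, below is a minimal sketch of how such a file could be read and split into song and artist records with pandas. The column selections mirror the JSON keys above; the actual logic lives in etl.py and etl.ipynb and may differ in details.

import pandas as pd

# Read one song file (each file holds a single JSON object on one line).
df = pd.read_json("data/song_data/A/A/B/TRAABJV128F1460C49.json", lines=True)

# Values for the songs dimension table, in an assumed column order.
song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()

# Values for the artists dimension table.
artist_data = df[["artist_id", "artist_name", "artist_location",
                  "artist_latitude", "artist_longitude"]].values[0].tolist()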

Log Dataset data/log_data

This dataset consists of log files in JSON format generated by the eventsim event simulator based on the songs in the song dataset. They simulate activity logs from a music streaming app based on specified configurations. Below is an example of what a single line of a single file data/log_data/2018/11/2018-11-09-events.json looks like.

{
  "artist":"Beastie Boys",
  "auth":"Logged In",
  "firstName":"Harper",
  "gender":"M",
  "itemInSession":2,
  "lastName":"Barrett",
  "length":161.56689,
  "level":"paid",
  "location":"New York-Newark-Jersey City, NY-NJ-PA",
  "method":"PUT",
  "page":"NextSong",
  "registration":1540685364796.0,
  "sessionId":275,
  "song":"Lighten Up",
  "status":200,
  "ts":1541722186796,
  "userAgent":"\"Mozilla\/5.0 (Windows NT 6.3; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"",
  "userId":"42"
}
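
As a rough sketch of how such log lines could be prepared for loading, the snippet below filters for NextSong events and expands the millisecond ts field into the units stored in the time table; the exact column order used by etl.py may differ.

import pandas as pd

# Read one log file (one JSON object per line).
df = pd.read_json("data/log_data/2018/11/2018-11-09-events.json", lines=True)

# Keep only song-play events.
df = df[df["page"] == "NextSong"]

# ts is a Unix timestamp in milliseconds; convert it and break it down
# into the units stored in the time table.
t = pd.to_datetime(df["ts"], unit="ms")
time_data = list(zip(df["ts"], t.dt.hour, t.dt.day, t.dt.isocalendar().week,
                     t.dt.month, t.dt.year, t.dt.weekday))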

Database Design

The sparkifydb is a Postgres database filled with data from the song and log datasets. It is designed with a star schema to optimize queries on song play analysis. The sparkifydb database contains the following tables:

  • Fact table
    • songplays: records in log data associated with song plays, i.e. records with page NextSong
  • Dimension tables
    • users: users in the app
    • songs: songs in the music database
    • artists: artists in the music database
    • time: timestamps of records in songplays broken down into specific units

The sparkifydb schema is shown in the diagram below. Primary and foreign keys are marked as PK and FK, respectively. [schema diagram]
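
To illustrate the schema, here is a hypothetical sketch of how the songplays fact table might be defined in sql_queries.py. The column set is inferred from the log fields above and is an assumption, not the project's exact DDL.

# Hypothetical entry in sql_queries.py; column names and types are assumed.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""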

Getting Started

To run the ETL pipeline locally:

  1. Clone this repo.
  2. cd into project directory.
  3. Run create_tables.py to create database and tables:
python create_tables.py

To confirm that the tables were created with the correct columns, run test.ipynb. Make sure to click Restart kernel to close the connection to the database after running this notebook.

  4. Run etl.py to fill the tables:
python etl.py

Remember to run create_tables.py before running etl.py to reset tables.
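
As an optional sanity check after etl.py finishes (an alternative to test.ipynb), a short psycopg2 snippet like the one below can count the rows in each table. The connection string is an assumption and should match whatever create_tables.py uses locally.

import psycopg2

# Assumed local connection parameters; adjust to your setup.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()
for table in ["songplays", "users", "songs", "artists", "time"]:
    cur.execute(f"SELECT COUNT(*) FROM {table};")
    print(table, cur.fetchone()[0])
conn.close()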

Authors

Alexandra Baturina
