Simple data pipeline to visualize NYC 311 data. Leverages Hive, Python, Tableau and more.

NYC 311 Data Quickstart

This README file includes instructions to replicate our process.

The final results are available here:

http://people.ischool.berkeley.edu/~samuel.goodgame/311/

Part I: Static Data Transformation with Hive

This section covers how to pull a very large .csv file out of the NYC OpenData website, impose schema on it, and manipulate it in Hive.

Environment Setup and Acquiring the Data

  1. Launch an AWS EC2 instance with Hadoop, Hive, HDFS, and PostgreSQL. The AMI ami-be0d5fd4 has everything necessary for the project. Instructions can be found here and here.

Ensure you have an EBS volume attached to the instance; you'll need it to store the data in your /data directory.

  2. Open the following TCP ports to 0.0.0.0/0 (all IP addresses; in practice, you will want to restrict this as necessary):
  • 22 (SSH)
  • 7180 (Cloudera Manager)
  • 8080 (Webserver)
  • 10000 (Hive)
  • 50070 (Hadoop)
  3. Navigate to the /data directory. If you're using an EBS volume, mount it to the /data directory to save the files you place there. (You'll need to use a command like this: sudo mount -t ext4 /your/device_location /data. You can view your attached volumes by entering the Unix commands $ df -k and $ fdisk -l.)

  4. Download the data. This data is a slightly modified version of the full dataset that does not include all of the fields (and is therefore more manageable):

$ wget -O 311_data.csv "https://data.cityofnewyork.us/api/views/dbhh-68ph/rows.csv?accessType=DOWNLOAD"

  5. Strip the headers from your data (necessary for Hive):

$ tail -n +2 311_data.csv > 311_data_no_headers.csv

  6. Change the user to w205 (if you're using the UCB AMI):

$ su - w205

  7. Make a directory in HDFS for your data:

$ hdfs dfs -mkdir /user/w205/final_data

  8. Put your new .csv file into HDFS:

$ hdfs dfs -put 311_data_no_headers.csv /user/w205/final_data

Imposing schema

  1. Enter Hive:

$ hive

  2. Impose schema on the data in HDFS by running the script found in this repository's /hive directory, hdfs_schema.sql.

Transform the data

Within Hive, run transformations that output aggregations of the total table. There are four Hive SQL scripts located within the /hive directory of this repository; run each of them to produce four separate tables:

  • avg_complaint_coords.sql
  • complaints_by_type_by_city.sql
  • least_common_complaints.sql
  • top_complaint_types.sql

While still in the Hive interpreter, enter show tables; to confirm that the tables exist. You can also query the first few lines of each table to inspect the data with a query like this:

SELECT * FROM avg_complaint_coords LIMIT 10;

Visualize the Data in Tableau

  1. Within your EC2 instance, run the following command:

$ hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10000 &

This ensures that Tableau will be able to read from your Hive tables.

  2. Download Tableau Desktop here. You can use a free trial if needed. Tableau also has free options for students.

  3. Within Tableau, connect to a new data source. Select Cloudera Hadoop as the data source type. Enter your EC2 server connection information for Server, "HiveServer2" for Type, and use whatever username you used to create the Hive tables.

  4. Once the connection is successful, enter "default" for the schema and press Enter. You should see a text field for entering tables; type one of your tables exactly as it exists in Hive (e.g., "avg_complaint_coords").

  5. Drag your table into the data window, and begin creating visualizations! You can see examples of the ones we've made on our website.

Part II: Socrata 311 API

Python libraries to pull incremental 311 data from the Socrata Open Data API

Background

For the purpose of refreshing our data on a daily basis, we decided to leverage the existing SODA Consumer API to pull the daily updates registered in the NYC 311 data set.

The goal was to abstract some of the logic needed to produce the data for our various reports into simple, callable methods, as well as to push some of the data manipulation load onto the hosting service (thus saving storage and compute power on our end).

Similarly, the abstraction allows us to further refine the logic that transforms the data pulled from the API with minimal disruption to the consumption layer of our data (for the purposes of this project, Tableau or direct function calls in Python).
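
To make that concrete, here is a minimal sketch (not part of this repository) of what pushing an aggregation onto Socrata looks like with sodapy. The dataset key 'erm2-nwe9' is the public NYC 311 dataset identifier and the field names come from that dataset's schema; treat both as assumptions rather than the package's actual configuration.

# Sketch: ask Socrata to aggregate server-side instead of pulling raw rows.
from sodapy import Socrata

client = Socrata("data.cityofnewyork.us", "YOUR_APP_TOKEN")  # placeholder token

# Count complaints by type and borough for requests created since June 2017;
# the filtering, grouping, and counting all happen on Socrata's side.
results = client.get(
    "erm2-nwe9",  # assumed NYC 311 dataset key
    select="complaint_type, borough, count(unique_key) AS num_complaints",
    where="created_date >= '2017-06-01T00:00:00'",
    group="complaint_type, borough",
    limit=5000,
)
client.close()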

Requirements

At its core, this library depends heavily on the 'sodapy' and 'requests' Python packages.

Documentation

The official Socrata API docs provide thorough documentation of the available methods; however, the set of methods written here caters exclusively to the NYC 311 dataset, which is fully documented here.

Package Structure

settings

Primarily created to host a few constants referenced throughout the package (a sketch follows the list below), including:

  • APP_TOKEN_311: API token generated when registering an app in Socrata. This is a private token currently being shared as part of the project. It will be deactivated/decommissioned after September 15th for obvious reasons!

  • APP_NYC_DATASET: this is Socrata's dataset key. Static; no need to change.

  • APP_NYC_API_DOMAIN: this is Socrata's API domain URL. Static; no need to change.
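
For reference, a minimal sketch of what this module might contain; the token below is a placeholder, and the dataset key shown is the public NYC 311 dataset identifier, which may differ from the value the package actually uses.

# settings.py (sketch) -- constants referenced throughout the package
APP_TOKEN_311 = "REPLACE_WITH_YOUR_SOCRATA_APP_TOKEN"  # register your own app token
APP_NYC_DATASET = "erm2-nwe9"                          # NYC 311 dataset key (assumed value)
APP_NYC_API_DOMAIN = "data.cityofnewyork.us"           # Socrata API domain for NYC OpenData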

etl

This module hosts a handful of methods used to extend the data pulled from the API with additional attributes needed for 'time to closure' analysis.
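
As an illustration only (the real logic lives in this module), here is a sketch of the kind of attribute such a step can add, assuming Socrata's floating-timestamp format for created_date and closed_date:

# Sketch: derive a 'time to closure' attribute from a raw API record.
from datetime import datetime

SOCRATA_TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"  # e.g. '2017-06-01T08:30:00.000'

def add_time_to_closure(record):
    """Attach days_to_closure to a record (None if the request is still open)."""
    created = datetime.strptime(record["created_date"], SOCRATA_TS_FORMAT)
    closed_raw = record.get("closed_date")
    if closed_raw:
        closed = datetime.strptime(closed_raw, SOCRATA_TS_FORMAT)
        record["days_to_closure"] = (closed - created).total_seconds() / 86400.0
    else:
        record["days_to_closure"] = None
    return record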

etl_settings

Holds a couple of simple settings needed to standardize the data before exporting (when needed).

api

  • pull_data_modified_since:

Can be used to simply pull data modified in Socrata since a given timestamp

  • pull_data_created_since

Can be used to simply pull data created in Socrata since a given timestamp (a minimal implementation sketch follows this list)

  • pull_agg_closure_statistics_created_since:

The main purpose of this method is to return the percentage closure stats for a given group key (list of fields to group by). The scope is all records created since the passed timestamp

  • pull_raw_time_to_closure_statistics_created_since_closed_only

Not widely used due to the vast amounts of data pulled (not restrictive enough). This method essentially pulls all the raw data (at the service-request level) created since a given timestamp and calculates the time-to-closure metrics.

  • pull_agg_time_to_closure_statistics_created_since_closed_only

This method calculates the average time to closure per group key for all complaints received since the desired timestamp

  • pull_full_closure_statistics_since_date

This method calculates the average time to closure per group key for all complaints received since the desired timestamp and combines it with the percentage closed ratio metrics

  • pull_full_closure_statistics_since_x_weeks

This method calculates the average time to closure per group key for all complaints received since the desired number of weeks (in the past) and combines it with the percentage closed ratio metrics

  • pull_daily_closure_statistics_since_x_weeks

This method returns groupings of the passed key (granularity) and number of days to closure; the main idea is to use it to generate histograms and dive into a specific type of complaint

  • pull_closed_within_a_day_stats_since_x_weeks

This method returns the ratio of complaints closed within a day
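
To show the pattern behind these methods, here is a minimal sketch of how something like pull_data_created_since could wrap sodapy. The created_date field comes from the NYC 311 dataset; the function signature and the limit default are illustrative assumptions, not the package's actual code.

# Sketch only: an illustrative wrapper in the spirit of pull_data_created_since.
from sodapy import Socrata
from settings import APP_TOKEN_311, APP_NYC_DATASET, APP_NYC_API_DOMAIN

def pull_data_created_since(timestamp, limit=50000):
    """Return raw 311 records created on or after the given ISO timestamp."""
    client = Socrata(APP_NYC_API_DOMAIN, APP_TOKEN_311)
    try:
        # The incremental filter is applied server-side via the SoQL $where clause.
        return client.get(
            APP_NYC_DATASET,
            where="created_date >= '{}'".format(timestamp),
            order="created_date ASC",
            limit=limit,
        )
    finally:
        client.close()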

How to run a test data pull

See the 'run_data_socrata_api.py' script for three sample calls to relevant package methods. The examples pull data and calculate relevant metrics for the 12-week window preceding call time.
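
Hypothetically, one of those calls might look like the snippet below; the argument names (group_key, weeks) are assumptions for illustration, so check the script for the exact signatures.

# Hypothetical usage sketch -- see run_data_socrata_api.py for the real calls.
import api

# Average time to closure and closure ratios per complaint type,
# over the 12 weeks preceding call time.
closure_stats = api.pull_full_closure_statistics_since_x_weeks(
    group_key=["complaint_type"],  # assumed parameter name
    weeks=12,                      # assumed parameter name
)
print(closure_stats)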

Intended use in future work

This package lays the foundation for future work on generating actionable insights from the 311 data. The recommended next steps are to leverage these methods to (1) refresh the life-to-date HDFS files used for broader Tableau reporting and (2) persist the aggregated results in a PostgreSQL instance, so as to maintain a live connection with the reporting layer (Tableau) and be able to take daily snapshots and perform time-series analysis of time to closure.
