GitHub - jimcrozier/fars

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
R		R
documentation		documentation
.gitignore		.gitignore
README		README
fars_data.Rproj		fars_data.Rproj

Repository files navigation

The FARS dataset contains information on every fatal accident in the US from 1975 - 2015. 
There are roughly 35k accidents per year so the data starts getting pretty big. 

This analysis is the beginning skelton of a project to pull down all data from the FARS dataset, add them to hdfs,
and analyze with sparklyr. 

To run, make sure that you have postgres, hdfs and hive all set with the specifications in the code, and that the 
absolute working directories are changed to yours. You will also need access to the shell, so I am not sure 
how well this will work in windows. 

1. download_data_gen_shell.R #make sure to run the shell script that it creates
2. load_data_into_postgres.R 
3. sqoop_from_postgres_into_hive_shell_gen.R #make sure to run the shell script that it creates
4. hive_concat_all_years_shell_gen.R #make sure to run the shell script that it creates
5. spark_analysis.R