Welcome to Exercise 02. This exercise provides data from the National Transportation Safety Board's database of aviation accidents. We'll ask you to perform some routine high-level analytic tasks with the data.
If you wish to submit via GitHub:
- Fork this repository to your personal GitHub account and clone the fork to your computer. If you've received this repo as a zip file, skip this step.
- Note: This does mean you will have a visible public fork of this repo on your GitHub account.
- Save and commit your answers to your fork of the repository, and push them back to your forked repository. Include code and writeup.
- Provide a link to that fork of the repository in your submission.
If you wish to submit via an emailed zip file:
- Clone this repository or download the code as a zip file (see image below)
- Extract this repo (if downloaded as zip) and work locally with the code.
- When finished, zip your project folder (code and writeup) and submit to the email you received the exercise from. Do not encrypt or add a password to the zip.
- Use open source tools and ecosystems - Python or R. Do not use proprietary tools, such as SAS, SPSS, JMP, Tableau, or Stata.
- Use the Internet as a resource to help you complete your work. We do it all the time.
- Comment your code such that a fellow data scientist who isn't familiar with this data or analysis could understand the steps you take.
- There are many ways to approach and solve the problems presented in this exercise.
- For language-specific information on how to read in XML and JSON files, use your favorite search engine.
You will be exploring the data to develop a classification of narratives and writing up a summary of the data and your results. The whole exercise should take no longer than 4 hours (self-timed).
Your code needs to perform the following tasks:
- Read and standardize the JSON files in a way that facilitates further analysis (i.e. "flatten" them and link with AviationData).
- Prepare descriptive statistics that convey an overview of the structured data.
- Perform initial exploratory analysis of the narrative text, analyzing the use of words over time.
- Use topic modeling or any other text clustering methodology to cluster/group the incidents based on the narrative text and/or probable cause descriptions. Come up with a short name for each topic or cluster you identify to make it easy to report on.
- Create a chart that you feel conveys one important relationship in the data.
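As a rough illustration of the word-usage task above, the sketch below tallies word frequencies per calendar year using only the Python standard library. It is one possible starting point, not an expected solution. The function name `word_counts_by_year`, the `EventDate` field name, and the assumption that each date string contains a four-digit year are all hypothetical; adjust them to match the columns you actually find in the data.

```python
import re
from collections import Counter, defaultdict


def word_counts_by_year(records, date_field="EventDate", text_field="narrative"):
    """Return {year: Counter of lowercase words} for a list of dict records.

    Assumptions (not confirmed by the exercise text): each record has a date
    string under `date_field` containing a four-digit year, and free text
    under `text_field`.
    """
    counts = defaultdict(Counter)
    for record in records:
        # Pull the first standalone four-digit number out of the date string.
        match = re.search(r"\b(\d{4})\b", record.get(date_field) or "")
        if not match:
            continue  # skip records without a recognizable year
        # Very simple tokenizer: runs of ASCII letters, lowercased.
        words = re.findall(r"[a-z]+", (record.get(text_field) or "").lower())
        counts[int(match.group(1))].update(words)
    return counts


# Tiny made-up sample, purely to show the call shape.
sample = [
    {"EventDate": "07/19/2015", "narrative": "The pilot reported engine failure."},
    {"EventDate": "03/02/2015", "narrative": "Engine power loss during climb."},
    {"EventDate": "2001-01-05", "narrative": "The pilot lost control on landing."},
]
counts = word_counts_by_year(sample)
print(counts[2015]["engine"])  # 2
```

From here you might drop stop words, normalize counts by the number of narratives per year, or track a handful of terms across years.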
Your writeup should do the following:
- Describe your methodology and results in 500 words or less.
- Include the chart generated as part of your write-up. Explain how the chart informs your analysis.
- You will not be penalized for going over 500 words, but it is a rough guideline of the length we expect.
- Include a chart that you feel conveys one important relationship in the data.
Additional Context:
- Assume the audience for your write-up is a non-technical stakeholder.
- Assume the audience for your code is a colleague who may need to read or modify it in the future.
There are 146 files in this repository (data/):
- AviationData.xml: This is a straight export of the database provided by the NTSB. The data has not been altered. It was retrieved by clicking "Download All (XML)" from this page on the NTSB site.
- AviationData.csv: This is a CSV version of the XML above, pre-converted to a tabular format so that it's easier to work with for this analysis. The script used for conversion is parse_xml.py.

There are 144 files in the following format:
- NarrativeData_xxx.json: These files were created by taking the EventIds from AviationData.xml and collecting two additional pieces of data for each event:
  - narrative: This is the written narrative of an incident.
  - probable_cause: If the full narrative includes a "probable cause" statement, it is in this field.
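
To make the file layout above concrete, here is a minimal sketch of how the narrative files could be flattened and linked to the tabular data by event ID. It is an illustration under stated assumptions, not the expected approach: the helper names (`load_narratives`, `load_events`, `link_events`) are hypothetical, the JSON records are assumed to sit either in a top-level list or under a `data` key, and the CSV is assumed to expose the event ID in a column named `EventId`. Inspect the files and adjust before relying on any of this.

```python
import csv
import json
from pathlib import Path


def load_narratives(data_dir):
    """Collect records from every NarrativeData_*.json file, keyed by EventId."""
    narratives = {}
    for path in sorted(Path(data_dir).glob("NarrativeData_*.json")):
        with open(path, encoding="utf-8") as handle:
            payload = json.load(handle)
        # Assumption: records are a top-level list, or a list under "data".
        records = payload["data"] if isinstance(payload, dict) else payload
        for record in records:
            narratives[record["EventId"]] = {
                "narrative": record.get("narrative", ""),
                "probable_cause": record.get("probable_cause", ""),
            }
    return narratives


def load_events(csv_path):
    """Read the tabular export into a list of dicts, one per row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def link_events(events, narratives):
    """Left-join the narrative fields onto event rows using EventId."""
    empty = {"narrative": "", "probable_cause": ""}
    return [{**event, **narratives.get(event["EventId"], empty)} for event in events]


# In-memory demonstration with made-up values (no files are read here).
demo_events = [{"EventId": "E1", "Country": "Example"}, {"EventId": "E2"}]
demo_narratives = {"E1": {"narrative": "Sample text.", "probable_cause": ""}}
demo_linked = link_events(demo_events, demo_narratives)
print(demo_linked[0]["narrative"])  # Sample text.
```

Against the real files, the call sequence would be along the lines of `link_events(load_events("data/AviationData.csv"), load_narratives("data"))`. Events with no matching narrative keep their row and get empty strings, so it is easy to count how many failed to link.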