Repository for storing data cleaning, processing, expansion, and summarization scripts for a survey conducted at SDIA in the Fall of 2025.
The data manipulation procedures are stored in a series of Jupyter notebooks (with a single R notebook) in the `/notebooks` directory. The data cleaning and processing is aided by a Pydantic data model, which is stored in the `data_model` folder. MkDocs is used to create API documentation for the data model, which provides a quick way to query the data model and serves as the data dictionary. The `docs` directory contains the markdown files used in the API documentation website, which is configured via the `mkdocs.yml` file and built by GitHub Actions (see `/.github/workflows/docs.yml`).
- The Python environment used in the GitHub Actions workflow that creates the API documentation website is specified in the `/.github/workflows/docs.yml` file, with the requirements defined in the `docs_requirements.txt` file.
- The Python environment used to run the Jupyter notebooks is Python 3.12, with the requirements defined in the `requirements.txt` file. See the `create-env.bat` file for the commands needed to create a virtual environment to run the notebooks. If you are using VS Code and cannot find the kernel in the list of options, try running the `add_kernal_to_jupyter.bat` instructions and then restarting VS Code.
- The R environment used to run the single R notebook is RStudio 2023.12.1+402. The notebook uses only the `tidyverse` and `yaml` third-party libraries.
The steps to process the data are as follows:
This Python notebook starts with the "raw" data delivered by the field team. The raw data is expected to be stored in the `/data/external/etc` folder as Excel files. These files are not checked into GitHub. This notebook engages in a bit of hand-to-hand combat to clean up the data and join survey data collected at different stages of the process.
A dictionary is used in this script to rename variables (see `/data/processed/revised_names.csv`). This moves the data from the variable names used by the field team, which are useful for keeping track of question sequencing, to the variable names we want in the final data product.
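If it helps to picture this step, a minimal sketch of the renaming is shown below; the column names in `revised_names.csv` are assumed here (`old_name`, `new_name`) and may differ from the actual file.

```python
import pandas as pd

def apply_revised_names(survey_df: pd.DataFrame) -> pd.DataFrame:
    """Rename field-team variable names to the final data product names.

    Assumes revised_names.csv has two columns, called old_name and new_name here;
    the actual column names in the repository's file may differ.
    """
    name_map_df = pd.read_csv("data/processed/revised_names.csv")
    name_map = dict(zip(name_map_df["old_name"], name_map_df["new_name"]))
    return survey_df.rename(columns=name_map)
```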
The data model expects the data to be serialized in a specific manner. Specifically, it expects the data to be a list of persons who have made trips to the airport. In the data model, each `Respondent` has a `Trip`. This notebook serializes the data with this code:
```python
respondent_list = add_list_objects(
    trips_df.to_dict(orient="records"),    # child list
    "respondentid",                        # child key
    persons_df.to_dict(orient="records"),  # parent list
    "respondentid",                        # parent key
    "trip",                                # parent var
)
```
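For readers who want a mental model of what this call does, below is a minimal sketch of a helper like `add_list_objects`. The actual implementation lives in this repository; the behavior shown here (attaching matching child records as a list) is an assumption.

```python
from typing import Any

def add_list_objects(
    child_list: list[dict[str, Any]],
    child_key: str,
    parent_list: list[dict[str, Any]],
    parent_key: str,
    parent_var: str,
) -> list[dict[str, Any]]:
    """Sketch only: nest each child record (e.g., a trip) under its parent record
    (e.g., a respondent) as parent_var, matching child_key to parent_key.
    The real helper may attach a single object rather than a list.
    """
    for parent in parent_list:
        parent[parent_var] = [
            child for child in child_list
            if child[child_key] == parent[parent_key]
        ]
    return parent_list
```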
The Python Pydantic package makes validating data with a data model straightforward. The code that does this looks like this:
```python
import pandas as pd
from pydantic import ValidationError

# Employee, AirPassenger, Respondent, and the enumerations module (referenced
# here as `e`) come from this repository's data model (see the data_model folder).

employee_list = []
air_passenger_list = []
other_list = []
failed_records = []

for respondent in respondent_list:
    market_segment = respondent["marketsegment"]
    try:
        if market_segment == e.Type.EMPLOYEE:
            ev = Employee(**respondent)
            employee_list.append(ev)
        elif market_segment == e.Type.PASSENGER:
            av = AirPassenger(**respondent)
            air_passenger_list.append(av)
        else:
            rv = Respondent(**respondent)
            other_list.append(rv)
    except ValidationError as err:
        respondent["error_flag"] = "failed"
        respondent["error_message"] = str(err)
        failed_records.append(respondent)

failed_df = pd.DataFrame(failed_records)
failed_df.head()
```
This code first determines whether the relevant record is an `Employee` or an `AirPassenger` (both of which are based on the `Respondent` data model class) and then instantiates the relevant data model class. This instantiation does the validation. If the validation is unsuccessful, the "failed" records are stored in a separate data frame.
The sampling frame of the survey was passengers departing the airport. To augment the data, we create synthetic records of travelers arriving at the airport using data from the survey. The method for doing this is stored in `/data_model/utils.py`.
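The actual method lives in `/data_model/utils.py` and is not reproduced here. As a loose illustration only, one way to synthesize arriving-traveler records from the departing-traveler survey records is to mirror each surveyed record; every field name and transformation below is an assumption.

```python
import copy

def make_synthetic_arrivals(departing_records: list[dict]) -> list[dict]:
    """Illustrative sketch only; see data_model/utils.py for the real logic."""
    synthetic = []
    for record in departing_records:
        mirrored = copy.deepcopy(record)
        mirrored["direction"] = "arriving"  # hypothetical field
        mirrored["is_synthetic"] = True     # hypothetical flag
        synthetic.append(mirrored)
    return synthetic
```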
The rest of the steps in the notebook "un-serialize" the data to create the expected flat file, use the enumerated variables stored in `enums.py` to label the data, and give each record a unique identifier. The outcome of this process is a csv file called `data_model_output.csv`, stored in the `/data/processed` directory.
The survey expansion is done in R. It starts with the `data_model_output.csv` file and uses the following files (illustrative examples are sketched after this list):

a. A `yaml` configuration file. See `/data/interim/expansion_config_*.yml`.

b. A `csv` controls file. See `/data/interim/expansion_controls_*.csv`. This file is specified in the configuration file. The controls are the expansion targets.

c. A `csv` controls expression file. See `/data/interim/expansion_controls_expressions_*.csv`. This file is specified in the configuration file. The expressions show how to compute the controls from the survey data.
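Purely for illustration, the sketch below shows how the three files relate to one another when loaded in Python. None of the keys, column names, or values are taken from the actual files; they are all made up for this example.

```python
import io

import pandas as pd
import yaml

# Hypothetical expansion configuration: points to the controls and expressions files.
config = yaml.safe_load("""
controls_file: expansion_controls_departing_only.csv
expressions_file: expansion_controls_expressions_departing_only.csv
""")

# Hypothetical controls file: each row is an expansion target.
controls = pd.read_csv(io.StringIO("""control_name,target
departing_air_passengers,12000
employees,3500
"""))

# Hypothetical expressions file: how each control is computed from the survey data.
expressions = pd.read_csv(io.StringIO("""control_name,expression
departing_air_passengers,marketsegment == 'PASSENGER'
employees,marketsegment == 'EMPLOYEE'
"""))
```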
Configuration and control files are in place for two expansions: one using only departing passengers and employees; and one using departing passengers, arriving passengers (including synthetic records), and employees. The relevant files are labeled as either `departing_only` or `departing_and_arriving`.
The expansion uses an iterative optimization methodology to fit each control by adjusting the record weights.
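Conceptually, this is similar to iterative proportional fitting (raking). The toy sketch below, written in Python rather than the R used by the notebook, is only meant to convey the idea and is not the implementation used here.

```python
import pandas as pd

def rake_weights(df: pd.DataFrame, controls: dict[str, float], n_passes: int = 50) -> pd.Series:
    """Toy raking sketch, not the notebook's algorithm.

    `controls` maps an indicator column (1 if the record counts toward the
    control, 0 otherwise) to its target total. Weights are scaled toward each
    target in turn for a fixed number of passes.
    """
    weights = pd.Series(1.0, index=df.index)
    for _ in range(n_passes):
        for column, target in controls.items():
            current = (weights * df[column]).sum()
            if current > 0:
                weights = weights.where(df[column] == 0, weights * (target / current))
    return weights
```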
Users can toggle the expansion by adjusting the configuration file in this line of code:
```r
config_file <- paste0(interim_dir, "expansion_config_departing_and_arriving.yaml")
```
Please also note that the working directory is set in the code to make the procedures robust against different types of execution (e.g., running line by line, "knitting" to create an HTML record of the script). Users will need to adjust the working directory prior to executing the code locally.
The script generates a variety of diagnostic output, as well as a `csv` file that contains the expansion weight for each record, with a separate file for each expansion approach.
This notebook creates lightly formatted variable summaries in Microsoft Word format. It joins the data model output from step 1 with the weights generated in step 2. The summaries currently use the departing-only set of weights.
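A hedged sketch of that join is below; the identifier and weight column names are assumptions, and the notebook may do this differently.

```python
import pandas as pd

def join_weights(data_model_output: pd.DataFrame, weights: pd.DataFrame) -> pd.DataFrame:
    """Attach the departing-only expansion weight to each survey record.

    Assumes both tables share an identifier column (called unique_id here) and
    that the weights table has a column called weight; the real names may differ.
    """
    return data_model_output.merge(
        weights[["unique_id", "weight"]], on="unique_id", how="left"
    )
```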
This is a utility notebook used for doing ad hoc investigation of mode choice outcomes across the survey period. It is not part of the core data processing workflow.
This is a utility notebook used to compare the outcomes from the pilot phase of the survey and the non-pilot phase of the survey.
The data model can be examined and queried via the API documentation. For many users, this will be the preferred way to examine variable definitions and response options. Others may prefer a more conventional spreadsheet. This script creates a Microsoft Excel spreadsheet from the data model. This allows users to examine a data dictionary in Excel, while maintaining a single source of truth (the data model). See `/reports/data_dictionary.xlsx`.
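A minimal sketch of how a data dictionary table can be built from a Pydantic (v2) model is shown below; this is not the repository's script, and the columns chosen here are arbitrary.

```python
import pandas as pd
from pydantic import BaseModel

def data_dictionary(model: type[BaseModel]) -> pd.DataFrame:
    """Build a simple data dictionary from a Pydantic v2 model's field metadata."""
    rows = [
        {
            "variable": name,
            "type": str(field.annotation),
            "description": field.description,
        }
        for name, field in model.model_fields.items()
    ]
    return pd.DataFrame(rows)

# Example use (requires openpyxl for Excel output):
# data_dictionary(Respondent).to_excel("reports/data_dictionary.xlsx", index=False)
```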
Users may also want to view the data dictionary as a Microsoft Word file. Having a Word version is also useful for static reports. This script renders the data dictionary created in the previous step as a Word document. See `/reports/data_dictonary.docx`.
There is no perfect way to archive the survey instrument, as it is a web-based instrument with skip logic. Screenshots of the questionnaire have been stored. In this script, an Excel version of the questionnaire is rendered as a Word document for easier review.