Summarize and organize comments submitted by the public on a newly proposed rule so that the OHS content team can more efficiently formulate informed responses to the rule and to the public's feedback.
Our solution uses OpenAI's GPT 4.0-32k model to provide a first-pass thematic grouping of all comments, reducing the amount of work that needs to be done by the OHS team and improving the accuracy of those thematic groups compared to the previous method of keyword searching. We then used ChatGPT to prepare a first-draft summary of all comments tagged to a given topic. All references in this document to "ChatGPT" should be interpreted as "OpenAI's GPT 4.0-32k model." From scoping to delivery, this project took 13 weeks (10 weeks of research and build time; 3 weeks of analyzing the full set of comments).
Manual processes and things to consider for the future:
- We were not sure which OpenAI GPT model would be the best fit for this solution. To determine this, we considered the token implications of each model, the location availability within Azure, and the quality of responses from GPT 3.5 vs GPT 4.0.
- Tokens are measured in Azure in two ways: the number of tokens sent at once to a model (a model-specific limit that exists at the model-instance level), and the number of tokens sent over the course of one minute (a quota determined by Azure that is also model-specific but exists at the model level, not the instance level). Once we decided on GPT 4.0, we noticed that sending our prompt along with one text chunk and receiving a response took about 30 seconds. We did not need to worry about the "all-at-once" limit, because we were sending well under the 32,000 tokens-at-once limit, so we focused on the per-minute quota. We ended up creating nine model instances and splitting the per-minute quota across those nine deployments; a rough token-budget sketch follows this list. Managing these token quotas while also balancing pipeline efficiency is something that should be considered in future iterations of this work.
- We decided on GPT 4.0 after running a side-by-side comparison of a GPT 3.5 model and the GPT 4.0 model. We took our prompt, ran the same set of text chunks through both GPT 4.0 and GPT 3.5, and presented the results to OHS. This process let them see the differences in response quality between the two models, and OHS ultimately decided that GPT 4.0 was better for their use case.
- This 10-week build timeline also included a week that was dedicated to a pipeline test run. We used a previous OHS rule (with an adjusted prompt) and ran through our entire pipeline process to understand how manual portions would be incorporated, to bring issues in our code to light, and to get a better sense of how long each portion of the pipeline would take. We would highly recommend this step of the process be included in the future. It was integral to our success in the short analysis timeline that we had after the comments for the 2023 rule were made available.
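To make the quota reasoning in the bullets above concrete, here is a rough token-budget sketch. It uses the tiktoken package; the quota, throughput target, and prompt/chunk contents are placeholders, not the project's actual values.

```python
# Rough token-budget arithmetic for sizing Azure OpenAI deployments.
# All numbers below are illustrative placeholders, not the project's actual quotas.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

PROMPT = "...(the thematic-grouping prompt)..."
chunk_text = "...(one 800-word text chunk)..."

tokens_per_request = len(enc.encode(PROMPT)) + len(enc.encode(chunk_text))
assert tokens_per_request < 32_000          # stays under the per-request (context) limit

# With ~30 seconds per round trip, throughput is limited by the per-minute quota.
TOKENS_PER_MINUTE_QUOTA = 40_000            # placeholder Azure quota per deployment
requests_per_minute_needed = 20             # placeholder throughput target
tokens_per_minute_needed = requests_per_minute_needed * tokens_per_request

deployments_needed = -(-tokens_per_minute_needed // TOKENS_PER_MINUTE_QUOTA)  # ceiling division
print(f"~{deployments_needed} deployments needed to sustain "
      f"{requests_per_minute_needed} requests/minute")
```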
The public comments and attachments are collected through the Federal Docket Management System (FDMS) and are downloaded as a zip file containing HTML inline comments and attachments in various file formats. The pipeline scraped all inline comments, converted all attachments, and created a DataFrame with all of their text. We first downloaded the FDMS comments and attachments as a zip file (ACF-2023-0011 2024-01-20 03-01-58.z1p); the file name needs to be modified so the file becomes a zip file and can be unzipped. We then ran the Jupyter notebook 'move_files_to_subfolder.ipynb' to sort the extracted files into separate folders: PDF, Docx, PPTX, RTF, and TXT files are each moved into their own folder, and the remaining files are moved to an 'Other' folder. Next, all of these attachment files needed to be converted into machine-readable text. For all file types other than PDF, Python code was effective at converting the document to text, but we needed a better way to convert PDFs, so we chose to first convert the PDFs to Docx and then use Python to convert the Docx documents. The Jupyter notebook 'clean_raw_data.ipynb' converts all of the different file types, including the Docx files converted from PDFs, into a pandas DataFrame saved as the file '2023_scrape.pkl'. This file is then uploaded to the AWS virtual machine (ohs-vm-aws) for further processing.
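As a rough illustration of this ingest step (not the exact contents of the notebooks), the sketch below sorts extracted attachments into per-extension folders and converts the readily machine-readable formats to text in a pandas DataFrame. Paths, folder names, and column names are assumptions.

```python
# Sketch of the ingest step: sort extracted attachments by extension, convert the
# machine-readable formats to text, and collect everything into a DataFrame.
# Paths, folder names, and column names are illustrative.
from pathlib import Path
import shutil

import pandas as pd
from docx import Document  # python-docx

RAW_DIR = Path("attachments_raw")
SORTED_DIR = Path("attachments_sorted")
KNOWN_TYPES = {".pdf", ".docx", ".pptx", ".rtf", ".txt"}

# 1) Move each file into a subfolder named after its extension ("Other" otherwise).
for path in RAW_DIR.iterdir():
    if not path.is_file():
        continue
    ext = path.suffix.lower()
    subfolder = ext.lstrip(".") if ext in KNOWN_TYPES else "Other"
    dest = SORTED_DIR / subfolder
    dest.mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), str(dest / path.name))

# 2) Convert Docx and TXT files to text (PDFs are first converted to Docx
#    separately, then picked up here as .docx files).
def docx_to_text(path: Path) -> str:
    return "\n".join(p.text for p in Document(str(path)).paragraphs)

rows = []
for path in (SORTED_DIR / "docx").glob("*.docx"):
    rows.append({"file_name": path.name, "attachment_text": docx_to_text(path)})
for path in (SORTED_DIR / "txt").glob("*.txt"):
    rows.append({"file_name": path.name, "attachment_text": path.read_text(errors="ignore")})

pd.DataFrame(rows).to_pickle("2023_scrape.pkl")
```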
Manual processes and things to consider for the future:
- We explored using OCR packages in Python to convert PDFs to Word, but found they performed poorly at recognizing line breaks. We also considered using ChatGPT's OCR capabilities to convert documents, but this seemed an unnecessary use of tokens when free options were available. A team member already had an Adobe Acrobat license, which met performance expectations without increasing project cost. This required downloading the folder of PDF documents, converting them locally, and re-uploading them. Adobe Acrobat successfully converted 196 out of 199 PDFs. An additional 16 documents, which were flagged by content filters, were modified manually to remove excessive filler content (footnotes, signatures, charts). The original versions were overwritten by the modified versions in the upload.
For each comment number in our dataset, the corresponding attachments were converted to a text string and added to a pandas DataFrame, with one row per attachment. The rows were then evaluated to flag comments from entities identified by the Office of Head Start (OHS) for early review. Head Start Association commenters were flagged using a regular expression search for "Head Start" and "Association", and we manually reviewed the list of distinct 'Government Agency' entries to flag those in the federal government. The resulting pandas DataFrame was saved as a pickle file called '2023_scrape.pkl', and the "important commenter" data was saved separately to share with OHS so they could start reviewing those comments as soon as possible.
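A minimal sketch of that flagging logic, assuming illustrative column names ("comment_text", "organization") and a placeholder agency list:

```python
# Sketch: flag "important commenter" rows for early OHS review.
# Column names ("comment_text", "organization") are assumptions for illustration.
import pandas as pd

df = pd.read_pickle("2023_scrape.pkl")

# Head Start Association commenters: both phrases must appear somewhere in the text.
has_head_start = df["comment_text"].str.contains("head start", case=False, na=False)
has_association = df["comment_text"].str.contains("association", case=False, na=False)
df["is_head_start_assoc"] = has_head_start & has_association

# Federal government commenters: hand-curated from the distinct 'Government Agency'
# entries (the set below is a placeholder).
federal_agencies = {"Administration for Children and Families"}  # placeholder
df["is_federal_gov"] = df["organization"].isin(federal_agencies)

# Share flagged rows with OHS for early review.
df[df["is_head_start_assoc"] | df["is_federal_gov"]].to_excel(
    "important_commenters.xlsx", index=False
)
```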
Next, the text was cleaned and formatted for ChatGPT. Duplicate comments and attachments were removed, and duplicate comment/attachment counts were created. The text was cleaned of foreign characters and non-semantic content to reduce token counts. Once duplicates were removed, any rows with no comment or attachment text were dropped. Non-English comments were identified and translated via ChatGPT. Comment and attachment text were combined and chunked into sections of 800 words to stay within the context (token) limit of the selected model. The final chunked DataFrame was saved, along with logging documents.
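The chunking step can be sketched roughly as follows; the 800-word size comes from the text above, while the column names and file names are assumptions:

```python
# Sketch: combine comment and attachment text, drop empty rows, and split into
# 800-word chunks so each request stays well inside the model's context limit.
# Column and file names are illustrative.
import pandas as pd

CHUNK_WORDS = 800

def chunk_text(text: str, size: int = CHUNK_WORDS) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

df = pd.read_pickle("2023_scrape.pkl")
df["full_text"] = (
    df["comment_text"].fillna("") + "\n" + df["attachment_text"].fillna("")
).str.strip()
df = df[df["full_text"] != ""]          # drop rows with no comment or attachment text

chunked = (
    df.assign(chunk=df["full_text"].apply(chunk_text))
      .explode("chunk")
      .reset_index(drop=True)
)
chunked["chunk_number"] = chunked.groupby("comment_id").cumcount()
chunked.to_pickle("2023_chunked.pkl")
```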
Manual processes and things to consider for the future:
- Government commenters were initially flagged by importing a list of all representatives and checking for exact matches; however, this failed to identify any matches. Instead, we manually reviewed the list of distinct 'Government Agency' entries to flag those in the federal government.
- The aforementioned data loading and cleaning code was run several times as lower-quality attachments were identified and modified. Lower-quality attachments were flagged as "do not run" by comparing the attachment text against a series of quality thresholds, which were predetermined from testing on 2015 data. The flagged attachments were exported and evaluated by the team. Obvious data-ingest issues (primarily excessive line breaks) or excessive filler content were addressed, and modified versions of the documents were re-uploaded (the 16 documents referred to above). The cleaning code was then rerun, and the updated flagged attachments were sent to the client for review. This list of attachments also included comments with multiple attachments, because our review found that most comments with multiple attachments contained reference materials rather than actionable comments. The adjudications from the client were turned into a list and added to the code to override the "do not run" flags. The final manual data cleaning step focused on identifying attachments that were near-duplicates rather than perfect duplicates. The longest attachments were often written and submitted by multiple commenters, with slight differences in introductions and signatories; these small differences meant that the attachments weren't being picked up by the deduplication code. The longest documents (greater than 16,000 words) were flagged and manually reviewed for obvious missed duplicates. We limited the review to these documents because they would have the largest impact on token usage. The preferred version of each duplicate was identified, and the other version was modified so it would be removed by the deduplication code. In the future, this last step could be automated with a fuzzy-matching algorithm; a sketch follows this list.
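For reference, a future automated pass at the near-duplicate step could look something like the sketch below, which uses the standard library's difflib to score similarity between the longest attachments. The 16,000-word cutoff comes from the bullet above; the similarity threshold and column names are assumptions.

```python
# Sketch: flag near-duplicate long attachments with a fuzzy similarity score so a
# reviewer can confirm which version to keep. Slow for very long texts; a library
# such as rapidfuzz would be faster. Threshold and column names are illustrative.
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd

df = pd.read_pickle("2023_scrape.pkl")
long_docs = df[df["attachment_text"].str.split().str.len() > 16_000]

near_dupes = []
for (i, a), (j, b) in combinations(long_docs["attachment_text"].items(), 2):
    score = SequenceMatcher(None, a, b).ratio()   # 0.0-1.0 similarity
    if score > 0.95:                              # placeholder threshold
        near_dupes.append((i, j, round(score, 3)))

print(near_dupes)   # candidate pairs for manual confirmation before deduplication
```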
Our goal was to send all text chunks to ChatGPT with the same prompt and store ChatGPT's responses in a programmatic way. Because we were unsure about the number of comments we would be receiving, we decided to parallelize our API calls to improve the efficiency of our pipeline. The gpt_parallel.py script parallelizes our API calls across ten distinct but identical OpenAI GPT 4.0 model deployments and stores each individual response from ChatGPT in its own JSON file. Doing this ensures that any pipeline interruptions do not require us to rerun text chunks that we already have responses for, and it allows us to easily access the saved responses programmatically. After all the JSON files are created, gpt_parallel.py loads each JSON file and uses pandas to create a DataFrame consisting of one row per segment per topic returned by ChatGPT.
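A condensed sketch of that approach, using the openai package's AzureOpenAI client; the deployment names, environment variables, and file layout are placeholders rather than the exact contents of gpt_parallel.py:

```python
# Sketch of parallelized topic tagging: round-robin chunks across several identical
# deployments and cache each response as its own JSON file so that reruns skip
# chunks that already have answers. Deployment names and env vars are placeholders.
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import pandas as pd
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)
DEPLOYMENTS = [f"gpt4-32k-{i}" for i in range(10)]   # placeholder deployment names
PROMPT = "...(topic-tagging prompt)..."
OUT_DIR = Path("gpt_responses")
OUT_DIR.mkdir(exist_ok=True)

def tag_chunk(args):
    idx, chunk = args
    out_file = OUT_DIR / f"chunk_{idx}.json"
    if out_file.exists():                        # answered on a previous run
        return
    response = client.chat.completions.create(
        model=DEPLOYMENTS[idx % len(DEPLOYMENTS)],   # spread load across deployments
        messages=[{"role": "system", "content": PROMPT},
                  {"role": "user", "content": chunk}],
        temperature=0,
    )
    out_file.write_text(response.choices[0].message.content)

chunks = pd.read_pickle("2023_chunked.pkl")["chunk"]
with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(tag_chunk, enumerate(chunks)))
```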
Manual processes and things to consider for the future:
- ChatGPT prompt and topic list: The most manual and iterative part of this project was creating a ChatGPT prompt and topic list that would provide responses that were accurate and useful for OHS. We researched ChatGPT prompt-engineering best practices and developed a prompt that broke each text chunk into segments and, for each segment of text, assigned a list of relevant topics and an intent or list of intents (concern, agreement, question). The list of topics provided to ChatGPT was developed in tandem with OHS's policy team and aligned with the subjects found in the table of contents of the proposed rule. Before deciding on the final prompt and topic list, we went through multiple iterations of validation data in which we randomly selected 10-30 early comments and ran all of their text chunks through ChatGPT with the current prompt. The results were presented to OHS along with an analysis indicating how often topics occurred together, alone, or in groups of three or more. This analysis was meant to help OHS and the Data Surge Team understand which topics could be reworded to be either more or less unique, depending on how well each topic was performing.
- Triage ChatGPT errors: One issue we encountered early in our prompt-engineering process was that, even when explicitly asked to return its response as valid JSON, ChatGPT sometimes would not, so we built the code in gpt_parallel.py to be robust to errors within the JSON files as well as any OpenAI errors. gpt_parallel.py creates a DataFrame of text chunks whose JSON files are not valid and resends those text chunks to ChatGPT; rerunning these chunks resulted in valid JSON files being created (a sketch of this validate-and-retry pattern follows this list). The only issue we ran into during production that we had not already experienced was that one of our text chunks triggered OpenAI's "content filter," which is meant to catch prompts that are malicious or inappropriate. After reviewing the text chunk and researching similar issues online, it appears this was a common bug for ChatGPT 4.0, so we proceeded by including this text chunk as a "missing segment."
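The validate-and-retry pattern can be sketched as below. The JSON shape shown in the comment is illustrative of the segment/topic/intent structure, not the exact production schema, and the file layout matches the earlier sketch rather than gpt_parallel.py itself.

```python
# Sketch: check each saved response for valid JSON and queue invalid ones for a rerun.
# Illustrative expected shape (not the exact production schema):
#   [{"segment": "...text...", "topics": ["Eligibility"], "intents": ["concern"]}, ...]
import json
from pathlib import Path

invalid_chunks = []
for out_file in Path("gpt_responses").glob("chunk_*.json"):
    try:
        parsed = json.loads(out_file.read_text())
        assert isinstance(parsed, list)          # minimal structural check
    except (json.JSONDecodeError, AssertionError):
        invalid_chunks.append(int(out_file.stem.split("_")[1]))
        out_file.unlink()                        # delete so the rerun recreates it

# These chunk indices are then fed back through the same tagging function; in our
# experience one or two reruns were enough to obtain valid JSON for every chunk.
print(f"{len(invalid_chunks)} chunks queued for rerun: {sorted(invalid_chunks)}")
```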
This DataFrame of responses is then compared to the original chunked DataFrame to determine whether there are any chunks of text that ChatGPT did not return with a topic. We wanted to ensure that the final dataset provided to OHS included all text from every comment and attachment, so "segment_return.ipynb" runs a comparison of the segments returned from ChatGPT and the chunks of text created in our data-processing step. Any text that the script finds in the chunked text but not in the returned segment text is added to the final DataFrame and categorized as a "Missing Segment." The majority of those segments are text that shouldn't be categorized (typos/gibberish, URLs, signatures, etc.), but it was important that all the text from the original comments was represented in the final product.
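A simplified approximation of that comparison, assuming the returned segments and original chunks are available as DataFrames with the column names shown (the real notebook also involves manual validation):

```python
# Simplified sketch: find original text that was never returned as a tagged segment
# and append it as a "Missing Segment" row. Column and file names are illustrative,
# and the sentence-level comparison is an approximation of the notebook's logic.
import pandas as pd

chunks = pd.read_pickle("2023_chunked.pkl")       # one row per chunk of input text
segments = pd.read_pickle("gpt_segments.pkl")     # one row per segment per topic

returned = list(segments["segment_text"].str.strip())

def missing_portion(chunk_text: str) -> str:
    # Keep any sentence of the chunk that never appears inside a returned segment.
    return " ".join(
        s for s in chunk_text.split(". ")
        if s.strip() and not any(s in seg for seg in returned)
    )

missing = chunks.assign(segment_text=chunks["chunk"].apply(missing_portion))
missing = missing[missing["segment_text"] != ""]
missing["topic"] = "Missing Segment"

final = pd.concat(
    [segments, missing[["comment_id", "segment_text", "topic"]]],
    ignore_index=True,
)
```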
The ChatGPT results were then combined with these missing segments and reformatted so that one row represented a topic, per segment, per chunk, per attachment. In conversations with the client, we identified two methods of identifying topics other than using ChatGPT. A couple of terms were being over-identified by ChatGPT, so we removed them from the topic list and identified them using exact matches only: if the exact term appeared in a segment, we added the corresponding topic as a new row if the topic had not already been identified. We also used a bill number / topic crosswalk to identify topics when a bill number (table of contents number) was referenced. For bill numbers that corresponded to only one topic, we added the corresponding topic as a row whenever the bill number appeared and the topic had not already been identified. A "topic_origin" field recorded whether each topic came from ChatGPT, exact matching, or bill numbers.
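The two rule-based tagging passes can be sketched as follows; the example term list and bill-number crosswalk are placeholders, since the real lists were defined with OHS:

```python
# Sketch: add topic rows from exact term matches and a bill-number crosswalk, but
# only when the topic is not already assigned to that segment. The example mappings
# are placeholders; the real lists were defined with OHS.
import pandas as pd

EXACT_TERMS = {"teacher wages": "Staff Compensation"}       # placeholder mapping
BILL_NUMBER_TOPICS = {"1302.90": "Personnel Policies"}      # placeholder crosswalk

df = pd.read_pickle("final_segments.pkl")   # one row per segment per topic
new_rows = []
for seg_id, group in df.groupby("segment_id"):
    text = group["segment_text"].iloc[0].lower()
    existing = set(group["topic"])
    for term, topic in EXACT_TERMS.items():
        if term in text and topic not in existing:
            new_rows.append({"segment_id": seg_id, "segment_text": group["segment_text"].iloc[0],
                             "topic": topic, "topic_origin": "exact match"})
    for bill_no, topic in BILL_NUMBER_TOPICS.items():
        if bill_no in text and topic not in existing:
            new_rows.append({"segment_id": seg_id, "segment_text": group["segment_text"].iloc[0],
                             "topic": topic, "topic_origin": "bill number"})

df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
```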
Once all of the topics were identified, any ChatGPT-created topics were dropped if an original topic had been identified for the segment. The code then rolled up any consecutive segments with the same topic. This was helpful for the client because it reduced the number of rows by combining text segments per topic and made it easier for the reviewer to read the whole relevant segment of text at once. The new 'segment_number' reflected the starting and ending segment index so readers could tell which segments had been combined. The results were finalized by merging in metadata from before the ChatGPT portion of the pipeline, reordering the columns, and dropping extraneous information. The results were saved and sent to the client.
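The consecutive-segment roll-up is essentially a gaps-and-islands grouping. One way to sketch it in pandas, with assumed column names:

```python
# Sketch: combine consecutive segments of the same comment and topic into one row,
# keeping the starting and ending segment numbers. Column names are illustrative.
import pandas as pd

df = pd.read_pickle("final_segments.pkl").sort_values(
    ["comment_id", "topic", "segment_number"]
)

# A new "run" starts whenever the segment number is not exactly one more than the
# previous segment for the same comment and topic.
new_run = df.groupby(["comment_id", "topic"])["segment_number"].diff().ne(1)
df["run_id"] = new_run.cumsum()

rolled_up = (
    df.groupby(["comment_id", "topic", "run_id"], as_index=False)
      .agg(segment_text=("segment_text", " ".join),
           segment_start=("segment_number", "min"),
           segment_end=("segment_number", "max"))
)
rolled_up["segment_number"] = (
    rolled_up["segment_start"].astype(str) + "-" + rolled_up["segment_end"].astype(str)
)
```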
Manual processes and things to consider for the future:
- The "segment_return.ipynb" notebook includes some manual portions where the programmer needs to validate the results of the script.
- We worked closely with OHS to identify which exact terms should be used within our exact term search code. We also worked with them to identify which topics we wanted mapped to which bill numbers.
- For bill numbers that corresponded to multiple topics, we only added these topics if the segment was not categorized as one of the original topics (e.g., the segment was categorized to a ChatGPT-generated topic, or the segment was an uncategorized missing segment). This decision was made because these multi-topic bill numbers would likely generate over-tagging, and we wanted to balance over-categorizing against under-categorizing. After deliberation, the client communicated that they would like fewer topics included in the bill-number tagging, because under-categorizing was less of an issue than expected, and the over-categorizing caused by bill numbers created more administrative hassle. In the future, bill numbers could be used as a double check instead.
After the final Excel file was created, we used the script "create_summaries.py" to group the DataFrame by topic and, for each topic, send all tagged segments back to ChatGPT asking it to summarize the comments for that particular topic. One limitation of the OpenAI models is that they limit the number of tokens (word portions) sent to them. These limits appear in two forms: first, there is a limit to the number of tokens you can send to a model at once; second, there is a limit to the number of tokens you can send to the model over the course of one minute. These token "quotas" can be adjusted in Azure. To run our text chunks through ChatGPT for topic analysis, we spread our quota across our ten different models; for this summarization step, we reshuffled the quota to put all of our available quota onto one model so we could fit as many segments as possible into our summary prompt. The "create_summaries.py" script groups the final DataFrame by topic and then counts the total number of tokens across all segments for each topic. Any topic whose total token count is under the token limit has all of its relevant segments sent to ChatGPT for summarization. For topics whose total token count is larger than the limit, we randomly select comments until we reach the token limit, and those are the segments sent to ChatGPT.
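A condensed sketch of that sampling logic, randomizing at the comment (document) level as described in the considerations below; the token budget, file names, and column names are placeholders:

```python
# Sketch: for each topic, send every tagged segment when it fits in the context
# window; otherwise randomly sample whole comments until the token budget is used.
# The budget, file names, and column names are placeholders.
import pandas as pd
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
TOKEN_BUDGET = 28_000      # placeholder: leave headroom for the prompt and response

def count_tokens(texts) -> int:
    return sum(len(enc.encode(str(t))) for t in texts)

df = pd.read_excel("final_results.xlsx")
for topic, group in df.groupby("topic"):
    if count_tokens(group["segment_text"]) <= TOKEN_BUDGET:
        selected = group
    else:
        # Randomize at the comment (document) level so shorter comments get a
        # fairer chance of being represented in the summary.
        shuffled_ids = group["comment_id"].drop_duplicates().sample(frac=1, random_state=0)
        picked, used = [], 0
        for cid in shuffled_ids:
            segs = group[group["comment_id"] == cid]
            cost = count_tokens(segs["segment_text"])
            if used + cost > TOKEN_BUDGET:
                break
            picked.append(segs)
            used += cost
        selected = pd.concat(picked)   # assumes at least one comment fits the budget
    # selected["segment_text"] is then joined and sent with the topic's summary prompt.
```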
Manual processes and things to consider for the future:
- OHS cleaned data for summary set 2: For this particular use case, OHS decided to split the topics into two groups for summarization. The first topic group we ran through the summary code as soon as the Excel file was ready. For the second group of topics, we waited for OHS to manually clean up the Excel file to ensure that the segments being sent to ChatGPT were as accurate as possible.
- OHS decided which topics to have longer summaries for: There were some topics where OHS preferred a slightly longer summary word count. We indicated in our code that for those topics, the prompt should ask for summaries that total around the OHS-requested word count.
- Summary prompt: We kept the summary script simple and provided a test set of summaries to OHS based on 2015 topics. The summaries returned from ChatGPT varied in structure, which OHS appreciated, so we kept the prompt simple for our production run as well.
- We made the decision to randomize those comments at the document level, rather than the segment level. This gives shorter comments with fewer relevant segments a more equal chance of being selected for summarization.
At ACF OHS's request, data files and code files reside on AWS infrastructure: an EC2 virtual machine (ohs-vm-aws). A VPN gateway links this AWS VM to the Azure virtual machine (ohs-vm) to secure the transfer of comment prompts and of the responses from the OpenAI Generative Pre-trained Transformer (GPT) models. The OpenAI GPT models are hosted only on Azure OpenAI endpoints. For data security, those endpoints are associated with the Arch subscription (RS-OHS-RCI-01-SUB) and are kept private, linked only to the Azure virtual machine within a virtual subnet (OHS-Vnet); through that link, the AWS virtual machine can reach the OpenAI GPT model endpoints to perform the thematic grouping analysis.
For full details of our cloud architecture setup, click here.
The final products delivered to OHS were the long Excel file, with one row per segment per GPT-assigned topic, including missing segments and the new topic tags that came from exact term matching and bill-number tagging. This Excel workbook also includes a data dictionary so that readers can interpret all the columns in the final data file. We also provided a summary (in .docx format) for each topic requested by OHS.