Summarize and organize comments submitted by the public on a newly proposed rule so that the OHS content team can more efficiently formulate informed responses to the rule and to the public's feedback.
Our solution uses OpenAI's GPT 4.0-32k model to provide a first-pass thematic grouping of all comments, reducing the amount of work that needs to be done by the OHS team and improving the accuracy of those thematic groups compared to the previous method of keyword searching. We then used ChatGPT to prepare a first-draft summary of all comments tagged to a given topic. All references in this document to "ChatGPT" should be interpreted as "OpenAI's GPT 4.0-32k model." From scoping to delivery, this project took 13 weeks (10 weeks of research and build time; 3 weeks of analyzing the full set of comments).
Manual processes and things to consider for the future:
- We were not sure which OpenAI GPT model would be the best fit for this solution. To determine this, we considered the token implications of each model, the location availability within Azure, and the quality of responses from GPT 3.5 vs GPT 4.0.
- Tokens are measured in Azure in two ways: the number of tokens sent at once to a model (a model-specific limit that exists at the model-instance level), and the number of tokens sent over the course of one minute (a quota determined by Azure that is also model-specific but exists at the model level, not the instance level). Once we decided on GPT 4.0, we noticed that sending our prompt along with one text chunk and receiving a response took about 30 seconds. We did not need to worry about the "all-at-once" limit, because we were sending well under the 32,000 tokens-at-once limit, so we focused on the per-minute quota. We ended up creating nine model instances and splitting the per-minute quota across those nine deployments; a rough token-budget sketch follows this list. Managing these token quotas while also balancing pipeline efficiency is something that should be considered in future iterations of this work.
- We decided on GPT 4.0 after running a side-by-side comparison of a GPT 3.5 model and the GPT 4.0 model. We took our prompt, ran the same set of text chunks through both GPT 4.0 and GPT 3.5, and presented the results to OHS. This process let them see the differences in response quality between the two models, and OHS ultimately decided that GPT 4.0 was better for their use case.
- This 10-week build timeline also included a week that was dedicated to a pipeline test run. We used a previous OHS rule (with an adjusted prompt) and ran through our entire pipeline process to understand how manual portions would be incorporated, to bring issues in our code to light, and to get a better sense of how long each portion of the pipeline would take. We would highly recommend this step of the process be included in the future. It was integral to our success in the short analysis timeline that we had after the comments for the 2023 rule were made available.
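To make the quota reasoning in the bullets above concrete, here is a rough token-budget sketch. It uses the tiktoken package; the quota, throughput target, and prompt/chunk contents are placeholders, not the project's actual values.

```python
# Rough token-budget arithmetic for sizing Azure OpenAI deployments.
# All numbers below are illustrative placeholders, not the project's actual quotas.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

PROMPT = "...(the thematic-grouping prompt)..."
chunk_text = "...(one 800-word text chunk)..."

tokens_per_request = len(enc.encode(PROMPT)) + len(enc.encode(chunk_text))
assert tokens_per_request < 32_000          # stays under the per-request (context) limit

# With ~30 seconds per round trip, throughput is limited by the per-minute quota.
TOKENS_PER_MINUTE_QUOTA = 40_000            # placeholder Azure quota per deployment
requests_per_minute_needed = 20             # placeholder throughput target
tokens_per_minute_needed = requests_per_minute_needed * tokens_per_request

deployments_needed = -(-tokens_per_minute_needed // TOKENS_PER_MINUTE_QUOTA)  # ceiling division
print(f"~{deployments_needed} deployments needed to sustain "
      f"{requests_per_minute_needed} requests/minute")
```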
The public comments and attachments are collected through the Federal Docket Management System (FDMS) and are downloaded as a zip file containing HTML inline comments and attachments in various file formats. The pipeline scraped all inline comments, converted all attachments, and created a DataFrame with all of their text. We first downloaded the FDMS comments and attachments as a zip file (ACF-2023-0011 2024-01-20 03-01-58.z1p); the file name needs to be modified so the file becomes a zip file and can be unzipped. We then ran the Jupyter notebook 'move_files_to_subfolder.ipynb' to sort the extracted files into separate folders: PDF, Docx, PPTX, RTF, and TXT files are each moved into their own folder, and the remaining files are moved to an 'Other' folder. Next, all of these attachment files needed to be converted into machine-readable text. For all file types other than PDF, Python code was effective at converting the document to text, but we needed a better way to convert PDFs, so we chose to first convert the PDFs to Docx and then use Python to convert the Docx documents. The Jupyter notebook 'clean_raw_data.ipynb' converts all of the different file types, including the Docx files converted from PDFs, into a pandas DataFrame saved as the file '2023_scrape.pkl'. This file is then uploaded to the AWS virtual machine (ohs-vm-aws) for further processing.
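As a rough illustration of this ingest step (not the exact contents of the notebooks), the sketch below sorts extracted attachments into per-extension folders and converts the readily machine-readable formats to text in a pandas DataFrame. Paths, folder names, and column names are assumptions.

```python
# Sketch of the ingest step: sort extracted attachments by extension, convert the
# machine-readable formats to text, and collect everything into a DataFrame.
# Paths, folder names, and column names are illustrative.
from pathlib import Path
import shutil

import pandas as pd
from docx import Document  # python-docx

RAW_DIR = Path("attachments_raw")
SORTED_DIR = Path("attachments_sorted")
KNOWN_TYPES = {".pdf", ".docx", ".pptx", ".rtf", ".txt"}

# 1) Move each file into a subfolder named after its extension ("Other" otherwise).
for path in RAW_DIR.iterdir():
    if not path.is_file():
        continue
    ext = path.suffix.lower()
    subfolder = ext.lstrip(".") if ext in KNOWN_TYPES else "Other"
    dest = SORTED_DIR / subfolder
    dest.mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), str(dest / path.name))

# 2) Convert Docx and TXT files to text (PDFs are first converted to Docx
#    separately, then picked up here as .docx files).
def docx_to_text(path: Path) -> str:
    return "\n".join(p.text for p in Document(str(path)).paragraphs)

rows = []
for path in (SORTED_DIR / "docx").glob("*.docx"):
    rows.append({"file_name": path.name, "attachment_text": docx_to_text(path)})
for path in (SORTED_DIR / "txt").glob("*.txt"):
    rows.append({"file_name": path.name, "attachment_text": path.read_text(errors="ignore")})

pd.DataFrame(rows).to_pickle("2023_scrape.pkl")
```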
Manual processes and things to consider for the future:
- We explored using OCR packages in Python to convert PDFs to Word, but found they performed poorly at recognizing line breaks. We also considered using ChatGPT's OCR capabilities to convert documents, but this seemed an unnecessary use of tokens when free options were available. A team member already had an Adobe Acrobat license, which met performance expectations without increasing project cost. This required downloading the folder of PDF documents, converting them locally, and re-uploading them. Adobe Acrobat successfully converted 196 out of 199 PDFs. An additional 16 documents, which were flagged by content filters, were modified manually to remove excessive filler content (footnotes, signatures, charts). The original versions were overwritten by the modified versions in the upload.
For each comment number in our dataset, the corresponding attachments were converted to a text string and added to a pandas DataFrame, with one row per attachment. The rows were then evaluated to flag comments from entities identified by the Office of Head Start (OHS) for early review. Head Start Association commenters were flagged using a regular expression search for "Head Start" and "Association", and we manually reviewed the list of distinct 'Government Agency' entries to flag those in the federal government. The resulting pandas DataFrame was saved as a pickle file called '2023_scrape.pkl', and the "important commenter" data was saved separately to share with OHS so they could start reviewing those comments as soon as possible.
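A minimal sketch of that flagging logic, assuming illustrative column names ("comment_text", "organization") and a placeholder agency list:

```python
# Sketch: flag "important commenter" rows for early OHS review.
# Column names ("comment_text", "organization") are assumptions for illustration.
import pandas as pd

df = pd.read_pickle("2023_scrape.pkl")

# Head Start Association commenters: both phrases must appear somewhere in the text.
has_head_start = df["comment_text"].str.contains("head start", case=False, na=False)
has_association = df["comment_text"].str.contains("association", case=False, na=False)
df["is_head_start_assoc"] = has_head_start & has_association

# Federal government commenters: hand-curated from the distinct 'Government Agency'
# entries (the set below is a placeholder).
federal_agencies = {"Administration for Children and Families"}  # placeholder
df["is_federal_gov"] = df["organization"].isin(federal_agencies)

# Share flagged rows with OHS for early review.
df[df["is_head_start_assoc"] | df["is_federal_gov"]].to_excel(
    "important_commenters.xlsx", index=False
)
```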
Next, the text was cleaned and formatted for ChatGPT. Duplicate comments and attachments were removed, and duplicate comment/attachment counts were created. The text was cleaned of foreign characters and non-semantic content to reduce token counts. Once duplicates were removed, any rows with no comment or attachment text were dropped. Non-English comments were identified and translated via ChatGPT. Comment and attachment text were combined and chunked into sections of 800 words to stay within the context (token) limit of the selected model. The final chunked DataFrame was saved, along with logging documents.
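The chunking step can be sketched roughly as follows; the 800-word size comes from the text above, while the column names and file names are assumptions:

```python
# Sketch: combine comment and attachment text, drop empty rows, and split into
# 800-word chunks so each request stays well inside the model's context limit.
# Column and file names are illustrative.
import pandas as pd

CHUNK_WORDS = 800

def chunk_text(text: str, size: int = CHUNK_WORDS) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

df = pd.read_pickle("2023_scrape.pkl")
df["full_text"] = (
    df["comment_text"].fillna("") + "\n" + df["attachment_text"].fillna("")
).str.strip()
df = df[df["full_text"] != ""]          # drop rows with no comment or attachment text

chunked = (
    df.assign(chunk=df["full_text"].apply(chunk_text))
      .explode("chunk")
      .reset_index(drop=True)
)
chunked["chunk_number"] = chunked.groupby("comment_id").cumcount()
chunked.to_pickle("2023_chunked.pkl")
```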
Manual processes and things to consider for the future:
- Government commenters were initially flagged by importing a list of all representatives and checking for exact matches; however, this failed to identify any matches. Instead, we manually reviewed the list of distinct 'Government Agency' entries to flag those in the federal government.
- The aforementioned data loading and cleaning code was run several times as lower-quality attachments were identified and modified. Lower-quality attachments were flagged as "do not run" by comparing the attachment text against a series of quality thresholds, which were predetermined from testing on 2015 data. The flagged attachments were exported and evaluated by the team. Obvious data-ingest issues (primarily excessive line breaks) or excessive filler content were addressed, and modified versions of the documents were re-uploaded (the 16 documents referred to above). The cleaning code was then rerun, and the updated flagged attachments were sent to the client for review. This list of attachments also included comments with multiple attachments, because our review found that most comments with multiple attachments contained reference materials rather than actionable comments. The adjudications from the client were turned into a list and added to the code to override the "do not run" flags. The final manual data cleaning step focused on identifying attachments that were near-duplicates rather than perfect duplicates. The longest attachments were often written and submitted by multiple commenters, with slight differences in introductions and signatories; these small differences meant that the attachments weren't being picked up by the deduplication code. The longest documents (greater than 16,000 words) were flagged and manually reviewed for obvious missed duplicates. We limited the review to these documents because they would have the largest impact on token usage. The preferred version of each duplicate was identified, and the other version was modified so it would be removed by the deduplication code. In the future, this last step could be automated with a fuzzy-matching algorithm; a sketch follows this list.
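For reference, a future automated pass at the near-duplicate step could look something like the sketch below, which uses the standard library's difflib to score similarity between the longest attachments. The 16,000-word cutoff comes from the bullet above; the similarity threshold and column names are assumptions.

```python
# Sketch: flag near-duplicate long attachments with a fuzzy similarity score so a
# reviewer can confirm which version to keep. Slow for very long texts; a library
# such as rapidfuzz would be faster. Threshold and column names are illustrative.
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd

df = pd.read_pickle("2023_scrape.pkl")
long_docs = df[df["attachment_text"].str.split().str.len() > 16_000]

near_dupes = []
for (i, a), (j, b) in combinations(long_docs["attachment_text"].items(), 2):
    score = SequenceMatcher(None, a, b).ratio()   # 0.0-1.0 similarity
    if score > 0.95:                              # placeholder threshold
        near_dupes.append((i, j, round(score, 3)))

print(near_dupes)   # candidate pairs for manual confirmation before deduplication
```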
Our goal was to send all text chunks to ChatGPT with the same prompt and store ChatGPT's responses in a programmatic way. Because we were unsure about the number of comments we would be receiving, we decided to parallelize our API calls to improve the efficiency of our pipeline. The gpt_parallel.py script parallelizes our API calls across ten distinct but identical OpenAI GPT 4.0 model deployments and stores each individual response from ChatGPT in its own JSON file. Doing this ensures that any pipeline interruptions do not require us to rerun text chunks that we already have responses for, and it allows us to easily access the saved responses programmatically. After all the JSON files are created, gpt_parallel.py loads each JSON file and uses pandas to create a DataFrame consisting of one row per segment per topic returned by ChatGPT.
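A condensed sketch of that approach, using the openai package's AzureOpenAI client; the deployment names, environment variables, and file layout are placeholders rather than the exact contents of gpt_parallel.py:

```python
# Sketch of parallelized topic tagging: round-robin chunks across several identical
# deployments and cache each response as its own JSON file so that reruns skip
# chunks that already have answers. Deployment names and env vars are placeholders.
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import pandas as pd
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)
DEPLOYMENTS = [f"gpt4-32k-{i}" for i in range(10)]   # placeholder deployment names
PROMPT = "...(topic-tagging prompt)..."
OUT_DIR = Path("gpt_responses")
OUT_DIR.mkdir(exist_ok=True)

def tag_chunk(args):
    idx, chunk = args
    out_file = OUT_DIR / f"chunk_{idx}.json"
    if out_file.exists():                        # answered on a previous run
        return
    response = client.chat.completions.create(
        model=DEPLOYMENTS[idx % len(DEPLOYMENTS)],   # spread load across deployments
        messages=[{"role": "system", "content": PROMPT},
                  {"role": "user", "content": chunk}],
        temperature=0,
    )
    out_file.write_text(response.choices[0].message.content)

chunks = pd.read_pickle("2023_chunked.pkl")["chunk"]
with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(tag_chunk, enumerate(chunks)))
```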
Manual processes and things to consider for the future:
- ChatGPT prompt and topic list: The most manual and iterative part of this project was creating a ChatGPT prompt and topic list that would provide responses that were accurate and useful for OHS. We researched ChatGPT prompt-engineering best practices and developed a prompt that broke each text chunk into segments and, for each segment of text, assigned a list of relevant topics and an intent or list of intents (concern, agreement, question). The list of topics provided to ChatGPT was developed in tandem with OHS's policy team and aligned with the subjects found in the table of contents of the proposed rule. Before deciding on the final prompt and topic list, we went through multiple iterations of validation data in which we randomly selected 10-30 early comments and ran all of their text chunks through ChatGPT with the current prompt. The results were presented to OHS along with an analysis indicating how often topics occurred together, alone, or in groups of three or more. This analysis was meant to help OHS and the Data Surge Team understand which topics could be reworded to be either more or less unique, depending on how well each topic was performing.
- Triage ChatGPT errors: One issue we encountered early in our prompt-engineering process was that, even when explicitly asked to return its response as valid JSON, ChatGPT sometimes would not, so we built the code in gpt_parallel.py to be robust to errors within the JSON files as well as any OpenAI errors. gpt_parallel.py creates a DataFrame of text chunks whose JSON files are not valid and resends those text chunks to ChatGPT; rerunning these chunks resulted in valid JSON files being created (a sketch of this validate-and-retry pattern follows this list). The only issue we ran into during production that we had not already experienced was that one of our text chunks triggered OpenAI's "content filter," which is meant to catch prompts that are malicious or inappropriate. After reviewing the text chunk and researching similar issues online, it appears this was a common bug for ChatGPT 4.0, so we proceeded by including this text chunk as a "missing segment."
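The validate-and-retry pattern can be sketched as below. The JSON shape shown in the comment is illustrative of the segment/topic/intent structure, not the exact production schema, and the file layout matches the earlier sketch rather than gpt_parallel.py itself.

```python
# Sketch: check each saved response for valid JSON and queue invalid ones for a rerun.
# Illustrative expected shape (not the exact production schema):
#   [{"segment": "...text...", "topics": ["Eligibility"], "intents": ["concern"]}, ...]
import json
from pathlib import Path

invalid_chunks = []
for out_file in Path("gpt_responses").glob("chunk_*.json"):
    try:
        parsed = json.loads(out_file.read_text())
        assert isinstance(parsed, list)          # minimal structural check
    except (json.JSONDecodeError, AssertionError):
        invalid_chunks.append(int(out_file.stem.split("_")[1]))
        out_file.unlink()                        # delete so the rerun recreates it

# These chunk indices are then fed back through the same tagging function; in our
# experience one or two reruns were enough to obtain valid JSON for every chunk.
print(f"{len(invalid_chunks)} chunks queued for rerun: {sorted(invalid_chunks)}")
```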
This DataFrame of responses is then compared to the original chunked DataFrame to determine whether there are any chunks of text that ChatGPT did not return with a topic. We wanted to ensure that the final dataset provided to OHS included all text from every comment and attachment, so "segment_return.ipynb" runs a comparison of the segments returned from ChatGPT and the chunks of text created in our data-processing step. Any text that the script finds in the chunked text but not in the returned segment text is added to the final DataFrame and categorized as a "Missing Segment." The majority of those segments are text that shouldn't be categorized (typos/gibberish, URLs, signatures, etc.), but it was important that all the text from the original comments was represented in the final product.
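A simplified approximation of that comparison, assuming the returned segments and original chunks are available as DataFrames with the column names shown (the real notebook also involves manual validation):

```python
# Simplified sketch: find original text that was never returned as a tagged segment
# and append it as a "Missing Segment" row. Column and file names are illustrative,
# and the sentence-level comparison is an approximation of the notebook's logic.
import pandas as pd

chunks = pd.read_pickle("2023_chunked.pkl")       # one row per chunk of input text
segments = pd.read_pickle("gpt_segments.pkl")     # one row per segment per topic

returned = list(segments["segment_text"].str.strip())

def missing_portion(chunk_text: str) -> str:
    # Keep any sentence of the chunk that never appears inside a returned segment.
    return " ".join(
        s for s in chunk_text.split(". ")
        if s.strip() and not any(s in seg for seg in returned)
    )

missing = chunks.assign(segment_text=chunks["chunk"].apply(missing_portion))
missing = missing[missing["segment_text"] != ""]
missing["topic"] = "Missing Segment"

final = pd.concat(
    [segments, missing[["comment_id", "segment_text", "topic"]]],
    ignore_index=True,
)
```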
The ChatGPT results were then combined with these missing segments and reformatted so that one row represented a topic, per segment, per chunk, per attachment. In conversations with the client, we identified two methods of identifying topics other than using ChatGPT. A couple of terms were being over-identified by ChatGPT, so we removed them from the topic list and identified them using exact matches only: if the exact term appeared in a segment, we added the corresponding topic as a new row if the topic had not already been identified. We also used a bill number / topic crosswalk to identify topics when a bill number (table of contents number) was referenced. For bill numbers that corresponded to only one topic, we added the corresponding topic as a row whenever the bill number appeared and the topic had not already been identified. A "topic_origin" field recorded whether each topic came from ChatGPT, exact matching, or bill numbers.
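The two rule-based tagging passes can be sketched as follows; the example term list and bill-number crosswalk are placeholders, since the real lists were defined with OHS:

```python
# Sketch: add topic rows from exact term matches and a bill-number crosswalk, but
# only when the topic is not already assigned to that segment. The example mappings
# are placeholders; the real lists were defined with OHS.
import pandas as pd

EXACT_TERMS = {"teacher wages": "Staff Compensation"}       # placeholder mapping
BILL_NUMBER_TOPICS = {"1302.90": "Personnel Policies"}      # placeholder crosswalk

df = pd.read_pickle("final_segments.pkl")   # one row per segment per topic
new_rows = []
for seg_id, group in df.groupby("segment_id"):
    text = group["segment_text"].iloc[0].lower()
    existing = set(group["topic"])
    for term, topic in EXACT_TERMS.items():
        if term in text and topic not in existing:
            new_rows.append({"segment_id": seg_id, "segment_text": group["segment_text"].iloc[0],
                             "topic": topic, "topic_origin": "exact match"})
    for bill_no, topic in BILL_NUMBER_TOPICS.items():
        if bill_no in text and topic not in existing:
            new_rows.append({"segment_id": seg_id, "segment_text": group["segment_text"].iloc[0],
                             "topic": topic, "topic_origin": "bill number"})

df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
```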
Once all of the topics were identified, any ChatGPT-created topics were dropped if an original topic had been identified for the segment. The code then rolled up any consecutive segments with the same topic. This was helpful for the client because it reduced the number of rows by combining text segments per topic and made it easier for the reviewer to read the whole relevant segment of text at once. The new 'segment_number' reflected the starting and ending segment index so readers could tell which segments had been combined. The results were finalized by merging in metadata from before the ChatGPT portion of the pipeline, reordering the columns, and dropping extraneous information. The results were saved and sent to the client.
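The consecutive-segment roll-up is essentially a gaps-and-islands grouping. One way to sketch it in pandas, with assumed column names:

```python
# Sketch: combine consecutive segments of the same comment and topic into one row,
# keeping the starting and ending segment numbers. Column names are illustrative.
import pandas as pd

df = pd.read_pickle("final_segments.pkl").sort_values(
    ["comment_id", "topic", "segment_number"]
)

# A new "run" starts whenever the segment number is not exactly one more than the
# previous segment for the same comment and topic.
new_run = df.groupby(["comment_id", "topic"])["segment_number"].diff().ne(1)
df["run_id"] = new_run.cumsum()

rolled_up = (
    df.groupby(["comment_id", "topic", "run_id"], as_index=False)
      .agg(segment_text=("segment_text", " ".join),
           segment_start=("segment_number", "min"),
           segment_end=("segment_number", "max"))
)
rolled_up["segment_number"] = (
    rolled_up["segment_start"].astype(str) + "-" + rolled_up["segment_end"].astype(str)
)
```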
Manual processes and things to consider for the future:
- The "segment_return.ipynb" notebook includes some manual portions where the programmer needs to validate the results of the script.
- We worked closely with OHS to identify which exact terms should be used within our exact term search code. We also worked with them to identify which topics we wanted mapped to which bill numbers.
- For bill numbers that corresponded to multiple topics, we only added these topics if the segment was not categorized as one of the original topics (e.g., the segment was categorized to a ChatGPT-generated topic, or the segment was an uncategorized missing segment). This decision was made because these multi-topic bill numbers would likely generate over-tagging, and we wanted to balance over-categorizing against under-categorizing. After deliberation, the client communicated that they would like fewer topics included in the bill-number tagging, because under-categorizing was less of an issue than expected, and the over-categorizing caused by bill numbers created more administrative hassle. In the future, bill numbers could be used as a double check instead.
After the final Excel file was created, we used the script "create_summaries.py" to group the DataFrame by topic and, for each topic, send all tagged segments back to ChatGPT asking it to summarize the comments for that particular topic. One limitation of the OpenAI models is that they limit the number of tokens (word portions) sent to them. These limits appear in two forms: first, there is a limit to the number of tokens you can send to a model at once; second, there is a limit to the number of tokens you can send to the model over the course of one minute. These token "quotas" can be adjusted in Azure. To run our text chunks through ChatGPT for topic analysis, we spread our quota across our ten different models; for this summarization step, we reshuffled the quota to put all of our available quota onto one model so we could fit as many segments as possible into our summary prompt. The "create_summaries.py" script groups the final DataFrame by topic and then counts the total number of tokens across all segments for each topic. Any topic whose total token count is under the token limit has all of its relevant segments sent to ChatGPT for summarization. For topics whose total token count is larger than the limit, we randomly select comments until we reach the token limit, and those are the segments sent to ChatGPT.
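A condensed sketch of that sampling logic, randomizing at the comment (document) level as described in the considerations below; the token budget, file names, and column names are placeholders:

```python
# Sketch: for each topic, send every tagged segment when it fits in the context
# window; otherwise randomly sample whole comments until the token budget is used.
# The budget, file names, and column names are placeholders.
import pandas as pd
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
TOKEN_BUDGET = 28_000      # placeholder: leave headroom for the prompt and response

def count_tokens(texts) -> int:
    return sum(len(enc.encode(str(t))) for t in texts)

df = pd.read_excel("final_results.xlsx")
for topic, group in df.groupby("topic"):
    if count_tokens(group["segment_text"]) <= TOKEN_BUDGET:
        selected = group
    else:
        # Randomize at the comment (document) level so shorter comments get a
        # fairer chance of being represented in the summary.
        shuffled_ids = group["comment_id"].drop_duplicates().sample(frac=1, random_state=0)
        picked, used = [], 0
        for cid in shuffled_ids:
            segs = group[group["comment_id"] == cid]
            cost = count_tokens(segs["segment_text"])
            if used + cost > TOKEN_BUDGET:
                break
            picked.append(segs)
            used += cost
        selected = pd.concat(picked)   # assumes at least one comment fits the budget
    # selected["segment_text"] is then joined and sent with the topic's summary prompt.
```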
Manual processes and things to consider for the future:
- OHS cleaned data for summary set 2: For this particular use case, OHS decided to split the topics into two groups for summarization. The first topic group we ran through the summary code as soon as the Excel file was ready. For the second group of topics, we waited for OHS to manually clean up the Excel file to ensure that the segments being sent to ChatGPT were as accurate as possible.
- OHS decided which topics to have longer summaries for: There were some topics where OHS preferred a slightly longer summary word count. We indicated in our code that for those topics, the prompt should ask for summaries that total around the OHS-requested word count.
- Summary prompt: We kept the summary script simple and provided a test set of summaries to OHS based on 2015 topics. The summaries returned from ChatGPT varied in structure, which OHS appreciated, so we kept the prompt simple for our production run as well.
- We made the decision to randomize those comments at the document level, rather than the segment level. This gives shorter comments with fewer relevant segments a more equal chance of being selected for summarization.
At ACF OHS's request, data files and code files reside on AWS infrastructure: an EC2 virtual machine (ohs-vm-aws). A VPN gateway links this AWS VM to the Azure virtual machine (ohs-vm) to secure the transfer of comment prompts and of the responses from the OpenAI Generative Pre-trained Transformer (GPT) models. The OpenAI GPT models are hosted only on Azure OpenAI endpoints. For data security, those endpoints are associated with the Arch subscription (RS-OHS-RCI-01-SUB) and are kept private, linked only to the Azure virtual machine within a virtual subnet (OHS-Vnet); through that link, the AWS virtual machine can reach the OpenAI GPT model endpoints to perform the thematic grouping analysis.
For full details of our cloud architecture setup, click here.
The final products delivered to OHS were the long Excel file, with one row per segment per GPT-assigned topic, including missing segments and the new topic tags that came from exact term matching and bill-number tagging. This Excel workbook also includes a data dictionary so that readers can interpret all the columns in the final data file. We also provided a summary (in .docx format) for each topic requested by OHS.