An LLM should be used as a first pass at incentive data, as it will make mistakes that a human would not make. Nonetheless, it should save a significant amount of time for incentives data gathering, because:
- humans are much better at reviewing existing text than populating it themselves
- the model can work at impressive scale
- the model is correct somewhere around 75-80% of the time, depending on how you measure it
There are parts of this process that are still pretty manual/painful; the good news is that you only have to do them once, and then you can run multiple LLM iterations over the data easily.
At a high-level, using the LLM has the following steps:
- one-time onboarding/setup
- (manual) identifying where the data you want lives – that is, collecting the URLs
- (mostly automated) retrieving and curating web content
- (automated) pass the data through the LLM
- (manual) review
If doing a state refresh, the steps are largely the same, except that the URL-collection step (#2) is likely taken care of for you.
- This assumes you have node, yarn, and typescript installed.
- Clone the repo and run `yarn` to install dependencies.
- For GPT models, ask an account owner (most Incentive API devs have been invited) for an invite to our OpenAI account. Create an API key and put it in a `.env` file with `OPENAI_API_KEY=<api key>`.
  - Compile and then run `node build/src/test_request_gpt.js`. If it works, it will print 3 jokes to the console.
- For Google models, ask an account owner (currently, just Dan) for an invite to our account. Download a credentials JSON file and then set `export GOOGLE_APPLICATION_CREDENTIALS=<path/to/credentials/file>` from the command line.
  - Compile and then run `node build/src/test_request_palm.js`. If it works, it will print 3 jokes to the console.
- Congrats, you've sent your first request(s) to an LLM!
- Download the Xpdf command line tools from https://www.xpdfreader.com/download.html. This will be used later during the text generation step.
- If you're planning on doing semi-automated text retrieval (recommended), download an html-to-text converter. If you're on Mac, you can skip this step, as Mac comes with one already installed (`textutil`). If you're retrieving the text manually, skip this step as well.
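For reference, once setup is done, your local configuration amounts to the following sketch (the key and credentials path are the placeholders from the steps above; the `.env` location is assumed to be the repo root):

```sh
# .env (assumed to live in the repo root) – GPT access
OPENAI_API_KEY=<api key>

# In your shell before running the PaLM/Google scripts
export GOOGLE_APPLICATION_CREDENTIALS=<path/to/credentials/file>
```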
Collect a list of all of the URLs you want to retrieve incentives content from. For each row in the Tier 1 data sheet, you may end up with multiple URLs. The URLs in the Tier 1 sheet are program URLs that introduce the program; what you care about for this process is the actual data source URL(s), since you're going to `curl` them. This works best if the URLs are as specific about individual incentives as possible, so make sure to click through to individual incentive pages rather than summary pages if they exist. The pages we're trying to collect should contain all the detail that might appear in the captured incentive. Note that the Tier 1 spreadsheet is usually complete but might be missing some programs, for example city-level programs in major cities. Check around for those. Unfortunately, this part is manual and it can take a while – up to a couple of hours for larger states.
Note that if you are doing a refresh of a state, you likely have the URLs already – they should be in the existing data in the `Data Source URL(s)` column (or similar).
To use the URLs in the rest of the process, create a Google sheet with each URL you want to visit on a different line. You can include arbitrary metadata as well that isn't found on the website itself, such as the Authority and Program names, other URLs, etc. There is also a rudimentary `tags` column – the only tag currently used is `index`, for websites that don't contain incentive information themselves but contain links to other pages that do. We collect these to eventually enable change detection on those sites. In any case, feel free to define your own tags if helpful. Example spreadsheet is here.
The only requirement is that this file must contain a `folder` column that corresponds to the folder you will create in the repo where your data will live. The URL column doesn't actually matter, since the next script won't read it directly. Create a folder called `in/` in the root directory, then download the sheet as a CSV and put it in the `in/` folder.
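For illustration, a minimal input CSV could look like the following. Only the `folder` column is required by the tooling; the other column names and values here are hypothetical metadata you may want carried through:

```csv
folder,authority,program,data_source_url,tags
va,Dominion Energy,Residential Heat Pump Rebate,https://example.com/heat-pump-rebate,
va,City of Richmond,Home Energy Programs,https://example.com/energy-programs,index
```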
Create a new state folder (or more generally, any new folder) under `incentives_data/`, using the `<folder>` value you put in the sheet above.
Compile and then run `node build/src/create_metadata_files.js -i <your_input_file.csv>`. This will create a bunch of metadata files in `incentives_data/<folder>` that will be used to propagate the metadata you included above. It will also assign each row a numeric ID and use that as the basis for filenames; e.g. the first non-header row in your spreadsheet is associated with `0.txt` and `0_metadata.json`.
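Assuming a `folder` value of `va` (hypothetical), the resulting layout looks roughly like this; the `.txt` files themselves get created in the next step:

```text
incentives_data/va/
  0_metadata.json   # metadata from the first non-header row of your sheet
  1_metadata.json   # metadata from the second row
  ...
```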
Now we have to create text data for the LLM model to use to produce incentive data. The text data is basically what you would see on a website that describes the incentive (or a large set of incentives).
The current file structure is that the LLM looks for all `.txt` files in a given folder and, for each, it queries the model. The model request is a common prompt with the text from the text file appended at the end. It then combines the results into a large spreadsheet. If you want, you can do this as a purely manual process and simply feed in arbitrary text files, or use a semi-automated approach.
If grabbing the text data manually, run `for i in {0..23}; do touch $i.txt; done` (replacing `23` based on the number of text files you need to create). If using the automated approaches below, skip this step.
TODO: put as much of this in a bash script as is feasible.
- Use `curl` to fetch webpage content. In your metadata spreadsheet that has the filename numbers associated with each row (add a column for this if not done already), add another column with the formula `=G2&" -o "&H2&".html"`, changing `G2` and `H2` to point to the URL column and the file number, respectively. At the bottom, add `="curl "&textjoin(" ",true,I2:I89)` (replacing the range with the column you just created) to get the `curl` command you need to run. Run this from the destination folder and it will sequentially pull the raw HTML data from the web and save it into `.html` files. (See the example command after this list.)
- Find any PDFs in the previous step and rename the extension to `.pdf` instead of `.html`. This can be done manually or with a bit of Google Sheets magic.
- Figure out where the `pdftotext` script lives (this is part of the Xpdf command line tools you installed during setup).
- Run `find *.pdf -exec bash -c '<path/to>/pdftotext -layout "$0" "${0%.*}.txt"' {} \;`. This finds all PDFs in the current folder, runs the pdftotext script on them, and saves the result in a text file with the same name (e.g. `1.pdf` gets parsed into `1.txt`).
- Use an html-to-text converter to convert the `.html` files to text. For Mac, there is already something available called `textutil`; Linux users might need to look for something else. Run the command with something like `find *.html -exec bash -c 'textutil -convert txt "$0" -output "${0%.*}.txt"' {} \;`.
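As a concrete (made-up) example, the `textjoin` formula above produces one long command along these lines; your URLs and file numbers come from your spreadsheet. Note the third URL is a PDF that initially gets saved with an `.html` extension, which is why the rename step exists:

```sh
curl https://example.gov/rebates/heat-pumps -o 0.html \
  https://example.gov/rebates/ev-chargers -o 1.html \
  https://example.gov/docs/program-guide.pdf -o 2.html
```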
At this point, you should have a `.txt` for every URL.
- Go through the text files and ensure that each one is populated. Occasionally websites are Javascript-oriented, so the HTML doesn't have any meaningful text to extract. If that's the case, pull down any text that is relevant to the incentives on that page and put it in the text file. (See the snippet after this list for one way to spot empty files.)
- Clean up the txt files. That could mean removing text from the beginning or end that isn't incentive-related, cleaning up tables that didn't parse well, etc. You might need to open the original website or PDF in some cases (there are typically extensions for your code editor so you can do these side-by-side). You also may want to delete any incentives from the txt files that we don't care about (e.g., refrigerator incentives).
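One quick way to flag files that came back empty or nearly empty (a common symptom of Javascript-heavy pages) is a find like the one below; the 200-byte threshold is arbitrary:

```sh
# List .txt files smaller than ~200 bytes; these probably need manual attention.
find . -name "*.txt" -size -200c
```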
If you don't want to send a particular row to the LLM for any reason, just clear out the text or delete the file. You can also add `index` to the `tags` field of the metadata CSV (or metadata file) as described above.
Whether achieved manually or partially automated, you should now have a bunch of text files in a folder under `incentives_data/`, containing the text from the websites in a state where a human or AI would have a decent shot at extracting meaningful information from it.
- Compile and then run the script from the root directory: `node build/src/llm_runner.js --folders=<name of folder with text data>`. Run `node build/src/llm_runner.js --help` for details on other flags. We use the PaLM model by default, but this can be controlled with the `--model_family` or `-m` parameter. Note that while `gpt` and `palm` (the default) are relatively cheap, `gpt4` is more expensive, so you can have real monetary consequences if you're not careful. (See the example invocation after this list.)
  - For the next few steps, start by using `palm` or `gpt` until you're able to generate output files without errors.
  - Once the files are ready for the final run, you can use `gpt4`. Check https://platform.openai.com/usage to ensure that costs are not getting too high.
- It will take a few minutes. Apparently there are periodic cases where the API times out after 10 minutes, but these are rare. There's also the potential for rate-limiting, so if you have lots of files, use the `--wait` parameter (in milliseconds) to put some time in between each request. Usually a couple of seconds is fine.
- The script writes outputs to a RunID-specific subfolder in the `out/` directory. The script will print your RunID, though it's timestamp-based and should be the most recent one. In that folder, you should see:
  - An `output.csv` containing the parsed data
  - An `outputs/` directory containing all the JSON associated with each input file
  - A `parameters.json` file saving the details of the run so we can figure out what worked and what didn't
  - Possibly, a `dropped_files.json`, which will list any files that had errors to retry. You'll fix those in the next step.
- If there are any files in `dropped_files.json`, review and fix them. You can rerun these by passing the same command as before with the `-r` flag (e.g. `-r incentives_data/ca/31.txt incentives_data/mn/17.txt`) to restrict the run to just certain text files. The most common cause of error is that the LLM models have limits on the length of their output, ~4k characters for GPT and ~8k characters for PaLM. If this is the case, you'll see `Error parsing json: SyntaxError: Unexpected token` or similar errors in the console, because the output is being cut off, resulting in malformed JSON. To fix this you can either:
  - Review the input txt files and delete any irrelevant incentives, if there are any
  - Split up the txt files into `1.1.txt`, `1.2.txt`, etc.
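For reference, a full run over a hypothetical `ca` folder using the more expensive model and a two-second delay between requests might look like:

```sh
node build/src/llm_runner.js --folders=ca --model_family=gpt4 --wait=2000
```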
This data should be considered a rough first pass. You'll end up making a lot of changes, but it should still save time.
We have a doc that describes recommended post-processing steps here: https://docs.google.com/document/d/1pCIBaYrSiT9ufA9tVqPpZlrbjNGrvZd9ZfItEjjyvJc.
After running through the steps in that doc and thoroughly checking the incentives, finally run the spreadsheet-to-json script and fix any errors.
We have a basic eval framework set up to evaluate how a model is performing. This is not something you can do with data recently collected; you need to have a golden dataset that is considered correct.
This requires some file setup:
- In each folder in `incentives_data/` that you want to evaluate, you need golden files with the suffix `_golden.json`, e.g. `1_golden.json` is the golden file for `1.txt`. These are JSON files that contain the incentives as they should appear; in other words, the answer key.
- Then execute a model run as described in the previous section, retaining the RunID. Run `node build/src/run_evals.js -r <run_id>` to run your eval. You'll see a `diffs.json` and `report.csv` in the `out/<run_id>` folder now, containing the "raw" diffs and then a summary report.
To interpret the output: we grade every key (field) that the model is requested to produce. For each key, we first check whether it is missing in both the golden and predicted (model-produced) output, missing in only one, or populated in both. If populated in both, we compare the two. For all fields, we start by looking for a case-insensitive match. If we don't find it, for shorter fields we'll use a fuzzy matching algorithm, and for longer ones we actually send it back to the model to grade its own response! The fuzzy match is binary (FuzzyMatch or FuzzyNoMatch), but the model grading has four levels to give a bit more nuance in the answer.
The `diffs.json` files have a field-by-field comparison, including an explanation of why the field was given the grade that it was (for model-graded fields). The `report.csv` rolls this up so you just see the totals by key and grade.
- We only have the ability to compare two JSON files with the same number of incentives in the same order right now, so in practice, it's typically best to use an artificial test set specifically constructed for this purpose. In the longer term, it would be nice to be able to construct diffs that are aware of whether the model (or golden) is missing incentives.
- LLMs are not deterministic. In practice, this isn't a major issue for GPT3.5 and PaLM 2 – they are deterministic enough that the metrics don't vary much between runs. For GPT4, I recommend you run at least 3 trials and inspect how similar the numbers are before being confident you're seeing a meaningful change.