The Center for Good Food Purchasing (CGFP) scores foods purchased by public institutions on five factors: local economies, health, valued workforce, animal welfare, and environmental sustainability.
To do this scoring, they transform the food names from purchase orders into a "normalized name" that is based on a hierarchical tagging structure. These normalized names are then scored on CGFP's five factors.
For example, `CRANBERRIES, DRIED, INDIVIDUALLY PACKAGED (1.16 oz./pkg.)` becomes `cranberry, dried, ss`, and `YOGURT, ASSORTED GREEK LIGHT SS CUP REF STRAWBERRY BLUEBERRY` becomes `yogurt, greek, variety, ss`.
The name normalization process is time-consuming and error-prone, so we have trained a language model to perform a first pass of name normalization.
This repo contains:
- A data cleaning pipeline to clean CGFP's historical labeled data for use in training a text classifier to perform their name normalization task. This is designed to be run locally.
- A training pipeline to train a text classifier to perform the name normalization task. This is to be run on UChicago's DSI Cluster.
CGFP has provided exports from their scoring platform that include tens of thousands of examples with human-labeled normalized names.
However, there are many inconsistencies in the data, and CGFP's name normalization requirements have changed over time.
Also, our text classifier is a multi-task classifier that performs classification across multiple columns, so we need to split the normalized names into their appropriate columns.
The data pipeline takes in the normalized name (e.g. `taquito, egg, turkey sausage, cheese, potato, frozen`) and splits it into multiple columns following the structure in CGFP's name normalization helper.
- Build the Docker container to run the pipeline (see the Docker section)
- Download the raw data you wish to clean (see the Raw Data section)
- Update `scripts/config_pipeline.yaml` with the filename and location of the data you wish to clean. Note that the file is expected to be in `/data/raw` inside the container.
- Run `scripts/pipeline.py`
While the high-level idea of the data pipeline is relatively intuitive, the implementation is messy and full of one-offs and edge cases, so it can be hard to follow what's going on.
The basic intuition goes something like this:
- We split the normalized name on commas so we have a list of tags
- We process the tags one by one, allocating them to the appropriate name normalization column
- The first tag in a normalized name is always "Basic Type"
- Different columns have different allowed tags based on the product's Food Product Group and Food Product Category: we check if a tag is allowed in any of the other columns, and, if so, allocate it to that column
- If a tag is not allocated to any of the other columns, allocate it to a "Sub-Type" column
- Throughout the process, we check for edge cases and directly apply any rules associated with these edge cases
The allowed tags for each column are saved in `misc_tags.py`.
Much of the rest of the pipeline code is handling edge cases. Most of these rules are saved as dictionaries in `src/cgfp/constants/tokens`.
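As a rough sketch of the allocation flow (the function and tag sets below are illustrative, not the pipeline's actual API; in the real pipeline the allowed sets also depend on Food Product Group and Food Product Category):

```python
# Illustrative sketch only: ALLOWED_TAGS stands in for the per-column tag
# sets in misc_tags.py, ignoring the Food Product Group/Category logic.
ALLOWED_TAGS = {
    "Frozen": {"frozen"},
    "Packaging": {"ss", "bulk"},
    # ... one entry per name normalization column
}

def allocate_tags(normalized_name: str) -> dict:
    tags = [tag.strip() for tag in normalized_name.split(",")]
    # The first tag is always "Basic Type"
    columns = {"Basic Type": tags[0], "Sub-Types": []}
    for tag in tags[1:]:
        # Allocate the tag to a column that allows it...
        for column, allowed in ALLOWED_TAGS.items():
            if tag in allowed:
                columns[column] = tag
                break
        else:
            # ...otherwise treat it as a sub-type
            columns["Sub-Types"].append(tag)
    return columns

allocate_tags("taquito, egg, turkey sausage, cheese, potato, frozen")
# {'Basic Type': 'taquito', 'Sub-Types': ['egg', 'turkey sausage', 'cheese', 'potato'], 'Frozen': 'frozen'}
```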
The data pipeline runs in Docker.
If you are using VS Code, there is a `.devcontainer` folder with the Docker configuration to run as a dev container. Otherwise, build the Docker image using the `Dockerfile` in the root of the repo.
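For example, from the repo root (the image tag and volume mount below are illustrative; adjust to your setup):

```
# Build the image from the repo root (the tag "cgfp-pipeline" is arbitrary)
docker build -t cgfp-pipeline .
# Run the container with local data mounted at /data, where the pipeline expects it
docker run -it -v "$(pwd)/data:/data" cgfp-pipeline
```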
We've been using the pipeline to clean these two data sets:
Make sure to download them as `.xlsx` files from the Google Drive site. The files are expected to be in the `/data/raw` directory.
Once the data cleaning pipeline is run, files should appear in the directory `data/clean/pipeline-{date}-{time}`. These files are the input for training the text classifier. The following files are created:

- `value_counts.xlsx`
- `normalized_name_diff.csv`
- `scoring.csv`
- `misc.csv`
- `clean_CONFIDENTIAL_CGFP_bulk_data_073123.csv`

The final file in this list is the one required for the classification task below.
We use Huggingface to train a multi-task text classifier on the name normalization task.
We take in an example's Product Type as input (e.g. `CRANBERRIES, DRIED, INDIVIDUALLY PACKAGED (1.16 oz./pkg.)`) and we output a classification for each column in CGFP's name normalization tool.
Note that, other than "Food Product Group", "Food Product Category", "Primary Food Product Category", and "Basic Type", all of the other columns can be (and usually are) empty.
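For intuition, a single prediction can be pictured as one value per column; the values below are purely illustrative:

```python
# Purely illustrative output shape: actual column names and values
# come from CGFP's name normalization helper.
prediction = {
    "Food Product Group": "Produce",
    "Food Product Category": "Fruit",
    "Primary Food Product Category": "Fruit",
    "Basic Type": "cranberry",
    "Sub-Types": ["dried"],
    # most other columns are usually empty
}
```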
We have infrastructure to train both RoBERTa and DistilBERT models.
Before running the commands below, make sure you have set up an environment variable with your wandb.ai API key. The variable should be named `WANDB_API_KEY` and should be available in the environment that is calling the `make` commands below.
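For example, in a bash shell (the value below is a placeholder):

```
export WANDB_API_KEY=your-wandb-api-key
```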
There is a configuration variable in `config_train.yaml` called `smoke_test`. Setting this to `true` will run through a small subset of the data in order to test any code. You need to set it to `false` in order to run the actual models.
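For example (the exact location of this key within `config_train.yaml` may differ):

```yaml
smoke_test: true # set to false for a full training run
```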
To get good results for all columns, we need to do a multi-stage fine-tuning process.
- First, clean a dataset using the data pipeline. Upload this dataset to wherever you'll be training your model.
- Upload a validation set and a test set:
  - Validation set: note that this should be downloaded as a CSV file.
  - Test set: note that this is not a "true" test set, but is used at the end of training to run inference with the trained model. This should be downloaded as an Excel file.
- Update `scripts/config_train.yaml` with the location of your training, validation, and testing datasets and the location where you'd like to save your models.
  - All files should be in the same directory (`data_dir` in `config_train.yaml`).
  - You can also choose training options in this yaml file. Most of the defaults should work well.
- Build the conda environment in `environment.yml`. Note that this creates an environment with the name `cgfp`, as per the file.

```
conda env create -f environment.yml
```
- Create a `.env` file in the root directory which contains the following information:

```
[email protected]
CONDA_ENV_PATH=full path to conda location
CGFP_DIR=/net/projects/cgfp/
DSI_PARTITION=general
ENV_NAME=cgfp
```
- The `CONDA_ENV_PATH` is specific to how you installed conda. You should look for a directory which contains the `cgfp` environment that was created from `environment.yml`. If you aren't sure where to look, you can try typing `echo $CONDA_PREFIX` into the terminal to find the root. On my installation:

```
(base) nickross@fe01:~$ echo $CONDA_PREFIX
/home/nickross/miniconda3
(base) nickross@fe01:~/miniconda3/envs/cgfp$ ls /home/nickross/miniconda3/envs/cgfp/
```

and I use `/home/nickross/miniconda3/envs/cgfp/` as my `CONDA_ENV_PATH`.
- Then, train the model using the following command:

```
make train
```

- If you are not using the UChicago DSI cluster, activate the `cgfp` conda environment and run `scripts/train.py`.
To get good performance across tasks, we run a multi-stage fine-tuning process where we freeze and unfreeze the base model while also attaching different classification heads to the computation graph. We can configure all of this in `config_train.yaml`.
We use the same command as above (`make train`) on the DSI cluster to run the training. Updating the `config_train.yaml` file changes how the training is run; it is essentially a configuration file that controls all aspects of training.
We start by training the entire model on "Basic Type" while detaching all other classification heads from the computation graph (so they do not impact the representations from the base model). To do this, we set the following settings in `config_train.yaml` (leaving all other rows unchanged):

Whether you use the roberta or distilbert model, make sure to adjust the learning rate parameter in `config_train.yaml` accordingly.
```yaml
model:
  freeze_base: false
  attached_heads:
    - "Basic Type"
training:
  metric_for_best_model: "basic_type_accuracy"
  lr: 2e-5 # .001 for distilbert, 2e-5 for roberta
```
Once this file is updated, re-run `make train`.
Next, we load the model trained on "Basic Type" only and train the full model on "Sub-Types".
```yaml
model:
  starting_checkpoint: path/to/basic/type/trained/model
  freeze_base: false
  attached_heads:
    - "Sub-Types"
training:
  metric_for_best_model: "mean_f1_score"
  lr: 2e-5 # .001 for distilbert, 2e-5 for roberta
```
The results after these two steps are usually quite good.
When running the above, pick one of roberta or distilbert and then run the first stage, which will generate the starting checkpoint for use in the second. Make sure that the model in `starting_checkpoint` aligns with the model type you are training.
If we have further cleaned the data and would like to just retrain the classification heads (without retraining the base model), we can train the model with the following settings:
```yaml
model:
  starting_checkpoint: path/to/fine/tuned/model
  freeze_base: true
  reset_classification_heads: true
  attached_heads: null # Doesn't matter since base is frozen
training:
  metric_for_best_model: "mean_f1_score"
```
We will typically be running inference on a spreadsheet of food labels. The output is set up to match CGFP's name normalization helper.
To run inference:
- Load the model and tokenizer:

```python
from transformers import AutoTokenizer
from cgfp.inference.inference import MultiTaskModel  # import path is a guess; adjust to your checkout

model = MultiTaskModel.from_pretrained("uchicago-dsi/cgfp-roberta")
tokenizer = AutoTokenizer.from_pretrained("uchicago-dsi/cgfp-roberta")
```
- Use the `inference_handler` function:

```python
inference_handler(
    model,
    tokenizer,
    input_path=INPUT_PATH,
    save_dir=DATA_DIR,
    device=device,
    sheet_name=SHEET_NUMBER,
    input_column="Product Type",
    assertion=True,
)
```
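Here, `device` is a standard torch device, e.g.:

```python
import torch

# Use a GPU if one is available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```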
An example Colab notebook to run inference is available here.
The CGFP model is set up to predict multiple different columns from a single input string. For most of these columns, the model will make a single prediction (often including leaving the column empty).
However, for Sub-Types, we are doing multi-label prediction. This is because a single item can have multiple sub-types, and, while there is some ordering to the sub-types, it is inconsistent. So, we have a single Sub-Types multi-label classification head that can predict multiple sub-types. Any predicted sub-types are allocated to sub-type columns during inference.
Note that Huggingface is set up reasonably well to handle multi-task classification and multi-label classification separately, but combining the two required some customization.
We tried to document the logic for doing this, but there are several non-standard things happening in the `forward` method of the model (and the loss functions) in order to do multi-task learning with one head being a multi-label head.
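As a rough illustration of the idea (not the repo's actual `forward` code; all names below are hypothetical):

```python
import torch
import torch.nn.functional as F

def multi_task_loss(head_logits: dict, labels: dict) -> torch.Tensor:
    """Hypothetical sketch: per-head cross-entropy for single-label columns,
    binary cross-entropy for the multi-label Sub-Types head."""
    total = 0.0
    for column, logits in head_logits.items():
        if column == "Sub-Types":
            # Multi-label head: independent sigmoid per sub-type tag,
            # targets are multi-hot float vectors
            total = total + F.binary_cross_entropy_with_logits(logits, labels[column])
        else:
            # Single-label heads: softmax cross-entropy, targets are class indices
            total = total + F.cross_entropy(logits, labels[column])
    return total
```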
The model occasionally makes absurd predictions. These are usually from inputs that are outside of anything it has seen during training. We can usually catch these by noticing when "Food Product Group", "Food Product Category" & "Primary Food Product Category" do not make sense together.
If `assertion=True` is passed to `inference_handler`, a blank row will be output for any prediction where the outputs for "Food Product Group", "Food Product Category", and "Primary Food Product Category" are mismatched. Pass `assertion=False` to disable this behavior.
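A minimal sketch of what such a consistency check can look like (the mapping below is hypothetical, not the repo's actual constants):

```python
# Hypothetical mapping from Food Product Group to its valid categories
VALID_CATEGORIES = {
    "Produce": {"Fruit", "Vegetables"},
    "Milk & Dairy": {"Cheese", "Milk", "Yogurt"},
    # ...
}

def is_consistent(pred: dict) -> bool:
    """Return True if the predicted group and categories make sense together."""
    allowed = VALID_CATEGORIES.get(pred["Food Product Group"], set())
    return (
        pred["Food Product Category"] in allowed
        and pred["Primary Food Product Category"] in allowed
    )
```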
We host the production versions of the models on Huggingface at `uchicago-dsi/cgfp-distilbert` and `uchicago-dsi/cgfp-roberta`.
There are commands in the `Makefile` to update the models hosted on Huggingface. Make sure the performance on these is good and stable before updating, since CGFP is actively using these models!