This project uses the FoodData Central database provided by the USDA. The database contains extensive information on food items and their nutrient profiles.
- Description: The database includes 19,310 general food items.
- Source:
- Contents: Detailed nutritional information for a variety of food items, organized across multiple categories.
- Structuring ChatGPT Responses and Uploading to Hugging Face
  - Data Formatting
  - Example Data Entry
  - Upload to Hugging Face
- Fine-Tuning Script
The purpose of this project is to develop a customized large language model capable of categorizing food or food ingredients based on various features. The project leverages the FoodData Central database and organizes food items using attributes such as:
- Food Code
- Main Food Description
- WWEIA Category Number
- WWEIA Category Description
- Seq Num
- Ingredient Code
- Ingredient Description
- Ingredient Weight (g)
- Retention Code
- Moisture Change (%)
| Food Code | Main Food Description | WWEIA Category Number | WWEIA Category Description | Seq Num | Ingredient Code | Ingredient Description | Ingredient Weight (g) | Retention Code | Moisture Change (%) |
|---|---|---|---|---|---|---|---|---|---|
| 11000000 | Milk, human | 9602 | Human milk | 1 | 1107 | Milk, human, mature, fluid (For Reference Only) | 100 | 0 | 0 |
This model aims to accurately categorize food items based on these features, enabling better data organization and retrieval.
To categorize the 19,310 food items and their features, ChatGPT-3 was used with carefully designed prompt engineering. The model was tasked with assigning a category and a score to each food item based on American Heart Association principles.
A custom prompt was created to ensure accurate and consistent responses. The prompt assigned each food item to one of three categories and gave it a healthiness score based on American Heart Association guidelines.

Categories:
- Heart-Healthy
- Heart-Unhealthy
- Ambivalent

Scores:
- 1 = Unhealthiest food
- 2 = Unhealthy food
- 3 = Ambivalent food
- 4 = Healthy food
- 5 = Healthiest food
def find_category_and_score_v2(user_input):
    """
    Categorize and score food descriptions based on
    American Heart Association principles.
    """
    delimiter = "####"  # marks the start and end of the user query
    system_message = f"""
    You will be provided with a food query ...
    (continues with prompt description)
    """
    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': f"{delimiter}{user_input}{delimiter}"},
    ]
    # get_completion_from_messages wraps the chat-completion API call (see below)
    return get_completion_from_messages(messages)
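The helper `get_completion_from_messages` is not shown in the snippet above; it simply wraps the chat-completion API call. A minimal sketch, assuming the `openai` Python client (v1 interface) and `gpt-3.5-turbo` as the model name (both are assumptions, not confirmed by the project):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_completion_from_messages(messages, model="gpt-3.5-turbo", temperature=0):
    """Send chat messages to the OpenAI API and return the reply text."""
    response = client.chat.completions.create(
        model=model,              # assumed model name; the project calls it "ChatGPT-3"
        messages=messages,
        temperature=temperature,  # 0 keeps labels consistent across 19,310 items
    )
    return response.choices[0].message.content
```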
[
{
"Main food description": "Milk, human",
"Food Code": "11000000",
"category": "Heart-Healthy",
"Score": 5
}
]
This process was repeated for all 19,310 food items, generating a category and score for each.
After several hours of response collection spread over multiple days, a comprehensive database was constructed from the outputs generated by ChatGPT-3. This database contains the categorized results and scores for all 19,310 food items.
The model was prompted using custom-engineered queries to categorize food items based on features like:
- Main food description
- Food code
- WWEIA category number
- Ingredient description, etc.
The responses from the model included JSON-formatted objects, containing:
- Food Category (Heart-Healthy, Heart-Unhealthy, Ambivalent)
- Health Score (1-5, based on American Heart Association guidelines)
[
{
"Main food description": "Milk, human",
"Food Code": "11000000",
"category": "Heart-Healthy",
"Score": 5
},
{
"Main food description": "Beef, raw",
"Food Code": "13010000",
"category": "Heart-Unhealthy",
"Score": 1
}
]
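Because the raw replies sometimes use Python-style single quotes (as in the formatted data entry shown later) rather than strict JSON, a tolerant parser is useful when collecting the responses. A small sketch; the fallback to `ast.literal_eval` is our assumption about how such replies can be handled:

```python
import ast
import json

def parse_classification(raw: str):
    """Parse a ChatGPT classification reply into a list of dicts."""
    try:
        return json.loads(raw)        # strict JSON (double-quoted keys)
    except json.JSONDecodeError:
        return ast.literal_eval(raw)  # Python-literal style (single-quoted keys)
```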
After constructing the database of food categorizations and health scores, the next step was to fine-tune a pre-trained LLaMA model using this dataset. The fine-tuned model was designed to specialize in categorizing food based on American Heart Association guidelines, similar to the original task handled by ChatGPT-3.
The model selected for fine-tuning is the LLaMA-2-7B Chat model, available on Hugging Face:
model_name = "NousResearch/Llama-2-7b-chat-hf"
The instruction dataset created during Task 1 was uploaded to Hugging Face under the following name:
dataset_name = "adarsha30735/datafood"
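Once uploaded, the dataset can be pulled directly with the `datasets` library (a minimal sketch; the `train` split name is an assumption):

```python
from datasets import load_dataset

# Download the instruction dataset from the Hugging Face Hub
dataset = load_dataset("adarsha30735/datafood", split="train")
print(dataset[0])  # inspect one formatted training record
```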
The goal of fine-tuning was to adapt the LLaMA-2-7B model to better understand and categorize food items based on the patterns and features captured in the dataset.
The fine-tuned model was saved under the name llama-2-7b-heartgpt, specializing in food categorization:
new_model = "llama-2-7b-heartgpt"
The new model was trained on this dataset, enabling it to predict food healthiness, assign heart-health categories, and score new food items more accurately than the base model.
The fine-tuned llama-2-7b-heartgpt model is ready for inference tasks related to food categorization and health scoring based on the provided data.
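As an illustration, inference could look like the sketch below, assuming the fine-tuned weights are available locally (or on the Hub) under `llama-2-7b-heartgpt`; the prompt mirrors the `####`-delimited query format used during labeling:

```python
from transformers import pipeline

# Path or Hub id of the fine-tuned model (assumed to be available)
pipe = pipeline("text-generation", model="llama-2-7b-heartgpt")

prompt = (
    "####Categorize and give score to this description of food.\n"
    "Main food description: Milk, human\n"
    "Food Code: 11000000\n"
    "####"
)
print(pipe(prompt, max_new_tokens=128)[0]["generated_text"])
```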
The responses generated by ChatGPT-3 for all 19,310 food items were structured and formatted before being uploaded to Hugging Face. This dataset now serves as the training data for fine-tuning the LLaMA-2-7B model, specializing in food categorization and health scoring.
All 19,310 responses were structured in a consistent format, ensuring each entry included all the necessary fields related to the food item, its features, and ChatGPT's classification result.
<s>
Food code Main food description WWEIA Category number WWEIA Category description Seq num Ingredient code Ingredient description Ingredient weight (g) Retention code Moisture change (%) Formatted Description ChatGPT classification result
</s>
<s>
11000000 Milk, human 9602 Human milk 1 1107 Milk, human, mature, fluid (For Reference Only) 100 0 0 ####Categorize and give score to this description of food.
Main food description: Milk, human
Food Code: 11000000
WWEIA Category description: Human milk
Ingredient description: Milk, human, mature, fluid (For Reference Only)
Weight of the food: 100.0 grams.
####
[{
'Main food description': 'Milk, human',
'Food Code': '11000000',
'category': 'Heart-Healthy',
'Score': 5
}]
</s>
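A sketch of how one such record could be assembled from a row of the source table (field names follow the header above; this helper is our illustration, not the project's actual code):

```python
def build_training_record(row: dict) -> str:
    """Assemble one <s>...</s> training record from a food-item row."""
    query = (
        "####Categorize and give score to this description of food.\n"
        f"Main food description: {row['Main food description']}\n"
        f"Food Code: {row['Food code']}\n"
        f"WWEIA Category description: {row['WWEIA Category description']}\n"
        f"Ingredient description: {row['Ingredient description']}\n"
        f"Weight of the food: {row['Ingredient weight (g)']} grams.\n"
        "####"
    )
    return f"<s>{query}\n{row['ChatGPT classification result']}</s>"
```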
The entire structured dataset was then uploaded to Hugging Face under the following dataset name:
adarsha30735/datafood
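One way the upload could be performed with the `datasets` library (a sketch; the CSV file name is hypothetical, and `huggingface-cli login` is required beforehand):

```python
import pandas as pd
from datasets import Dataset

# Hypothetical file holding the 19,310 structured records, one per row
df = pd.read_csv("structured_food_records.csv")
Dataset.from_pandas(df).push_to_hub("adarsha30735/datafood")
```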
- Script Name: heartgpt_fine_tune_llama_2_in_google_colab.py
- Purpose: Fine-tunes the LLaMA-2-7B model to specialize in food categorization and health scoring based on the data provided.
- Environment: The script is optimized to run in Google Colab, leveraging its computational resources for model training.

The script performs the following steps (a sketch of the core training code follows this list):
- Loads the dataset from Hugging Face.
- Configures model parameters and hyperparameters for training.
- Implements the training loop, validation, and evaluation metrics.
- Saves the fine-tuned model for future inference tasks.
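The core of such a script, sketched here with `peft` and `trl` (the older TRL `SFTTrainer` API used in reference Colabs of that era; all hyperparameters below are illustrative, not the ones actually used):

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig
from trl import SFTTrainer

model_name = "NousResearch/Llama-2-7b-chat-hf"
dataset = load_dataset("adarsha30735/datafood", split="train")

# Load the base model in 4-bit (QLoRA) so it fits Colab GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA adapters: only a small set of extra weights is trained
peft_config = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.1,
                         task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",  # assumed column holding the <s>...</s> records
    max_seq_length=512,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="./results",
        num_train_epochs=1,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
    ),
)
trainer.train()
trainer.model.save_pretrained("llama-2-7b-heartgpt")
```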
To run the script:
- Open Google Colab and create a new notebook.
- Upload the heartgpt_fine_tune_llama_2_in_google_colab.py script to your Colab environment.
- Run the script, ensuring that you have the necessary libraries installed and access to the dataset on Hugging Face.
- After training, the fine-tuned model will be saved under the name llama-2-7b-heartgpt.
The fine-tuning of the LLaMA-2-7B model was completed with the assistance of various contributors who provided insights and guidance throughout the process. A key resource during this phase was the fine-tuning code from younesbelkada, which served as a reference for adapting the model to our specific needs in food categorization and health scoring.
For the detailed reference code, you can check the GitHub page here.
Reference Code: The fine-tuning code adapted from younesbelkada was instrumental in streamlining the training process for our dataset of food items. The structure, methodology, and techniques employed in the reference helped optimize our fine-tuning approach.
The reference code was modified to fit our dataset structure and fine-tuning objectives, ensuring the model could learn effectively from the data. Techniques drawn from the Guanaco dataset fine-tuning example helped our model adhere to best practices for fine-tuning transformer models on task-specific data.
The final output of this collaborative fine-tuning process is the llama-2-7b-heartgpt model, which predicts the healthiness of food items, categorizes them according to American Heart Association guidelines, and scores them based on the processed dataset.