This project builds upon the YOLOv12 architecture to perform multi-task learning:
- Object Detection: Detect food items.
- Weight Prediction: Predict the weight (in grams) of each detected food item.
We introduce an additional regression head to YOLOv12 to predict weights, enabling simultaneous localization and portion estimation from a single image.
- Multi-task learning: food object detection and per-item weight (in grams) prediction.
- Single unified model: Jointly trained for classification, localization, and regression tasks.
- Evaluation metrics: Includes MAE (Mean Absolute Error) for weight estimation.
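For reference, MAE over weight predictions is simply the mean of absolute differences between predicted and ground-truth weights. A minimal sketch (the values below are illustrative, not from the dataset):

```python
def weight_mae(predicted, actual):
    """Mean Absolute Error between predicted and ground-truth weights (grams)."""
    assert len(predicted) == len(actual) and predicted, "need matched, non-empty lists"
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Illustrative example with made-up weights (grams)
pred = [150.0, 80.0, 210.0]
gt = [140.0, 95.0, 200.0]
print(weight_mae(pred, gt))  # ≈ 11.67
```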
Our model is trained and evaluated on a specialized food dataset with annotated bounding boxes and weight labels in grams, available on Hugging Face:
➡️ Download Food Portion Benchmark Dataset on Hugging Face
Each image has an associated `.txt` label file containing six columns:
- `class_id` (integer): ID of the food class.
- `x_center` (float): Normalized x center of the bounding box (0 to 1).
- `y_center` (float): Normalized y center of the bounding box (0 to 1).
- `width` (float): Normalized width of the bounding box (0 to 1).
- `height` (float): Normalized height of the bounding box (0 to 1).
- `weight` (float): Ground-truth weight of the food item in grams.
This extended label format enables simultaneous object detection and weight regression.
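A label file in this format can be parsed with a few lines of standard-library Python. This is a sketch based on the six-column layout described above (the dict keys are our own naming, not part of the dataset):

```python
from pathlib import Path

def parse_label_file(path):
    """Parse a 6-column label file: class_id x_center y_center width height weight."""
    rows = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        cid, xc, yc, w, h, wt = line.split()
        rows.append({
            "class_id": int(cid),
            "x_center": float(xc),   # normalized, 0..1
            "y_center": float(yc),   # normalized, 0..1
            "width": float(w),       # normalized, 0..1
            "height": float(h),      # normalized, 0..1
            "weight_g": float(wt),   # grams
        })
    return rows
```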
Training results comparing the different versions of the YOLOv8 and YOLOv12 models
You can download the best-performing pretrained YOLOv12-M model weights here:
conda create -n yolov12_foodweight python=3.11
conda activate yolov12_foodweight
# Install dependencies
pip install -r requirements.txt
pip install -e .
# (Optional) For FlashAttention support
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu11torch2.2cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
pip install flash_attn-2.7.3+cu11torch2.2cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
Training is handled through the `train.py` script, which loads the modified YOLOv12 model configuration, prepares the dataset, and launches the training process.
- You can train the model from scratch or fine-tune a pretrained YOLOv12 checkpoint.
- The model is trained to perform both object detection and weight regression tasks simultaneously.
- The training outputs include model checkpoints, loss curves, and metric evaluations over epochs.
We provide a few scripts for generating predictions:
- `calculate_weight_MAE.py`: Runs inference, computes the regression MAE metric for weight prediction, and optionally saves annotated images showing detections and predicted weights.
- `predict_txt.py`: Runs inference and saves the predictions in `.txt` format.
- `predict_csv.py`: Runs inference and saves the predictions in `.csv` format.
- `YOLOv8_version_code`: Contains the code for the YOLOv8 version of this project, as described in the paper.
Each prediction contains:
`image_name`, `class_id`, `xmin`, `ymin`, `xmax`, `ymax`, `weight`, `confidence`
Choose the format depending on your post-processing or evaluation needs.
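Note that the label files store normalized center coordinates while the prediction files use absolute corner coordinates (`xmin`, `ymin`, `xmax`, `ymax`). A small conversion helper bridges the two representations (a sketch; the function name is our own):

```python
def yolo_to_corners(x_center, y_center, width, height, img_w, img_h):
    """Convert a normalized YOLO box (center format) to absolute pixel corners."""
    xmin = (x_center - width / 2) * img_w
    ymin = (y_center - height / 2) * img_h
    xmax = (x_center + width / 2) * img_w
    ymax = (y_center + height / 2) * img_h
    return xmin, ymin, xmax, ymax

# A centered box covering 20% x 40% of a 640x480 image
print(yolo_to_corners(0.5, 0.5, 0.2, 0.4, 640, 480))  # (256.0, 144.0, 384.0, 336.0)
```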
This project is based on ultralytics/ultralytics and YOLOv12. We extend the original work with an additional regression head for food weight prediction.
Please cite our work if you use the Multi-task model. (Citation will be added after publication.)
@article{,
title={A Multitask Deep Learning Model for Food Scene Recognition and Portion Estimation—the Food Portion Benchmark (FPB) Dataset},
author={Sanatbyek, Aibota and Rakhimzhanova, Tomiris and Nurmanova, Bibinur and Omarova, Zhuldyz and Rakhmankulova, Aidana and Orazbayev, Rustem and Varol, Huseyin Atakan and Chan, Mei Yen},
journal={IEEE Access},
year={2025}
}