Koki Maeda(3,1)*, Tosho Hirasawa(4,1)*, Atsushi Hashimoto(1), Jun Harashima(2), Leszek Rybicki(2), Yusuke Fukasawa(2), Yoshitaka Ushiku(1)
(1) OMRON SINIC X Corp. (2) Cookpad Inc. (3) Tokyo Institute of Technology (4) Tokyo Metropolitan University
*: Equal contribution. This work was done during an internship at OMRON SINIC X.
Note
@InProceedings{comkitchens_eccv2024,
author = {Koki Maeda and Tosho Hirasawa and Atsushi Hashimoto and Jun Harashima and Leszek Rybicki and Yusuke Fukasawa and Yoshitaka Ushiku},
title = {COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark},
booktitle = {Proceedings of the European Conference on Computer Vision},
year = {2024},
}
The COM Kitchens dataset provides cooking videos annotated with structured visual action graphs. The dataset currently supports two benchmarks:
- Dense Video Captioning on unedited fixed-viewpoint videos (DVC-FV)
- Online Recipe Retrieval (OnRR)
We provide the full dataset for both benchmarks, together with .dat files that define the train/validation/test splits.
data
├── ap                              # captions for each action-by-person (AP) entry
├── frames                          # frames extracted from the videos (split into train/valid/test)
├── frozenbilm                      # features extracted by FrozenBiLM (used by Vid2Seq)
├── main                            # recipes annotated by humans
│   └── {recipe_id}                 # recipe id
│       └── {kitchen_id}            # kitchen id
│           ├── cropped_images                    # cropped bounding-box images for the visual action graph
│           ├── frames                            # annotated frames for APs of the visual action graph
│           ├── front_compressed.mp4              # recorded video
│           ├── annotations.xml                   # annotations in XML format
│           ├── gold_recipe_translation_en.json   # English translation of the recipe annotations
│           ├── gold_recipe.json                  # rewritten recipe (in Japanese)
│           ├── graph.dot                         # visual action graph
│           ├── graph.dot.pdf                     # visualization of the visual action graph
│           └── obj.names
├── ingredients.txt                 # ingredient list of the COM Kitchens dataset
├── ingredients_translation_en.txt  # English translation of the ingredient list
├── train.txt                       # recipe ids in the train split
└── val.txt                         # recipe ids in the validation split
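As a quick sanity check of the layout above, the following sketch walks `data/main` and lists the available recipe/kitchen pairs. The directory and file names come from the tree above; the script itself is only an illustrative assumption, not part of the released tooling.

```python
from pathlib import Path

DATA_ROOT = Path("data")  # adjust to wherever the dataset is placed

# Enumerate data/main/{recipe_id}/{kitchen_id} directories.
for recipe_dir in sorted((DATA_ROOT / "main").iterdir()):
    if not recipe_dir.is_dir():
        continue
    for kitchen_dir in sorted(recipe_dir.iterdir()):
        if not kitchen_dir.is_dir():
            continue
        has_video = (kitchen_dir / "front_compressed.mp4").exists()
        has_graph = (kitchen_dir / "graph.dot").exists()
        print(f"recipe={recipe_dir.name} kitchen={kitchen_dir.name} "
              f"video={has_video} graph={has_graph}")
```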
gold_recipe.json
This file provides the recipe information, to which the visual action graph is attached.

| key | value | description |
| --- | --- | --- |
| "recipe_id" | str | recipe id |
| "kitchen_id" | int | kitchen id |
| "ingredients" | List[str] | ingredient list (in Japanese) |
| "ingredient_images" | List[str] | paths to the images of each ingredient |
| "steps" | List[Dict] | annotations by step |
| "steps/memo" | str | recipe sentence |
| "steps/words" | List[str] | recipe sentence split word by word |
| "steps/ap_ids" | List[Dict] | correspondence between APs and words |
| "actions_by_person" | List[str] | annotations of the visual action graph, including time spans and bounding boxes |
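As a minimal reading sketch (the keys follow the table above; the path below is a placeholder and the snippet is not part of the official tooling), a single `gold_recipe.json` can be inspected like this:

```python
import json
from pathlib import Path

# Placeholder path: substitute a real {recipe_id}/{kitchen_id} pair from data/main.
path = Path("data/main/{recipe_id}/{kitchen_id}/gold_recipe.json")
recipe = json.loads(path.read_text(encoding="utf-8"))

print(recipe["recipe_id"], recipe["kitchen_id"])
print("ingredients:", recipe["ingredients"])   # ingredient list (in Japanese)
for step in recipe["steps"]:
    print("memo :", step["memo"])              # recipe sentence
    print("words:", step["words"])             # sentence split word by word
    # step["ap_ids"] holds the correspondence between APs and words
```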
gold_recipe_translation_en.json
This file provides only the translated recipe information.

| key | value | description |
| --- | --- | --- |
| "ingredients" | List[str] | ingredient list (in English) |
| "steps" | List[Dict] | annotations by step |
| "steps/memo" | str | recipe sentence |
| "steps/words" | List[str] | recipe sentence split word by word |
| "steps/ap_ids" | List[Dict] | correspondence between APs and words |
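Assuming the translation file mirrors the step structure of `gold_recipe.json` (as the matching keys above suggest), the two files can be read side by side; this is a sketch under that assumption, not official tooling.

```python
import json
from pathlib import Path

kitchen_dir = Path("data/main/{recipe_id}/{kitchen_id}")  # placeholder path
ja = json.loads((kitchen_dir / "gold_recipe.json").read_text(encoding="utf-8"))
en = json.loads((kitchen_dir / "gold_recipe_translation_en.json").read_text(encoding="utf-8"))

# Pair each Japanese step sentence with its English translation.
for ja_step, en_step in zip(ja["steps"], en["steps"]):
    print(ja_step["memo"], "->", en_step["memo"])
```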
Note
English support for the application form will be available soon.
- Dataset Preparation
  - Download the annotation files and videos.
- Preprocess
  - Run `python -m com_kitchens.preprocess.video` to extract all frames from the videos.
  - Run `python -m com_kitchens.preprocess.recipe` to extract all action-by-person entries from the videos.
Warning
While the preprocessing step extracts all frames for simplicity, you can save disk space by extracting only the frames referenced by the annotation files (see the sketch below).
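A rough sketch of such selective extraction is shown here. It assumes you have already collected the frame indices you need from the annotation files (that collection step is omitted), and it uses OpenCV directly rather than the repository's own preprocessing code.

```python
from pathlib import Path

import cv2  # pip install opencv-python


def extract_frames(video_path: str, frame_ids: list, out_dir: str) -> None:
    """Decode and save only the requested frame indices."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    for idx in sorted(set(frame_ids)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the frame
        ok, frame = cap.read()
        if ok:
            cv2.imwrite(str(out / f"{idx:08d}.png"), frame)
    cap.release()


# Example call with placeholder path and indices:
# extract_frames("data/main/{recipe_id}/{kitchen_id}/front_compressed.mp4",
#                frame_ids=[0, 150, 300], out_dir="frames_subset")
```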
- Training
  - Run `sh scripts/onrr-train-xclip.sh` for a simple start of training.
- Evaluation
  - Run `sh scripts/onrr-eval-xclip.sh {your/path/to/ckpt}` for evaluation.
For UniVL, you need to extract S3D features from the videos.
- Download `s3d_howto100m.pth` to `cache/s3d_howto100m.pth` (or another path you configure).
- Run `sh scripts/extract_s3d_features.sh` to extract the S3D features.
- Download the pretrained model `univl.pretrained.bin` to `cache/univl.pretrained.bin` (or another path you configure).
- Then run `sh scripts/onrr-train-univl.sh` to train UniVL models.
- Docker Images
  - Run `make build-docker-images` to build the Docker images.
- Preprocess
  - Run `sh scripts/dvc-vid2seq-prep` to prepare the inputs for Vid2Seq.
- Training & Evaluation
- Run
sh scripts/vid2seq-zs.sh
to evaluate a pre-trained vid2seq model - Run
sh scripts/vid2seq-ft.sh
to fine-tune and evaluate a vid2seq model - RUn
sh scripts/vid2seq-ft-rl-as.sh
to fine-tune and evaluate a vid2seq model incorporating action graph as both relation labels and attention supervision (RL+AS)
- Run
This project (excluding the dataset) is licensed under the MIT License; see the LICENSE.txt file for details.