
MemClipCap: Enhancing ClipCap with long-range dependency handling for video captioning

This repository provides the code for our project, which combines kNN-Memory and ClipCap to improve long-range dependency handling in video captioning. The project builds on the ClipCap and Memorizing Transformers repositories and was conducted as part of the Deep Learning 2 course at the University of Amsterdam. You can read our comprehensive report here.

Project Structure

The project is structured as follows:

├── checkpoints (model checkpoints)
├── demos (demo notebooks)
├── images (images used in the report)
├── logs (training logs)
├── src (source code)
│   ├── dataset (dataset code, including parsers)
│   ├── evaluation (evaluation code for metrics)
│   ├── memorizing_transformers_pytorch (Memorizing Transformers code)
│   ├── models (model code for kNN-Memory and ClipCap)
│   ├── generate_captions.py (generate captions for a dataset)
│   ├── predict.py (predict captions for a video)
│   ├── train.py (train a model)
│   ├── validate.py (validate a model)
│   ├── utils.py (utility functions)
├── environment.yml (conda environment file)
├── requirements.txt (pip requirements file)
├── blogpost.md (report)
├── pyproject.toml (project file)
└── README.md (this file)

Requirements

The code is written in Python 3.10. Install the required packages either with pip install -r requirements.txt or by creating a conda environment from the provided environment.yml file with conda env create -f environment.yml. If you use conda, activate the environment with conda activate knn-memory-clipcap.
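
For reference, the two installation routes described above are:

    # Option 1: pip
    pip install -r requirements.txt

    # Option 2: conda
    conda env create -f environment.yml
    conda activate knn-memory-clipcap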

Dataset

Our experiments use the ActivityNet Captions dataset. Use one of the following methods to download it. The first method is recommended; the second is provided only for full reproducibility.

  1. To download the pre-processed video clips, run:

    cd src/data/
    
    wget "https://drive.google.com/u/0/uc?id=1fhZc7yM4Xja7rixz7hBLspPsYbaEQBYm&export=download&confirm=t" -O activitynet_ViT-B_32_train_first_2000.pkl
    wget "https://drive.google.com/u/0/uc?id=1vliDDQxoSdrl5ZaJ-9DZBBEc8cQYwztA&export=download&confirm=t" -O activitynet_ViT-B_32_dev_first_250.pkl
    wget "https://drive.google.com/u/0/uc?id=1C2qaf3xBXwfr-LDfygnO8GK-12DCuxxn&export=download&confirm=t" -O activitynet_ViT-B_32_validation_first_500.pkl
    wget "https://drive.google.com/u/0/uc?id=1KHAXlNhp3GoXyh1mmLCr4iuktqez92F8&export=download&confirm=t" -O activitynet_ViT-B_32_dev_all_67.pkl
    wget "https://drive.google.com/u/0/uc?id=1rsQgeIveEXyFqVicBFMmNaaZa4VO7jWZ&export=download&confirm=t" -O activitynet_ViT-B_32_train_all_540.pkl
    wget "https://drive.google.com/u/0/uc?id=18MK9omT8qNfuW69KL_WZwP2PSBhMrYdV&export=download&confirm=t" -O activitynet_ViT-B_32_validation_all_133.pkl

    Instead of wget, you can also download the files manually from here and place them in the src/data/ folder. The pre-processed COCO dataset is available there as well. A quick way to sanity-check a downloaded file is sketched at the end of this section.

  2. If you want to download the entire ActivityNet Captions dataset from scratch, run:

    python3 src/dataset/download_dataset.py

    WARNING: this will download the entire dataset, which is about 200 GB in size.

    To extract frames from the downloaded videos or your own videos, execute:

    python3 src/dataset/extract_frames.py -r <path_to_videos>

    This command creates a frames folder in the videos' parent directory. By default, frames are extracted at 5 fps. To modify this setting, use the -fps flag. The script also generates a summary CSV file in the frames folder, containing the video ID, frame extraction success status, and number of frames extracted.

    To pre-process the dataset, run:

    python3 src/dataset/parsers/parse_activitynet.py --split <split> 

    Other arguments are available; see python3 src/dataset/parsers/parse_activitynet.py --help for more information.
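
If you want to sanity-check one of the pre-processed files from the first method, a few lines of Python are enough. This is a minimal sketch: it only assumes the files are ordinary pickles readable in the project environment; the exact structure of the stored object depends on the parser and is not documented here.

    import pickle

    # Example path; adjust it to whichever split you placed in src/data/.
    path = "src/data/activitynet_ViT-B_32_dev_first_250.pkl"

    with open(path, "rb") as f:
        data = pickle.load(f)

    # The stored structure is parser-specific, so only print a rough summary
    # instead of assuming particular keys or fields.
    print(type(data))
    if isinstance(data, dict):
        print(list(data.keys())[:10])
    elif isinstance(data, (list, tuple)) and data:
        print(f"{len(data)} items; first item is a {type(data[0]).__name__}")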

Demo

A caption for a video can be generated with the demo notebook in demos/demo.ipynb.

Training

To train a model, run:

python src/train.py --train_path activitynet_ViT-B_32_train_first_2000.pkl --valid_path activitynet_ViT-B_32_dev_first_250.pkl --checkpoint checkpoints/coco/coco_prefix-best.pt --prefix activitynet_with_memory --only_prefix --use_video_dataset --use_memory

Use the --use_memory flag to enable kNN-Memory and the --use_video_dataset flag to use the video dataset. Additionally, the --only_prefix flag can be used to train only the prefix model. The full argument list is available using python src/train.py --help.
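
For comparison, a baseline without kNN-Memory can be trained by simply omitting the --use_memory flag; the --prefix value below is just an illustrative run name:

python src/train.py --train_path activitynet_ViT-B_32_train_first_2000.pkl --valid_path activitynet_ViT-B_32_dev_first_250.pkl --checkpoint checkpoints/coco/coco_prefix-best.pt --prefix activitynet_without_memory --only_prefix --use_video_dataset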

Evaluation

To evaluate a model, run:

python src/validate.py --data src/data/ --checkpoint checkpoints/activitynet_with_memory-best.pt --only_prefix --use_video_dataset --use_memory

The full argument list is available using python src/validate.py --help.

Generate Captions

To generate captions for a dataset, run:

python src/generate_captions.py --data src/data/ --checkpoint checkpoints/activitynet_with_memory-best.pt --only_prefix --use_video_dataset --use_memory

This will generate two JSON files that can be used to calculate the evaluation metrics. The full argument list is available using python src/generate_captions.py --help.

Evaluation Metrics

To calculate the evaluation metrics on previously generated captions, run:

python src/evaluation/evaluate_captions.py --submission <captions_file>.json --references <reference_file>.json

Here, <captions_file>.json is the file generated by src/generate_captions.py and <reference_file>.json is the file containing the ground-truth captions. Our generated captions can be found in the organized_data folder. Running the evaluation requires Java to be installed on your device. The full argument list is available using python src/evaluation/evaluate_captions.py --help.
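
If you are unsure whether Java is available, a quick check is:

java -version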

Acknowledgements

This project was conducted as part of the Deep Learning 2 course at the University of Amsterdam. We would like to thank the course staff for their support and feedback.
