Merge branch 'self_critical_bottom_up' into self-critical
* self_critical_bottom_up: (42 commits)
Add advanced. (Still nothing in it.)
Update readme.
Sort the features in the forward pass instead of in the dataloader.
Add compatibility to resnet features.
Add comments in AttModel.
Make image_root an optional argument in prepro_labels.
Add options and verbose for make_bu_data.
Add cider submodule
Simplify resnet code.
Update more to 0.4 version.
Update to pytorch 0.4
Fix some in evals.
Simplify AttModel.
Update FC Model to the compatible version (previously FC Model was deprecated and not adapted to the new structure).
Move set_lr to the right place in train.py
Add max ppl option (beam search sorted by perplexity instead of logprob). (It doesn't seem to change much.)
Fix a bug in ensemble sample.
Add logit layers option. (Haven't rigorously tested whether it works.)
Allow a new way of computing (using packed sequences) that is capable of using DataParallel.
Add batch normalization layer in att_embed.
...
# Conflicts:
# misc/rewards.py
# train.py
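For context, the conflicted files implement the self-critical objective from the SCST paper: the reward for a sampled caption is its CIDEr score minus the CIDEr score of the greedily decoded baseline caption. A minimal sketch of the resulting policy-gradient loss follows; the tensor and function names are illustrative, not the repository's actual interface in `misc/rewards.py` or `train.py`.

```python
# Minimal sketch of the self-critical (SCST) loss: REINFORCE with the greedy
# decode as baseline. Names are illustrative, not the repo's actual API.
import torch

def scst_loss(sample_logprobs, sample_mask, sample_cider, greedy_cider):
    """
    sample_logprobs: (batch, seq_len) log-probabilities of the sampled words
    sample_mask:     (batch, seq_len) 1 for real words, 0 for padding
    sample_cider:    (batch,) CIDEr of each sampled caption
    greedy_cider:    (batch,) CIDEr of each greedily decoded caption
    """
    # Advantage: how much better sampling did than the greedy baseline.
    reward = (sample_cider - greedy_cider).unsqueeze(1)   # (batch, 1)
    # Push up log-probs of captions that beat the baseline, push down the rest.
    loss = -(sample_logprobs * reward * sample_mask).sum() / sample_mask.sum()
    return loss
```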
README.md: 65 additions & 16 deletions
@@ -1,46 +1,78 @@
-# Self-critical Sequence Training for Image Captioning
+# Self-critical Sequence Training for Image Captioning (+ misc.)
 
-This is an unofficial implementation for [Self-critical Sequence Training for Image Captioning](https://arxiv.org/abs/1612.00563). The result of FC model can be replicated. (Not able to replicate Att2in result.)
+This repository includes unofficial implementations of [Self-critical Sequence Training for Image Captioning](https://arxiv.org/abs/1612.00563) and [Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering](https://arxiv.org/abs/1707.07998).
 
-The author helped me a lot when I tried to replicate the result. Great thanks. The latest topdown and att2in2 model can achieve 1.12 Cider score on Karpathy's test split after self-critical training.
+The author of SCST helped me a lot when I tried to replicate the result. Great thanks. The att2in2 model can achieve more than a 1.20 CIDEr score on Karpathy's test split (with self-critical training, bottom-up features, a large RNN hidden size, and no ensemble).
 
-This is based on my [neuraltalk2.pytorch](https://github.com/ruotianluo/neuraltalk2.pytorch) repository. The modifications is:
-- Add self critical training.
+This is based on my [ImageCaptioning.pytorch](https://github.com/ruotianluo/ImageCaptioning.pytorch) repository. The modifications are:
+- Self-critical training.
+- Bottom-up feature support from [ref](https://arxiv.org/abs/1707.07998). (Evaluation on arbitrary images is not supported.)
+- Ensemble
+- Multi-GPU training
 
 ## Requirements
 Python 2.7 (because there is no [coco-caption](https://github.com/tylin/coco-caption) version for Python 3)
-PyTorch 0.2 (along with torchvision)
+PyTorch 0.4 (along with torchvision)
+cider (already added as a submodule)
 
-You need to download pretrained resnet model for both training and evaluation. The models can be downloaded from [here](https://drive.google.com/open?id=0B7fNdx_jAqhtbVYzOURMdDNHSGM), and should be placed in `data/imagenet_weights`.
+(**Skip if you are using bottom-up features**): If you want to use resnet to extract image features, you need to download the pretrained resnet model for both training and evaluation. The models can be downloaded from [here](https://drive.google.com/open?id=0B7fNdx_jAqhtbVYzOURMdDNHSGM) and should be placed in `data/imagenet_weights`.
 
-## Pretrained models
+## Pretrained models (using resnet101 features)
 Pretrained models are provided [here](https://drive.google.com/open?id=0B7fNdx_jAqhtdE1JRXpmeGJudTg). The performance of each model will be maintained in this [issue](https://github.com/ruotianluo/neuraltalk2.pytorch/issues/10).
 
-If you want to do evaluation only, then you can follow [this section](#generate-image-captions) after downloading the pretrained models.
+If you want to do evaluation only, you can then follow [this section](#generate-image-captions) after downloading the pretrained models (and also the pretrained resnet101).
 
 ## Train your own network on COCO
 
-### Download COCO dataset and preprocessing
-
-First, download the coco images from [link](http://mscoco.org/dataset/#download). We need 2014 training images and 2014 val. images. You should put the `train2014/` and `val2014/` in the same directory, denoted as `$IMAGE_ROOT`.
+### Download COCO captions and preprocess them
 
 Download preprocessed coco captions from [link](http://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip) from Karpathy's homepage. Extract `dataset_coco.json` from the zip file and copy it into `data/`. This file provides preprocessed captions and also standard train-val-test splits.
 
-Once we have these, we can now invoke the `prepro_*.py` script, which will read all of this in and create a dataset (two feature folders, an hdf5 label file and a json file).
 `prepro_labels.py` will map all words that occur <= 5 times to a special `UNK` token, and create a vocabulary for all the remaining words. The image information and vocabulary are dumped into `data/cocotalk.json` and the discretized caption data are dumped into `data/cocotalk_label.h5`.
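For intuition, the `<= 5` thresholding described above amounts to something like the sketch below; the function and variable names are illustrative, not the script's actual interface.

```python
# Rough sketch of the word-count thresholding performed by prepro_labels.py
# (illustrative names only; the real script also builds the hdf5 label arrays).
from collections import Counter

def build_vocab(tokenized_captions, count_threshold=5):
    """tokenized_captions: iterable of token lists, e.g. [['a', 'dog', 'runs'], ...]"""
    counts = Counter(w for caption in tokenized_captions for w in caption)
    # Keep words seen more than `count_threshold` times ...
    vocab = [w for w, n in counts.items() if n > count_threshold]
    # ... and map everything rarer to a single UNK token.
    vocab.append('UNK')
    return vocab

print(build_vocab([['a', 'cat'], ['a', 'dog']], count_threshold=1))  # ['a', 'UNK']
```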
 
+### Download COCO dataset and pre-extract the image features (Skip if you are using bottom-up features)
+
+Download the coco images from [link](http://mscoco.org/dataset/#download). We need the 2014 training images and the 2014 val images. You should put `train2014/` and `val2014/` in the same directory, denoted as `$IMAGE_ROOT`.
 `prepro_feats.py` extracts the resnet101 features (both the fc feature and the last conv feature) of each image. The features are saved in `data/cocotalk_fc` and `data/cocotalk_att`, and the resulting files are about 200GB.
 
 (Check the prepro scripts for more options, like other resnet models or other attention sizes.)
 
 **Warning**: the prepro script will fail with the default MSCOCO data because one of their images is corrupted. See [this issue](https://github.com/karpathy/neuraltalk2/issues/4) for the fix; it involves manually replacing one image in the dataset.
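To see what the "fc" versus "att" feature means, here is an illustrative sketch using a stock torchvision resnet101. The repository's `prepro_feats.py` uses its own resnet wrapper, preprocessing, and attention-map size, so treat this only as a sketch of the idea.

```python
# Illustrative only: an "fc" (pooled) feature and an "att" (spatial) feature
# from a torchvision resnet101; not the repository's actual extraction code.
import torch
import torchvision

resnet = torchvision.models.resnet101(pretrained=True)
resnet.eval()
# Drop the final average-pool and fc layers to keep the spatial feature map.
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)       # stand-in for a preprocessed image
    att_feat = backbone(image)                # (1, 2048, 7, 7) spatial feature
    fc_feat = att_feat.mean(-1).mean(-1)      # (1, 2048) pooled feature
```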
 
+### Download Bottom-up features (Skip if you are using resnet features)
+
+Download the pre-extracted features from [link](https://github.com/peteanderson80/bottom-up-attention). You can download either the adaptive or the fixed version.
 This will create `data/cocobu_fc`, `data/cocobu_att` and `data/cocobu_box`. If you want to use bottom-up features, just follow the steps below and replace all `cocotalk` with `cocobu`.
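The downloaded features come as TSV files with base64-encoded arrays (as documented in the bottom-up-attention repository); the conversion into the `data/cocobu_*` folders is handled by this repo's `make_bu_data.py` (mentioned in the commit list above). A rough sketch of decoding one row, with an example file path, looks like this:

```python
# Rough sketch of decoding one row of the downloaded bottom-up TSV files.
# Field names follow the bottom-up-attention repository; the path is an example.
import base64
import csv
import sys
import numpy as np

csv.field_size_limit(sys.maxsize)
FIELDNAMES = ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features']

with open('trainval/trainval_resnet101_faster_rcnn_genome_36.tsv') as f:  # example path
    reader = csv.DictReader(f, delimiter='\t', fieldnames=FIELDNAMES)
    for row in reader:
        num_boxes = int(row['num_boxes'])
        # Each blob is a base64-encoded float32 array: features are
        # (num_boxes, 2048), boxes are (num_boxes, 4).
        feats = np.frombuffer(base64.b64decode(row['features']),
                              dtype=np.float32).reshape(num_boxes, -1)
        boxes = np.frombuffer(base64.b64decode(row['boxes']),
                              dtype=np.float32).reshape(num_boxes, 4)
        break  # decode just the first image as a demonstration
```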
+
 ### Start training
 
 ```bash
@@ -68,8 +100,6 @@ First you should preprocess the dataset and get the cache for calculating cider