- Nikolay Blagoev (4998901 - [email protected])
- William Narchi (5046122 - [email protected])
Image-to-Image Translation with Conditional Adversarial Networks.
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros. In CVPR 2017. [Bibtex]
Conditional adversarial networks are a popular architecture choice for generative models. The Pix2Pix paper examined by this reproducibility project presents a general CGAN (Conditional Generative Adversarial Network) that can be adapted to any image-to-image translation task. Put simply, given labelled samples in two related domains, the network learns a mapping from one domain to the other.
The architecture consists of a generator network and a discriminator network working against each other [1]. The generator learns to produce plausible fake images in the target domain that correspond to the source image, while the discriminator learns to differentiate real images from fake (generator-created) ones. Being a CGAN, the generator is given both a sample from the source domain and its corresponding label as input, hence it attempts to learn the distribution of the data conditioned on that label [11].
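For reference, the objective optimised in the Pix2Pix paper combines this conditional GAN loss with an L1 reconstruction term (here x is the input label map, y the target photo, and z the noise input) [1]:

```math
\begin{aligned}
\mathcal{L}_{cGAN}(G, D) &= \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))] \\
\mathcal{L}_{L1}(G) &= \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z) \rVert_1\big] \\
G^* &= \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G)
\end{aligned}
```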
The Pix2Pix paper utilises a modified variant of the UNet architecture as the generator [1], [2]. However, in principle, any network that performs semantic segmentation can be used as the generator in the CGAN architecture. This project aims to compare the viability of popular semantic segmentation networks as replacements for the stock UNet presented in the original paper.
- UNet [Architecture and implementation specified in Pix2Pix paper and stock codebase]
- ResNet w/ 9 blocks [Paper] [Implementation in stock codebase]
- UNet++ [Paper] [SMP]
- DeepLabV3+ [Paper] [SMP]
- PSPNet [Paper] [SMP]
- HRNet [Paper] [Implementation]
- LinkNet [Paper] [SMP]
For each model, we adapted its structure to follow the generator structure described in the Pix2Pix paper (convolution → batch normalisation → ReLU) [1]. For most models, the Segmentation Models PyTorch (SMP) implementation was used.
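As an illustration, below is a minimal sketch of how an SMP-based generator can be instantiated for RGB-to-RGB translation. The helper name and the `resnet34` encoder are assumptions for illustration, not necessarily the exact configuration used in our experiments:

```python
import segmentation_models_pytorch as smp

# Hypothetical helper: maps an architecture name to an SMP model configured
# for RGB-to-RGB translation (3 input channels, 3 output channels).
def build_generator(name: str):
    archs = {
        "unet++": smp.UnetPlusPlus,
        "deeplabv3+": smp.DeepLabV3Plus,
        "pspnet": smp.PSPNet,
        "linknet": smp.Linknet,
    }
    return archs[name](
        encoder_name="resnet34",  # assumed backbone, for illustration only
        encoder_weights=None,     # train from scratch
        in_channels=3,            # RGB label map in
        classes=3,                # RGB photo out
        activation=None,          # output activation handled separately (see below)
    )

generator = build_generator("linknet")
```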
Due to time limitations, we tested only on the facades dataset [3].
A generator network was trained with each decoder for 200 epochs. The final results were then evaluated qualitatively (visual appearance) and quantitatively (via the FID score and a comparison of the loss curves). We chose these measures from [4] based on how useful they were for our purposes.
Following the architecture described in the paper (convolution → batch normalisation → ReLU), we encountered mode collapse (the generator found a single image that would consistently trick the discriminator):
This happened regardless of the exact model used as the generator. If the last ReLU activation layer was removed, patchy artifacts were produced (even at 200 epochs):
The original implementation adds a Tanh activation function at the outermost upscaling layer:
if outermost:
    upconv = nn.ConvTranspose2d(inner_nc * 2, outer_nc,
                                kernel_size=4, stride=2,
                                padding=1)
    down = [downconv]
    up = [uprelu, upconv, nn.Tanh()]
    model = down + [submodule] + up
Thus, we followed the same structure: all layers apart from the last one use a ReLU activation, while the outermost one uses a Tanh.
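A minimal sketch of how this can be mirrored for the other generators, wrapping any segmentation backbone so its output passes through Tanh (the wrapper class name is our own, for illustration):

```python
import torch
import torch.nn as nn

class TanhGenerator(nn.Module):
    """Wraps a segmentation backbone and squashes its output to [-1, 1]."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.tanh = nn.Tanh()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.tanh(self.backbone(x))

# e.g. generator = TanhGenerator(build_generator("linknet"))
```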
Real | UNet | ResNet (9 blocks) | UNet++ | DeepLabV3+ | PSPNet | LinkNet | HRNet |
---|---|---|---|---|---|---|---|
All generators were able to recreate some semblance of structure in the fake (generated) images; some notion of windows and a facade is present in all of them.
Visually, PSPNet gave the worst results; the output is blurry and black patches appear in the same spot in all images. The second worst was DeepLabV3+: a clearer structure is visible, but some artifacts remain (most noticeable in the bottom row) and the images are quite blurry. HRNet gave decent results, though still quite blurry. Surprisingly, LinkNet produced a very clear and coherent image for the first input. The best performing were the two UNets, followed closely by LinkNet and the 9-block ResNet, though in the second row some artifacts can be seen (quite noticeable with UNet++ and the 9-block ResNet, and to a lesser extent with UNet at the bottom part of the building).
The Fréchet Inception Distance (FID) is used to evaluate the quality of generated images. It compares the activations of a feature extractor on the ground-truth and generated images to produce a single scalar score, where lower is better [5]. Since its introduction in 2017, it has become a de facto standard for evaluating the performance of generative networks [4].
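For reference, a rough sketch of the FID computation (the Fréchet distance between Gaussians fitted to the two sets of features); `real_feats` and `fake_feats` are assumed to be NumPy arrays of pooled InceptionV3 activations, one row per image:

```python
import numpy as np
from scipy import linalg

def fid_score(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to real and fake feature sets."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_f = np.cov(fake_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical noise
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(sigma_r + sigma_f - 2.0 * covmean))
```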
For the feature extractor, we chose the InceptionV3 model [6]. A batch of 40 previously unseen images was fed to each generator. The resulting 'fake' images were then compared with the ground truth; the FID scores for each generator (evaluated after 200 epochs) are given below:
Generator | FID |
---|---|
UNet (default) | 218.483 |
ResNet (9 blocks) | 226.530 |
LinkNet | 232.488 |
UNet++ | 244.796 |
HRNet | 297.469 |
DeepLabV3+ | 318.598 |
PSPNet | 416.964 |
PSPNet performed the worst, as is evident from the results. Surprisingly, LinkNet scored better than UNet++. The stock UNet variant used in the original paper performed best, but we attribute this to hyperparameter tuning, which we were not able to perform for the other generators due to limited training time.
Provided are the values of the loss terms for both the generator and discriminator plotted against training epochs for all tested generators. All generators exhibit a decrease in L1 loss over time. As for the GAN loss, it remains fairly stationary for the more performant generators, while increasing steadily for the worse performing ones. Conversely, the discriminator's losses fluctuate for the more performant generators, whereas they remain fairly stationary for the worse performing ones, demonstrating that the performant generators fool the discriminator more consistently.
In order to better understand how the generators' performance evolves over time, we provide graphs of FID scores plotted against training epochs for all tested generators.
All tested generators exhibit a general performance improvement as the number of training epochs increases. The most performant ones have an almost monotonically decreasing trend, while the least performant have several erratic spikes. This might be indicative of poor hyperparameter tuning or the generator's inability to adequately generalise over the course of training.
A metric of interest is the colour distribution of each generator's outputs. Our hypothesis is that the generators which give the best results also approximate the colour distributions of the original images well. It is also interesting to investigate whether some generators demonstrate a preference for certain colour extremes (darker images, more blue, etc.).
We chose to conduct this investigation with the stock UNet, UNet++, LinkNet, and PSPNet only, as these provide a good overview of the different performance levels and analysing the other tested networks would yield very similar conclusions. The stock UNet represents the baseline (and best performance) to compare against. UNet++ is intended as an improvement to UNet and is of interest due to its close relation to the paper's stock CGAN architecture. LinkNet provided the best results among the non-stock generators, so further analysis of why it performs so well is worthwhile. Lastly, PSPNet provided the worst results of the tested networks and is similarly interesting to analyse in order to gauge its unsuitability for the task.
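The comparison itself boils down to per-channel intensity histograms over the real and generated images; a small sketch of how these can be computed (assuming batches of uint8 RGB images shaped `(N, H, W, 3)`):

```python
import numpy as np

def channel_histograms(images: np.ndarray, bins: int = 256) -> dict:
    """Normalised per-channel intensity histograms for uint8 RGB images (N, H, W, 3)."""
    return {
        name: np.histogram(images[..., idx], bins=bins, range=(0, 255), density=True)[0]
        for idx, name in enumerate(("red", "green", "blue"))
    }

# e.g. compare channel_histograms(real_images) against channel_histograms(fake_images)
```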
First we investigate the results of the stock UNet generator.
As can be seen, for all channels the distributions of the real and fake images have a similar mean, and a bump appears at the highest values (above 240). The spikes in the real distribution at values below 20 occur because many of the source images include black frames blocking parts of the image. Overall, the UNet generator estimates the true colour distribution well.
Next, we investigate the results of the UNet++ decoder.
The UNet++ decoder estimates the true distribution even better, with a much more pronounced spike at the mean value.
Next is LinkNet.
LinkNet exhibits performance somewhere between UNet++ and UNet, with a much lower spike in the mean values than UNet++.
Lastly is PSPNet.
PSPNet seems to have been influenced much more by the dark patches, generating many more black pixels than the other generators. It also has a slightly off mean value for the blue-channel intensity distribution.
The original paper used a UNet-based autoencoder with 6 downsampling (and corresponding upsampling) layers for its generator [1]. UNet was originally developed for biomedical image segmentation and was shown to outperform most other networks in tasks where data is sparse [2]. The facades dataset consists of about 500 images, which could be one of the reasons why UNet is able to produce better results than the other decoders [3].
UNet++ was designed as an improvement to the original UNet network. It makes use of redesigned skip connections and deep supervision, the latter allowing for more stable results and faster convergence [7]. The authors of [7] demonstrated a minor improvement of the UNet++ autoencoder over its predecessor, so we expected UNet++ to perform as well as, if not better than, the stock network. Throughout our tests it performed close to the UNet autoencoder. As mentioned in the limitations section, we believe that with some hyperparameter tuning, UNet++ and LinkNet would have seen a decent improvement in performance.
Unlike the previous two, LinkNet was not designed for the biomedical domain, but was instead intended for real-time visual semantic segmentation [8]. It has an architecture similar to UNet, consisting of a downsampling part (convolution with ReLU and spatial max pooling) with skip connections to the corresponding upsampling blocks [8]. In our experiments it gave some of the sharpest (i.e. least blurry) and most structured outputs.
PSPNet, the worst performing generator, was expected to give one of the best results in this study, as the network was originally designed for tasks similar to facade segmentation. One potential cause of PSPNet's poor performance that we identify is that the original authors do not make use of skip connections but instead rely on deep supervision, a technique we did not use in our training.
Of interest are two works, [9] and [10], which both compared the performance of different autoencoder architectures on the same task. The former found that UNet and LinkNet gave similar results, while both significantly outperformed PSPNet. The latter found a noticeable improvement of LinkNet over UNet. Our own findings mirror those of the two papers, with the two UNet networks performing similarly to LinkNet and PSPNet performing noticeably worse. Bear in mind that both of these comparisons were performed in the medical domain, which could bias the results in favour of UNet. To the best of our knowledge, no comparison of different network architectures has been performed for other domains.
Due to time restrictions, the generators were trained only on the facades dataset. It would be interesting to see whether the results also hold for the other labelled datasets on which Pix2Pix was evaluated.
Also, as mentioned before, we were not able to perform hyperparameter tuning, which we recognise as a potential reason why all architectures proposed by us performed worse than the default UNet (the one used in the original paper). However, as shown, LinkNet and UNet++ both came close in performance without any additional optimisation.
- Nikolay Blagoev
- Added UNet++, LinkNet, and PSPNet
- Performed FID evaluation and colour distribution comparison
- Wrote qualitative evaluation and discussion
- William Narchi
- Added HRNet and DeeplabV3+
- Tested each generator's performance over epochs
- Trained the models
- Wrote introduction
[1] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[2] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2015, Part III, pp. 234–241, Springer, 2015.
[3] R. Tyleček and R. Šára, “Spatial pattern templates for recognition of objects with regular structure,” Lecture Notes in Computer Science, pp. 364–374, 2013.
[4] A. Borji, “Pros and cons of gan evaluation measures,” Computer Vision and Image Understanding, vol. 179, pp. 41–65, 2019.
[5] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, ‘GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium’, in Advances in Neural Information Processing Systems, 2017, vol. 30.
[6] C. Szegedy et al., ‘Going Deeper with Convolutions’, CoRR, vol. abs/1409.4842, 2014.
[7] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, ‘UNet++: A Nested U-Net Architecture for Medical Image Segmentation’, CoRR, vol. abs/1807.10165, 2018.
[8] A. Chaurasia and E. Culurciello, ‘LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation’, CoRR, vol. abs/1707.03718, 2017.
[9] P. Bizopoulos, N. Vretos, and P. Daras, ‘Comprehensive Comparison of Deep Learning Models for Lung and COVID-19 Lesion Segmentation in CT scans’, arXiv [eess.IV]. 2022.
[10] V. A. Natarajan, M. Sunil Kumar, R. Patan, S. Kallam, and M. Y. Noor Mohamed, ‘Segmentation of Nuclei in Histopathology images using Fully Convolutional Deep Neural Architecture’, in 2020 International Conference on Computing and Information Technology (ICCIT-1441), 2020, pp. 1–7.
[11] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Networks,” 2014.
- Clone this repo:
git clone https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix
cd pytorch-CycleGAN-and-pix2pix
- Install PyTorch 0.4+ and other dependencies (e.g., torchvision, visdom and dominate).
- For pip users, please type the command `pip install -r requirements.txt`.
- For Conda users, you can create a new Conda environment using `conda env create -f environment.yml`.
- For Docker users, we provide the pre-built Docker image and Dockerfile. Please refer to our Docker page.
- Download a pix2pix dataset (e.g. `facades`):
bash ./datasets/download_pix2pix_dataset.sh facades
- To view training results and loss plots, run `python -m visdom.server` and click the URL http://localhost:8097.
- To log training progress and test images to a W&B dashboard, set the `--use_wandb` flag with the train and test scripts.
- Train a model:
#!./scripts/train_pix2pix.sh
python train.py --dataroot ./datasets/facades --name facades_pix2pix --model pix2pix --direction BtoA
To see more intermediate results, check out `./checkpoints/facades_pix2pix/web/index.html`.
- Test the model (`bash ./scripts/test_pix2pix.sh`):
#!./scripts/test_pix2pix.sh
python test.py --dataroot ./datasets/facades --name facades_pix2pix --model pix2pix --direction BtoA
- The test results will be saved to an HTML file here: `./results/facades_pix2pix/test_latest/index.html`. You can find more scripts in the `scripts` directory.
- To train and test pix2pix-based colorization models, please add `--model colorization` and `--dataset_mode colorization`. See our training tips for more details.
Download a pre-trained model with `./scripts/download_pix2pix_model.sh`.
- Check here for all the available pix2pix models. For example, if you would like to download the label2photo model on the Facades dataset,
bash ./scripts/download_pix2pix_model.sh facades_label2photo
- Download the pix2pix facades datasets:
bash ./datasets/download_pix2pix_dataset.sh facades
- Then generate the results using
python test.py --dataroot ./datasets/facades/ --direction BtoA --model pix2pix --name facades_label2photo_pretrained
- Note that we specified `--direction BtoA` as the Facades dataset's A-to-B direction is photos to labels.
- If you would like to apply a pre-trained model to a collection of input images (rather than image pairs), please use the `--model test` option. See `./scripts/test_single.sh` for how to apply a model to Facade label maps (stored in the directory `facades/testB`).
- See a list of currently available models at `./scripts/download_pix2pix_model.sh`.
We provide the pre-built Docker image and Dockerfile that can run this code repo. See docker.
Download pix2pix/CycleGAN datasets and create your own datasets.
Best practice for training and testing your models.
Before you post a new question, please first look at the above Q & A and existing GitHub issues.
If you plan to implement custom models and dataset for your new applications, we provide a dataset template and a model template as a starting point.
To help users better understand and use our code, we briefly overview the functionality and implementation of each package and each module.
You are always welcome to contribute to this repository by sending a pull request.
Please run `flake8 --ignore E501 .` and `python ./scripts/test_before_push.py` before you commit the code. Please also update the code structure overview accordingly if you add or remove files.
If you use this code for your research, please cite our papers.
@inproceedings{CycleGAN2017,
title={Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks},
author={Zhu, Jun-Yan and Park, Taesung and Isola, Phillip and Efros, Alexei A},
booktitle={Computer Vision (ICCV), 2017 IEEE International Conference on},
year={2017}
}
@inproceedings{isola2017image,
title={Image-to-Image Translation with Conditional Adversarial Networks},
author={Isola, Phillip and Zhu, Jun-Yan and Zhou, Tinghui and Efros, Alexei A},
booktitle={Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on},
year={2017}
}
contrastive-unpaired-translation (CUT) | CycleGAN-Torch | pix2pix-Torch | pix2pixHD | BicycleGAN | vid2vid | SPADE/GauGAN | iGAN | GAN Dissection | GAN Paint
If you love cats, and love reading cool graphics, vision, and learning papers, please check out the Cat Paper Collection.
Our code is inspired by pytorch-DCGAN.