
Transfer-Learning-Capabilities-of-Vision-Transformers-and-CNNs

Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block


Table of Contents

  • Introduction
  • Abstract
  • Experimental Setup
  • Results
  • Conclusion
  • Contributing
  • License
  • References

Introduction

Recent advances in computer vision have seen transformer-based architectures surpass the accuracy of state-of-the-art Convolutional Neural Networks (CNNs). However, these models are computationally expensive to train from scratch. This project investigates the transfer learning capabilities of vision transformers and CNNs by fine-tuning a single trainable block of each pre-trained model.

Abstract

This project explores the transfer learning capabilities of vision transformers and CNNs by fine-tuning only the last trainable block of pre-trained models on CIFAR-10 and comparing their accuracy and efficiency. Our results show that the transformer-based architectures achieve higher accuracy than the CNN baseline, although their fine-tuned blocks contain more trainable parameters.

To reproduce the experiments, ensure that the necessary datasets and pre-trained models are available in the specified directories.

Experimental Setup

Dataset

We use the CIFAR-10 dataset for fine-tuning the models pre-trained on ImageNet1K. CIFAR-10 contains 60,000 32x32 color images in 10 classes, with 50,000 training images and 10,000 test images. The images are upscaled to 224x224 using bilinear interpolation for compatibility with pre-trained models.
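For reference, the preprocessing described above can be expressed as the following PyTorch/torchvision sketch. The bilinear 224x224 resize and the batch size of 32 come from this README; the ImageNet normalization statistics are an assumption and may differ from the original training script.

```python
# Minimal sketch of the CIFAR-10 preprocessing described above (PyTorch / torchvision).
# The ImageNet mean/std values are an assumption, not taken from this repository.
import torchvision.transforms as T
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader

transform = T.Compose([
    T.Resize((224, 224), interpolation=T.InterpolationMode.BILINEAR),  # upscale 32x32 -> 224x224
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # assumed ImageNet stats
])

train_set = CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = CIFAR10(root="./data", train=False, download=True, transform=transform)

train_loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=2)
val_loader = DataLoader(test_set, batch_size=32, shuffle=False, num_workers=2)
```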

Models

We select one CNN and two transformer-based models (a minimal loading sketch follows the list):

  • CNN: DenseNet201
  • Transformers: CaiT-S24, DeiT-Base Distilled
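A minimal sketch for loading the three pre-trained backbones is shown below. The timm identifiers are assumptions about which ImageNet-1K checkpoints were used; `num_classes=10` replaces the ImageNet head with a fresh 10-class classifier for CIFAR-10.

```python
# Sketch: loading the three ImageNet-pre-trained backbones with timm.
# The timm identifiers below are assumptions about which checkpoints were used.
import timm

MODEL_IDS = {
    "DenseNet201": "densenet201",
    "CaiT-S24": "cait_s24_224",
    "DeiT-Base Distilled": "deit_base_distilled_patch16_224",
}

models = {
    name: timm.create_model(timm_id, pretrained=True, num_classes=10)  # fresh 10-class head for CIFAR-10
    for name, timm_id in MODEL_IDS.items()
}
```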

Fine-Tuning

Fine-tuning is done by unfreezing only the last block of each model and training it on CIFAR-10, while all earlier layers keep their ImageNet-pre-trained weights. This approach evaluates the raw transfer learning capability of each architecture.
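A minimal sketch of this freezing strategy, assuming the timm-style models loaded above: every parameter is frozen, then gradients are re-enabled for the chosen final block and (by assumption) the newly initialized classification head. The attribute that holds the "last block" differs per architecture (e.g. `blocks[-1]` for DeiT/CaiT, the final dense block for DenseNet201), so the commented usage is illustrative only.

```python
import torch.nn as nn

def freeze_all_but_last_block(model: nn.Module, last_block: nn.Module, head: nn.Module) -> None:
    """Freeze every parameter, then re-enable training for the chosen block and the classifier head."""
    for param in model.parameters():
        param.requires_grad = False
    for param in last_block.parameters():
        param.requires_grad = True
    for param in head.parameters():
        param.requires_grad = True

# Illustrative usage for a timm DeiT model (attribute names are assumptions):
# deit = models["DeiT-Base Distilled"]
# freeze_all_but_last_block(deit, deit.blocks[-1], deit.get_classifier())
```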

Hyperparameters

  • Learning Rate: 0.0001
  • Epochs: 10
  • Batch Size: 32
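Put together, these hyperparameters map onto a training loop along the following lines. This builds on the sketches above; the choice of Adam and cross-entropy loss are assumptions, since the optimizer and loss are not stated in this README.

```python
# Sketch of the fine-tuning loop using the hyperparameters listed above.
# Adam and cross-entropy are assumptions; only unfrozen parameters are optimized.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models["DeiT-Base Distilled"].to(device)  # any of the three models from the loading sketch

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,  # Learning Rate: 0.0001
)

for epoch in range(10):  # Epochs: 10
    model.train()
    for images, labels in train_loader:  # batch size 32 set on the DataLoader above
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```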

Results

The results of the experiments are summarized in the following table:

| Model | Model type | Trainable parameters | Validation accuracy (paper, %) | Validation accuracy (my results, %) | Train accuracy (%) | Training time per epoch | Validation time per epoch |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DenseNet201 | CNN | 235,210 | 94.757 | 91.44 | 97.72 | 299.59 s | 53.12 s |
| DeiT-Base Distilled | Transformer | 7,103,252 | 96.450 | 96.44 | 99.87 | 626.44 s | 108.66 s |
| CaiT-S24 | Transformer | 1,775,376 | 96.00 | 96.78 | 99.89 | 590.62 s | 109.12 s |

Conclusion

The experiments indicate that the transformer-based models outperform the CNN baseline in transfer learning accuracy under this single-block fine-tuning setup. Specifically, DeiT-Base Distilled and CaiT-S24 achieve the highest validation accuracy, although their fine-tuned blocks contain more trainable parameters and take longer per epoch than DenseNet201's, highlighting the strong transfer learning capabilities of transformer architectures.

Contributing

Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or suggestions.

License

This project is licensed under the MIT License. See the LICENSE file for details.

References

  1. Malpure, D., Litake, O. and Ingle, R., 2021. Investigating transfer learning capabilities of vision transformers and CNNs by fine-tuning a single trainable block. arXiv preprint arXiv:2110.05270.
  2. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S. and Uszkoreit, J., 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
