Vision Transformer (ViT) for Image Classification

Overview

This repository contains a PyTorch implementation of a Vision Transformer (ViT) for image classification, using the Oxford-IIIT Pet dataset. Vision Transformers apply the transformer architecture, originally developed for NLP, to computer vision: an image is treated as a sequence of patches, and self-attention lets the model relate regions across the whole image.

In this implementation, we preprocess images from the Oxford-IIIT Pet dataset and feed them into a Vision Transformer for classification.

Requirements

To run this project, you will need to install the following Python packages:

  • torch: PyTorch deep learning framework.
  • torchvision: For datasets and image transformations.
  • einops: For readable tensor rearrangements, e.g. splitting images into patches.
  • matplotlib: For visualizing images.

You can install the required dependencies via pip:

pip install torch torchvision einops matplotlib

Dataset

We use the Oxford-IIIT Pet dataset, which contains images of 37 pet breeds with roughly 200 images per breed and large variation in scale, pose, and lighting, making it a good test case for image classification models.

You can download the dataset through torchvision.datasets as follows:

from torchvision.datasets import OxfordIIITPet
from torchvision.transforms import Compose, Resize, ToTensor
# Resize to a fixed size so each image splits into a whole number of patches
transform = Compose([Resize((224, 224)), ToTensor()])
dataset = OxfordIIITPet(root='data', download=True, transform=transform)
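
To feed batches of images into the model, the dataset can be wrapped in a standard PyTorch DataLoader; the batch size below is an illustrative choice, not a value taken from this repository.

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=32, shuffle=True)
images, labels = next(iter(loader))  # images: (32, 3, 224, 224), labels: breed indices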


Model Architecture

Vision Transformer (ViT)

The Vision Transformer (ViT) applies the transformer architecture to image data by splitting the input image into patches, linearly embedding each patch, and then applying self-attention mechanisms across the patches.

Key components of the ViT model (a code sketch of the first three components follows this list):

  • Image Patching: Images are split into smaller patches, each treated as a sequence element.
  • Linear Embedding: Each patch is linearly embedded into a fixed-dimensional space.
  • Positional Encoding: The model adds positional information to each patch, enabling it to learn spatial dependencies.
  • Transformer Layers: The core transformer model is used to process the embedded patches and extract relevant features.
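
As a rough sketch of how patching, linear embedding, and positional encoding fit together, the snippet below builds a patch-embedding module with einops; the PatchEmbedding name, patch size, and embedding dimension are illustrative assumptions rather than values taken from this repository.

import torch
import torch.nn as nn
from einops.layers.torch import Rearrange

class PatchEmbedding(nn.Module):
    # Hypothetical module: splits an image into patches, projects each patch into
    # a fixed-dimensional embedding, and adds a [CLS] token plus positional encodings.
    def __init__(self, in_channels=3, patch_size=16, emb_dim=128, img_size=224):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.to_patches = Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)',
                                    p1=patch_size, p2=patch_size)
        self.project = nn.Linear(patch_size * patch_size * in_channels, emb_dim)
        self.cls_token = nn.Parameter(torch.randn(1, 1, emb_dim))
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, emb_dim))

    def forward(self, x):
        patches = self.project(self.to_patches(x))      # (batch, num_patches, emb_dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)   # learnable classification token
        tokens = torch.cat([cls, patches], dim=1)
        return tokens + self.pos_embedding               # add positional information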

Transformer Encoder

The transformer encoder layer uses self-attention to compute relationships between patches and extract global information across the entire image.
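
One way to realize this stage is with PyTorch's built-in nn.TransformerEncoder, as sketched below on top of the patch tokens from the previous snippet; the depth, number of heads, and feed-forward width are placeholder values, not the settings used in this repository.

import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=128, nhead=8, dim_feedforward=256,
    dropout=0.1, batch_first=True, norm_first=True)  # pre-norm, as in the ViT paper
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# tokens: (batch, num_patches + 1, emb_dim) from the patch-embedding sketch above
# encoded = encoder(tokens)
# logits = nn.Linear(128, 37)(encoded[:, 0])  # classify from the [CLS] token (37 breeds)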

Visualizing Image Patches

Because the transformer treats an image as a sequence of fixed-size patches, visualizing the patch grid is a useful sanity check on the preprocessing step.
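
The sketch below shows one way to display the patch grid with einops and matplotlib; it assumes the 224×224 Resize transform from the Dataset section and an illustrative 16-pixel patch size.

import matplotlib.pyplot as plt
from einops import rearrange

image, _ = dataset[0]  # (3, 224, 224) tensor from the Dataset section
patches = rearrange(image, 'c (h p1) (w p2) -> (h w) p1 p2 c', p1=16, p2=16)

fig, axes = plt.subplots(14, 14, figsize=(7, 7))  # 224 / 16 = 14 patches per side
for ax, patch in zip(axes.flat, patches):
    ax.imshow(patch.numpy())
    ax.axis('off')
plt.show()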

Contributing

If you'd like to contribute to this project, please follow these steps:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature-branch).
  3. Make your changes.
  4. Commit your changes (git commit -m 'Add new feature').
  5. Push to the branch (git push origin feature-branch).
  6. Open a Pull Request.

Feel free to raise an issue if you find any bugs or have suggestions for improvements!
