Image Classification Models

Implementation of a few popular vision models in PyTorch.

Results

Results of some training runs on CIFAR-10. I train the models with the AdamW optimizer for 90 epochs, using a cosine decay learning rate schedule with 5 epochs of linear warm-up. Please note that the reported accuracies are far from what is possible with these models, as I just train them for a couple of epochs and don't finetune them at all. ;)

Paper                    Code       Params      Accuracy
ResNet                   resnet     175,594     90.8%
ConvNeXt                 convnext   398,730     76.2%
ViT                      vit        305,802     69.9%
Hierarchical Perceiver   hip        1,088,970   57.6%
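For reference, the warm-up plus cosine schedule described above can be sketched with PyTorch's LambdaLR. The model, learning rate, and batch below are illustrative placeholders, not the exact setup behind the numbers.

import math
import torch

warmup_epochs, total_epochs = 5, 90

def lr_lambda(epoch):
    # linear warm-up for the first epochs, then cosine decay to zero
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(8, 8)   # stand-in model for the sketch
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)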

Usage

You can train the models with

python3 main.py resnet --epochs 90 --batch-size 256 --warmup-epochs 10 --name exp1

A list of supported models can be found in the results section (code column).

Models

ResNet

He et al. (2016) introduced skip connections to build deeper models.

import torch
from models import ResNet

model = ResNet()

x = torch.randn((64, 3, 32, 32))
model(x).shape      # [64, 10] 
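As an illustration of the idea (not this repo's exact ResNet), a minimal residual block simply adds the input back onto the convolutional path:

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    # minimal residual block: output = relu(F(x) + x)
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # the skip connection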

ViT

Dosovitskiy et al. (2020) propose the Vision Transformer (ViT), which first patchifies the image and then simply applies a standard Transformer encoder from NLP.

import torch
from models import ViT

model = ViT()

x = torch.randn((64, 3, 32, 32))
model(x).shape      # [64, 10] 
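The patchify step can be sketched as a strided convolution that embeds each patch; the patch size and embedding dimension here are assumed values, not necessarily those used by this ViT:

import torch
import torch.nn as nn

patch_size, embed_dim = 4, 128
# one convolution with kernel = stride = patch size embeds each patch
patchify = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

x = torch.randn((64, 3, 32, 32))
tokens = patchify(x).flatten(2).transpose(1, 2)   # [64, 64, 128]: 8x8 patches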

Hierarchical Perceiver

Carreira et al. (2022) improve the efficiency of the Perceiver by making it hierarchical. To that end, the authors propose the HiP-Block, which divides the input sequence into groups and applies cross- and self-attention to each group independently. Stacking multiple of these blocks yields the respective hierarchy.

import torch
import yaml
from models import HierarchicalPerceiver

with open('configs/hip.yaml', 'r') as f:
    cfg = yaml.safe_load(f)
model = HierarchicalPerceiver(**cfg)

x = torch.randn((64, 3, 32, 32))
model(x).shape      # [64, 10] 
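A hedged sketch of the grouping idea behind the HiP-Block (not the repo's implementation): fold the groups into the batch dimension, then attend within each group independently.

import torch
import torch.nn as nn

B, N, D, G = 64, 256, 128, 4          # batch, sequence, dim, groups
tokens = torch.randn(B, N, D)
groups = tokens.reshape(B * G, N // G, D)   # [256, 64, 128]

attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
out, _ = attn(groups, groups, groups)       # per-group self-attention
out = out.reshape(B, N, D)                  # merge the groups back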

For this implementation I used standard 2D sinusoidal positional embeddings instead of learned ones. Furthermore, I only train the HiP encoder on classification. However, you can add the decoder simply by editing the HiP config file (add some more blocks with decreasing latent_dim and increasing sequence length). The proposed masked auto-encoder (MAE) pre-training is then quite straightforward.
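A minimal sketch of such 2D sinusoidal embeddings, assuming the row and column embeddings are simply concatenated (grid size and dimension are illustrative):

import math
import torch

def sinusoidal_1d(positions, dim):
    # classic sin/cos embedding of a 1D index
    freqs = torch.exp(-math.log(10000.0) * torch.arange(0, dim, 2) / dim)
    angles = positions[:, None] * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

h = w = 8                                   # 8x8 patch grid
dim = 128
rows = sinusoidal_1d(torch.arange(h).float(), dim // 2)   # [8, 64]
cols = sinusoidal_1d(torch.arange(w).float(), dim // 2)   # [8, 64]
pos = torch.cat([rows[:, None, :].expand(h, w, -1),
                 cols[None, :, :].expand(h, w, -1)], dim=-1).reshape(h * w, dim)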

ConvNeXt

Liu et al. (2022) gradually modernize a standard ResNet by adapting the training procedure (optimizer, augmentations & regularization), the macro design (stage compute ratio, patchify stem, depthwise separable convolutions & inverted bottleneck), and the micro design (GELU, fewer activation and normalization functions, layer normalization & convolutional downsampling).

import torch
from models import ConvNeXt

model = ConvNeXt()

x = torch.randn((64, 3, 224, 224))
model(x).shape      # [64, 1000] 
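For illustration, a ConvNeXt-style block combining these micro-design choices might look as follows (a sketch, not this repo's exact block; layer scale and stochastic depth are omitted):

import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    # depthwise 7x7 conv, layer norm, inverted bottleneck MLP with GELU,
    # plus a residual connection
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # expand (inverted bottleneck)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)   # project back

    def forward(self, x):                        # x: [B, C, H, W]
        skip = x
        x = self.dwconv(x).permute(0, 2, 3, 1)   # channels-last for LN/MLP
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return skip + x.permute(0, 3, 1, 2)      # back to [B, C, H, W]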
