Implementation of a few popular vision models in PyTorch.
Results of some training runs on CIFAR-10. I train the models with the AdamW optimizer for 90 epochs, using a cosine decay learning rate schedule with 5 epochs of linear warm-up. Please note that the reported accuracies are far from what is possible with these models, as I only train them for a limited number of epochs and don't tune them at all. ;)
Paper | Code | Params | Accuracy |
---|---|---|---|
ResNet | resnet | 175,594 | 90.8% |
ConvNeXt | convnext | 398,730 | 76.2% |
ViT | vit | 305,802 | 69.9% |
Hierarchical Perceiver | hip | 1,088,970 | 57.6% |
You can train the models with:

```bash
python3 main.py resnet --epochs 90 --batch-size 256 --warmup-epochs 10 --name exp1
```
A list of supported models can be found in the results section (code column).
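The schedule described above (linear warm-up followed by cosine decay) could be sketched with standard PyTorch schedulers roughly as follows; the model, learning rate, and weight decay below are placeholders, not the exact settings behind the results table.

```python
import torch

# Placeholder model and optimizer settings; the actual models live in models/.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

warmup_epochs, epochs = 5, 90
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, total_iters=warmup_epochs
)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs - warmup_epochs
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs]
)

for epoch in range(epochs):
    # ... one training epoch ...
    scheduler.step()
```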
He et al. (2016) introduced skip connections to build deeper models.
```python
import torch

from models import ResNet

model = ResNet()
x = torch.randn((64, 3, 32, 32))  # batch of CIFAR-10-sized images
model(x).shape  # [64, 10]
```
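The core idea is that each block learns a residual F(x) that is added to its input through a skip connection. A minimal sketch of such a block (not necessarily the exact layers used in this repo's `resnet`):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: y = relu(x + F(x))."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.f(x))  # skip connection
```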
Dosovitskiy et al. (2020) propose the Vision Transformer (ViT), which first patchifies the image and then simply applies the NLP Transformer encoder.
```python
import torch

from models import ViT

model = ViT()
x = torch.randn((64, 3, 32, 32))  # batch of CIFAR-10-sized images
model(x).shape  # [64, 10]
```
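Patchifying can be implemented as a strided convolution whose kernel size equals the patch size. A rough sketch (patch size and embedding dimension are illustrative, not necessarily what this repo's `vit` uses):

```python
import torch
import torch.nn as nn

patch_size, embed_dim = 4, 192  # illustrative values
patchify = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

x = torch.randn((64, 3, 32, 32))
tokens = patchify(x).flatten(2).transpose(1, 2)  # [64, 64, 192]: 8x8 patches as a token sequence
```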
Carreira et al. (2022) improve the efficiency of the Perceiver by making it hierarchical. To that end, the authors propose the HiP block, which divides the input sequence into groups and applies cross- and self-attention to each group independently. Stacking several of these blocks yields the hierarchy.
```python
import torch
import yaml

from models import HierarchicalPerceiver

# Load the model hyperparameters from the config file.
with open('configs/hip.yaml', 'r') as f:
    cfg = yaml.safe_load(f)

model = HierarchicalPerceiver(**cfg)
x = torch.randn((64, 3, 32, 32))
model(x).shape  # [64, 10]
```
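The grouping itself is essentially a reshape of the token sequence, so that the subsequent cross- and self-attention operate within each group independently. A rough sketch (group count and dimensions are illustrative):

```python
import torch

batch, seq_len, dim, groups = 64, 256, 128, 8  # illustrative values
tokens = torch.randn(batch, seq_len, dim)

# Split the sequence into groups; attention then runs separately per group.
grouped = tokens.reshape(batch * groups, seq_len // groups, dim)  # [512, 32, 128]
# ... cross-/self-attention over each group of 32 tokens ...
ungrouped = grouped.reshape(batch, seq_len, dim)  # merge the groups back
```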
For this implementation I used standard 2D sinusoidal positional embeddings instead of learned ones. Furthermore, I only train the HiP encoder on classification. However, you can add the decoder simply by editing the HiP config file (add some more blocks with decreasing latent_dim and increasing sequence length); the proposed masked auto-encoder (MAE) pre-training is then quite straightforward.
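To illustrate the idea only (the key names below are hypothetical; the actual schema is whatever configs/hip.yaml and the HierarchicalPerceiver constructor expect), the appended decoder blocks would roughly mirror the encoder blocks in reverse:

```python
# Hypothetical block list; it mirrors the idea, not the real config schema.
blocks = [
    # encoder: sequence length shrinks, latent_dim grows
    {"groups": 16, "latent_dim": 128, "seq_len": 256},
    {"groups": 4,  "latent_dim": 256, "seq_len": 64},
    # decoder (for MAE-style pre-training): decreasing latent_dim, increasing sequence length
    {"groups": 4,  "latent_dim": 256, "seq_len": 64},
    {"groups": 16, "latent_dim": 128, "seq_len": 256},
]
```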
Liu et al. (2022) gradually modernize a standard ResNet by adapting the training procedure (optimizer, augmentations & regularizations), the macro design (stage compute ratio, patchify stem, depthwise separable convolutions & inverted bottleneck), and the micro design (GELU, fewer activation and normalization functions, layer normalization & convolutional downsampling).
```python
import torch

from models import ConvNeXt

model = ConvNeXt()
x = torch.randn((64, 3, 224, 224))  # batch of ImageNet-sized images
model(x).shape  # [64, 1000]
```
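The resulting block combines a depthwise convolution, layer normalization, an inverted-bottleneck MLP with GELU, and a residual connection. A simplified sketch (layer sizes are illustrative, not necessarily identical to this repo's `convnext`):

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Simplified ConvNeXt block: depthwise conv -> LayerNorm -> inverted-bottleneck MLP."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise conv
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # inverted bottleneck: expand
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)  # project back down

    def forward(self, x):
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # to channels-last for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)  # back to channels-first
        return residual + x
```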