Implementation of a few popular vision models in PyTorch.
Results of some training runs on CIFAR-10. I train the models with the AdamW optimizer for 90 epochs, using a cosine decay learning rate schedule with 5 epochs of linear warm-up. Please note that the reported accuracies are far from what is possible with these models, as I only train them for a limited number of epochs and don't tune them at all. ;)
Paper | Code | Params | Accuracy |
---|---|---|---|
ResNet | resnet | 175,594 | 90.8% |
ConvNeXt | convnext | 398,730 | 76.2% |
ViT | vit | 305,802 | 69.9% |
Hierarchical Perceiver | hip | 1,088,970 | 57.6% |
You can train the models with:

```bash
python3 main.py resnet --epochs 90 --batch-size 256 --warmup-epochs 10 --name exp1
```
A list of supported models can be found in the results section (code column).
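The schedule described above (linear warm-up followed by cosine decay) could be sketched with standard PyTorch schedulers roughly as follows; the model, learning rate, and weight decay below are placeholders, not the exact settings behind the results table.

```python
import torch

# Placeholder model and optimizer settings; the actual models live in models/.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

warmup_epochs, epochs = 5, 90
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, total_iters=warmup_epochs
)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs - warmup_epochs
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs]
)

for epoch in range(epochs):
    # ... one training epoch ...
    scheduler.step()
```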
He et al. (2016) introduced skip connections to build deeper models.
```python
import torch

from models import ResNet

model = ResNet()
x = torch.randn((64, 3, 32, 32))  # batch of CIFAR-10-sized images
model(x).shape  # [64, 10]
```
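The core idea is that each block learns a residual F(x) that is added to its input through a skip connection. A minimal sketch of such a block (not necessarily the exact layers used in this repo's `resnet`):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: y = relu(x + F(x))."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.f(x))  # skip connection
```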
Dosovitskiy et al. (2020) propose the Vision Transformer (ViT), which first patchifies the image and then simply applies the NLP Transformer encoder.
```python
import torch

from models import ViT

model = ViT()
x = torch.randn((64, 3, 32, 32))  # batch of CIFAR-10-sized images
model(x).shape  # [64, 10]
```
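Patchifying can be implemented as a strided convolution whose kernel size equals the patch size. A rough sketch (patch size and embedding dimension are illustrative, not necessarily what this repo's `vit` uses):

```python
import torch
import torch.nn as nn

patch_size, embed_dim = 4, 192  # illustrative values
patchify = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

x = torch.randn((64, 3, 32, 32))
tokens = patchify(x).flatten(2).transpose(1, 2)  # [64, 64, 192]: 8x8 patches as a token sequence
```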
Carreira et al. (2022) improve the efficiency of the Perceiver by making it hierarchical. To that end, the authors propose the HiP block, which divides the input sequence into groups and applies cross- and self-attention to each group independently. Stacking several of these blocks yields the hierarchy.
```python
import torch
import yaml

from models import HierarchicalPerceiver

# Load the model hyperparameters from the config file.
with open('configs/hip.yaml', 'r') as f:
    cfg = yaml.safe_load(f)

model = HierarchicalPerceiver(**cfg)
x = torch.randn((64, 3, 32, 32))
model(x).shape  # [64, 10]
```
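The grouping itself is essentially a reshape of the token sequence, so that the subsequent cross- and self-attention operate within each group independently. A rough sketch (group count and dimensions are illustrative):

```python
import torch

batch, seq_len, dim, groups = 64, 256, 128, 8  # illustrative values
tokens = torch.randn(batch, seq_len, dim)

# Split the sequence into groups; attention then runs separately per group.
grouped = tokens.reshape(batch * groups, seq_len // groups, dim)  # [512, 32, 128]
# ... cross-/self-attention over each group of 32 tokens ...
ungrouped = grouped.reshape(batch, seq_len, dim)  # merge the groups back
```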
For this implementation I used standard 2D sinusoidal positional embeddings instead of learned ones. Furthermore, I only train the HiP encoder on classification. However, you can add the decoder simply by editing the HiP config file (add some more blocks with decreasing latent_dim and increasing sequence length); the proposed masked auto-encoder (MAE) pre-training is then quite straightforward.
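To illustrate the idea only (the key names below are hypothetical; the actual schema is whatever configs/hip.yaml and the HierarchicalPerceiver constructor expect), the appended decoder blocks would roughly mirror the encoder blocks in reverse:

```python
# Hypothetical block list; it mirrors the idea, not the real config schema.
blocks = [
    # encoder: sequence length shrinks, latent_dim grows
    {"groups": 16, "latent_dim": 128, "seq_len": 256},
    {"groups": 4,  "latent_dim": 256, "seq_len": 64},
    # decoder (for MAE-style pre-training): decreasing latent_dim, increasing sequence length
    {"groups": 4,  "latent_dim": 256, "seq_len": 64},
    {"groups": 16, "latent_dim": 128, "seq_len": 256},
]
```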
Liu et al. (2022) gradually modernize a standard ResNet by adapting the training procedure (optimizer, augmentations & regularizations), the macro design (stage compute ratio, patchify stem, depthwise separable convolutions & inverted bottleneck), and the micro design (GELU, fewer activation and normalization functions, layer normalization & convolutional downsampling).
```python
import torch

from models import ConvNeXt

model = ConvNeXt()
x = torch.randn((64, 3, 224, 224))  # batch of ImageNet-sized images
model(x).shape  # [64, 1000]
```
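The resulting block combines a depthwise convolution, layer normalization, an inverted-bottleneck MLP with GELU, and a residual connection. A simplified sketch (layer sizes are illustrative, not necessarily identical to this repo's `convnext`):

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Simplified ConvNeXt block: depthwise conv -> LayerNorm -> inverted-bottleneck MLP."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise conv
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # inverted bottleneck: expand
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)  # project back down

    def forward(self, x):
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # to channels-last for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)  # back to channels-first
        return residual + x
```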