Computer Vision Lab (up to 10 points)

Basic implementations

Check the basic implementations on CIFAR-10 in the Deep Learning Lab project here

Cifar10

Goals:

  • Implement some basic convolutional networks
  • Implement different data augmentation strategies (a sketch follows this list)
  • Implement a VGG model
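
For the data-augmentation goal, a minimal Keras sketch could look like the following; the specific transformations and their ranges are illustrative choices, not requirements of the lab:

```python
# Hedged sketch: random shifts and flips on CIFAR-10 with Keras.
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = x_train.astype("float32") / 255.0

datagen = ImageDataGenerator(
    width_shift_range=0.1,   # random horizontal shifts (10% of width)
    height_shift_range=0.1,  # random vertical shifts (10% of height)
    horizontal_flip=True,    # random left-right flips
)

# model.fit(datagen.flow(x_train, y_train, batch_size=128), epochs=50)
```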

Advanced topologies

  • Wide ResNet (1 point; a block sketch follows this list)

  • DenseNets (1 point)
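
As a hedged starting point for the Wide ResNet exercise, here is a sketch of a pre-activation residual block widened by a factor k; the function name and the value of k are assumptions made for illustration:

```python
# Hedged sketch of a wide residual block (Keras functional API).
from tensorflow.keras import layers

def wide_residual_block(x, filters, k=4, stride=1):
    """Pre-activation residual block widened by factor k (assumed helper)."""
    shortcut = x
    out = layers.BatchNormalization()(x)
    out = layers.ReLU()(out)
    out = layers.Conv2D(filters * k, 3, strides=stride, padding="same")(out)
    out = layers.BatchNormalization()(out)
    out = layers.ReLU()(out)
    out = layers.Conv2D(filters * k, 3, padding="same")(out)
    # Project the shortcut when the shapes no longer match
    if stride != 1 or shortcut.shape[-1] != filters * k:
        shortcut = layers.Conv2D(filters * k, 1, strides=stride)(shortcut)
    return layers.Add()([out, shortcut])
```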


Gender Recognition (3 points)

Images from the "Labeled Faces in the Wild" (LFW) dataset in realistic scenarios, poses and gestures. Faces are automatically detected and cropped to 100x100 RGB pixels.

Face example

Training set: 10585 images

Test set: 2648 images

Python Notebook: here

Python code: here

Goals:

  • Implement a model with >98% accuracy on the test set

  • Implement a model with >95% accuracy and fewer than 100K parameters (a sketch follows this list)

    Get some inspiration from the Paper
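
One hedged way to stay under 100K parameters is to rely on separable convolutions and global average pooling; the layer sizes below are assumptions for illustration, not the referenced paper's architecture:

```python
# Hedged sketch of a compact gender-recognition CNN (< 100K parameters).
from tensorflow.keras import layers, models

def build_small_net(input_shape=(100, 100, 3)):
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    for filters in (32, 64, 128):
        x = layers.SeparableConv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D()(x)
    x = layers.GlobalAveragePooling2D()(x)          # avoids large dense layers
    out = layers.Dense(1, activation="sigmoid")(x)  # binary gender output
    return models.Model(inp, out)

model = build_small_net()
model.summary()  # check that the total parameter count stays below 100K
```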


Car Model identification with bi-linear models (5 points)

Images of 20 different models of cars.

Cars

Training set: 791 images

Test set: 784 images

  • Version 1. Two different CNNs:

    Python code: here

  • Version 2. The same CNN (potentially a pre-trained model)

    Python code: here

Goals:

  • Understand the above Keras implementations:
    • Name the layers
    • Build several models
    • Understand tensor sizes
    • Connect models with operations (outer product; see the sketch below)
    • Create an image generator that returns a list of tensors
    • Create a data flow with multiple inputs for the model
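
To make the outer-product connection concrete, here is a hedged Keras sketch of bilinear pooling for version 2 (one shared backbone); the helper name bilinear_pool and the choice of VGG16 are illustrative assumptions:

```python
# Hedged sketch of the bilinear connection: outer product of two feature
# maps, summed over spatial positions, then signed-sqrt + L2 normalized.
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

def bilinear_pool(inputs):
    fa, fb = inputs                                    # (B, H, W, Ca), (B, H, W, Cb)
    ca, cb = fa.shape[-1], fb.shape[-1]
    fa = tf.reshape(fa, (-1, fa.shape[1] * fa.shape[2], ca))
    fb = tf.reshape(fb, (-1, fb.shape[1] * fb.shape[2], cb))
    phi = tf.matmul(fa, fb, transpose_a=True)          # (B, Ca, Cb): outer product summed over positions
    phi = tf.reshape(phi, (-1, ca * cb))
    phi = tf.sign(phi) * tf.sqrt(tf.abs(phi) + 1e-12)  # signed square root
    return tf.math.l2_normalize(phi, axis=-1)          # L2 normalization

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
features = base.output                                 # (B, 7, 7, 512)
phi = layers.Lambda(bilinear_pool)([features, features])
out = layers.Dense(20, activation="softmax")(phi)      # 20 car models
model = Model(base.input, out)
```

For version 1, the same pooling would instead take feature maps from two different backbones, and the image generator must then yield a list with one tensor per model input.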

Suggestions:

  • Load a pre-trained VGG16, ResNet, ... model
  • Connect this pre-trained model to form a bi-linear model
  • Train with the pre-trained weights frozen first, then unfreeze after some epochs and use a very low learning rate (sketched below)
  • Accuracy >65% is expected

Paper
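
Putting the suggestions together, a hedged two-phase schedule could look like this; it assumes model and base from the sketch above, train_flow and test_flow are hypothetical generators, and the epoch counts and learning rates are illustrative:

```python
# Hedged sketch of freeze-then-unfreeze fine-tuning.
from tensorflow.keras.optimizers import Adam

base.trainable = False  # phase 1: only the new bilinear head learns
model.compile(optimizer=Adam(1e-3), loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_flow, epochs=10, validation_data=test_flow)  # hypothetical data generators

base.trainable = True   # phase 2: fine-tune end to end at a very low learning rate
model.compile(optimizer=Adam(1e-5), loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_flow, epochs=20, validation_data=test_flow)
```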


Image classification with transformers vs. CNNs (5 points)

Global objective: Compare the classification performance of fine-tuned vision transformers vs. CNNs

Task description:

  • Download and set up one of the proposed datasets
  • Fine-tune a simple CNN (ResNet, EfficientNet, or MobileNet) as a baseline for comparison (free choice of CNN)
  • Fine-tune a vision transformer (ViT, Swin, MaxViT) of free choice using timm or Hugging Face; please select a model that fits your computational capabilities (a minimal fine-tuning sketch follows this list)
  • Compare: accuracy, training speed, inference time.
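
A minimal fine-tuning sketch with timm; the model name, the hyperparameters, and the 102-class head (matching Flowers-102) are assumptions, and train_loader refers to the dataset sketch after the list below:

```python
# Hedged sketch: fine-tune a pretrained ViT with timm.
import timm
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=102).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in train_loader:  # hypothetical DataLoader of 224x224 crops
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```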

Datasets: Students should choose one of these datasets (a loading sketch follows this list):

  • Flowers-102 (Oxford 102 Category Flowers)
    • Fine-grained classification with 102 flower species, small enough to train within a reasonable time.
    • Size: 8,189 images, 102 classes.
    • Difficulty: small inter-class variability, small dataset (risk of overfitting).
    • Dataset Link: Flowers-102
  • Stanford Cars
    • Medium-sized dataset with high inter-class similarity
    • Size: ~16,000 images, 196 classes
    • Difficulty: requires attention to detail, making it a good test for comparing ViTs and CNNs
    • Dataset Link: Available in torchvision (see instructions for download there)
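
A hedged setup sketch for Flowers-102 via torchvision; the normalization values are the usual ImageNet statistics, and Stanford Cars needs the manual download described in the torchvision docs:

```python
# Hedged sketch: load Flowers-102 with torchvision.
import torchvision.transforms as T
from torchvision.datasets import Flowers102
from torch.utils.data import DataLoader

transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])

train_set = Flowers102(root="data", split="train", transform=transform, download=True)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)
```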

Results:

  • Check the literature to know the expected accuracy
  • Organize results clearly: effect of learning rate, batch size, scheduling, freezing of layers, ...
  • Having competitive classification results will be a plus
  • It will also be a plus if different transformer architectures are compared (for instance, comparing ViT model sizes)

Visualizing attention maps in vision transformers (3 points)

Global objective:

Implement a visualization pipeline to display attention maps from different layers of a ViT

Task description:

  • Use a pretrained ViT, either from timm or Hugging Face
  • Get some pictures compatible with any ImageNet class (car, dog, cat, ...)
  • Write code to extract attention maps from the ViT model (a sketch follows this list)
  • Pass images through the model and extract the attention weights
  • Write code to display attention maps overlaid on the input image to see the regions that matter to the ViT.
  • Compare and analyze attention maps across different layers and different images.
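
A hedged sketch using the Hugging Face ViT, which can return attention weights directly via output_attentions=True; the model name, the image path, and the choice to average heads and use the last layer are illustrative, and other implementations may need hooks instead, as the notes below point out:

```python
# Hedged sketch: extract and overlay CLS-token attention from a ViT.
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTModel.from_pretrained("google/vit-base-patch16-224")

image = Image.open("dog.jpg").convert("RGB")   # any ImageNet-like picture (path is illustrative)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one tensor per layer, shape (batch, heads, tokens, tokens),
# with tokens = 1 CLS token + 14*14 patches for a 224x224 input and patch size 16.
attn = outputs.attentions[-1]                  # last layer
cls_attn = attn[0, :, 0, 1:].mean(0)           # CLS -> patch attention, averaged over heads
attn_map = cls_attn.reshape(14, 14)

# Upsample to the image size and overlay on the input
heat = F.interpolate(attn_map[None, None], size=(224, 224), mode="bilinear")[0, 0]
plt.imshow(image.resize((224, 224)))
plt.imshow(heat.numpy(), alpha=0.5, cmap="jet")
plt.axis("off")
plt.show()
```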

Notes:

  • The specific code for extracting attention maps may change depending on the specific implementation of the model
  • You can try and discuss different visualization techniques to better understand the attention patterns: for instance, visualizing the attention map of each head, or fusing the attention from several heads of the same layer.

Extras:

  • Implement attention rollout and compare it with the attention of individual layers (a sketch follows this list)
  • Compare attention maps with the activation maps (Grad-CAM) of a pretrained CNN
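
A hedged sketch of attention rollout (average the heads, add the identity for the residual connections, re-normalize, and multiply the per-layer matrices); it reuses outputs.attentions from the sketch above:

```python
# Hedged sketch of attention rollout.
import torch

def attention_rollout(attentions):
    """attentions: tuple of (1, heads, tokens, tokens) tensors, one per layer."""
    result = torch.eye(attentions[0].size(-1))
    for attn in attentions:
        a = attn[0].mean(0)                  # fuse heads by averaging
        a = a + torch.eye(a.size(-1))        # account for the residual connection
        a = a / a.sum(dim=-1, keepdim=True)  # re-normalize rows
        result = a @ result                  # propagate attention through the layers
    return result                            # (tokens, tokens)

rollout = attention_rollout(outputs.attentions)
cls_rollout = rollout[0, 1:].reshape(14, 14)  # CLS attention to the 14x14 patch grid
```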

Image colorization (3 points)

Cars

Code extracted and adapted from GitHub

Goals:

  • Understand the Keras implementation:
    • How to load the Inception network
    • How to merge the encoder output with the Inception result (a fusion sketch follows below)

Python code: here

Need help? Read
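
For the merge step, here is a hedged sketch of the fusion idea: a global Inception embedding is tiled across the encoder's spatial grid and concatenated with the encoder features. The shapes, the InceptionResNetV2 choice, and the layer sizes are assumptions made for illustration:

```python
# Hedged sketch of the encoder/Inception fusion used for colorization.
from tensorflow.keras import layers
from tensorflow.keras.applications import InceptionResNetV2

# Loaded once; at data-preparation time each grayscale image would be
# converted to RGB, resized to 299x299 and passed through this network
# to obtain a 1000-d embedding.
inception = InceptionResNetV2(weights="imagenet", include_top=True)

embed_input = layers.Input(shape=(1000,))        # the Inception embedding
encoder_out = layers.Input(shape=(32, 32, 256))  # encoder feature map (example size)

# Tile the global embedding to every spatial position, then concatenate
fusion = layers.RepeatVector(32 * 32)(embed_input)
fusion = layers.Reshape((32, 32, 1000))(fusion)
fusion = layers.concatenate([encoder_out, fusion], axis=-1)
fusion = layers.Conv2D(256, 1, activation="relu")(fusion)  # mix channels back down
```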

Image segmentation (4 points)

ISIC Melanoma Segmentation

Image Mask

Dataset available here:

ISIC_Melanoma

Exercise:

  • Implement a UNet for this task (a sketch follows this list).
  • Split the data: 80% training, 20% test
  • Report results on the test set
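
A hedged UNet sketch in Keras; the depths, filter counts, and input size are illustrative, and the output is a one-channel sigmoid mask:

```python
# Hedged sketch of a small UNet for binary melanoma segmentation.
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def build_unet(input_shape=(256, 256, 3)):
    inp = layers.Input(shape=input_shape)
    skips, x = [], inp
    for f in (32, 64, 128):                       # encoder: conv blocks + downsampling
        x = conv_block(x, f)
        skips.append(x)
        x = layers.MaxPooling2D()(x)
    x = conv_block(x, 256)                        # bottleneck
    for f, skip in zip((128, 64, 32), reversed(skips)):  # decoder: upsample + skip connection
        x = layers.Conv2DTranspose(f, 2, strides=2, padding="same")(x)
        x = layers.concatenate([x, skip])
        x = conv_block(x, f)
    out = layers.Conv2D(1, 1, activation="sigmoid")(x)   # per-pixel lesion probability
    return Model(inp, out)

model = build_unet()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```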

Other project?

You are welcome!
