This repository combines Dino V2 image embeddings with KMeans clustering to detect novelty in image data. The workflow has three stages: data preparation, Dino V2 embedding extraction, and KMeans clustering with additional novelty and surprise metrics.
- Data Preparation: Upload your image data to Google Drive and mount the Drive in your Google Colab notebook.
- Image Extraction: Extract the image files from the specified folder into the Colab working directory.
- Data List Creation: The 'data' list will store the file paths of all PNG files found in the specified directory and its subdirectories.
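The file listing step can be sketched with the standard library alone. This is a minimal illustration, not the repository's own code: the helper name `list_png_files` and the example folder path are hypothetical.

```python
from pathlib import Path

def list_png_files(root):
    """Return sorted paths of all .png files under root, including subdirectories."""
    # rglob walks the directory tree recursively; the pattern is matched
    # case-sensitively on Linux, so files named *.PNG would need a second pass.
    return sorted(str(p) for p in Path(root).rglob("*.png"))

# Example call (hypothetical path, adjust to where the images were extracted):
# data = list_png_files("/content/images")
```

Sorting keeps the order of `data` stable between runs, which helps when embeddings are later matched back to file paths.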
- CUDA Configuration: Create a symbolic link (soft link) to change the default CUDA version to CUDA 10, ensuring compatibility with the Dino V2 model.
- GPU Setup: Select the CUDA-enabled GPU as the compute device to leverage hardware acceleration.
- Requirements: Add the 'requirements.txt' file to your Colab directory and install the packages needed to compute Dino V2 embeddings. You can find the requirements file here.
- Model Selection: Choose the Dino V2 model to be used. This project uses 'dinov2_vitg14', the ViT-g/14 variant of Dino V2 (a giant Vision Transformer with a patch size of 14).
- Data Preprocessing: Preprocess the dataset, including resizing images, converting to tensors, and normalizing pixel values.
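A minimal sketch of the preprocessing step, using Pillow and NumPy. Several details are assumptions rather than the repository's settings: the 224-pixel target size (chosen because it is a multiple of the patch size 14), the ImageNet mean and standard deviation, and the helper name `preprocess`.

```python
import numpy as np
from PIL import Image

# ImageNet channel statistics, commonly used with pretrained vision models.
# Assumed here; the repository's actual values are not stated.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image, size=224):
    """Resize a PIL image and return a normalized float32 array of shape (3, size, size)."""
    image = image.convert("RGB").resize((size, size), Image.BICUBIC)
    pixels = np.asarray(image, dtype=np.float32) / 255.0   # scale to [0, 1]
    pixels = (pixels - IMAGENET_MEAN) / IMAGENET_STD        # per-channel normalization
    return pixels.transpose(2, 0, 1)                        # HWC -> CHW layout
```

The resulting array can be turned into a PyTorch tensor with `torch.from_numpy` and stacked into a batch before it is passed to the model.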
- Forward Pass: Perform a forward pass through the pre-trained Dino V2 model for each image in the dataset, storing the output embeddings in an 'embeddings' list along with image file paths.
- Conversion to NumPy: Convert the PyTorch tensor embeddings to NumPy arrays.
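The forward pass and the conversion to NumPy might look like the sketch below. The helper `embed_batch` is hypothetical, and the model is passed in as an argument so that the function does not depend on how it was loaded. The commented usage assumes the model is loaded through PyTorch Hub from the `facebookresearch/dinov2` repository, which downloads the weights on first use.

```python
import numpy as np
import torch

def embed_batch(model, batch, device="cpu"):
    """Run one forward pass and return the embeddings as a NumPy array of shape (B, D)."""
    model = model.to(device).eval()        # eval mode disables dropout and similar layers
    with torch.no_grad():                  # inference only, no gradients needed
        features = model(batch.to(device))
    return features.cpu().numpy()          # move to CPU before converting to NumPy

# Usage in Colab (assumed; downloads weights, so it is not executed here):
# device = "cuda" if torch.cuda.is_available() else "cpu"
# model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitg14")
# embeddings = embed_batch(model, batch, device)   # batch: float tensor (B, 3, 224, 224)
```

Calling `.cpu()` before `.numpy()` matters on a GPU runtime, because a tensor that lives on the GPU cannot be converted to a NumPy array directly.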
- Embedding Storage: Save the embeddings to a folder in your Colab environment for future use in novelty prediction and clustering.
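One possible way to store the embeddings is one '.npy' file per image, which matches the later step that lists '.npy' files. The helper name `save_embeddings` and the naming scheme (image file stem plus `.npy`) are assumptions for illustration.

```python
import os
import numpy as np

def save_embeddings(paths, embeddings, out_dir):
    """Write one .npy file per image, named after the image file, and return the new paths."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for path, emb in zip(paths, embeddings):
        stem = os.path.splitext(os.path.basename(path))[0]
        target = os.path.join(out_dir, stem + ".npy")
        np.save(target, np.asarray(emb))
        saved.append(target)
    return saved
```

Note that images with the same file name in different subdirectories would overwrite each other under this naming scheme; including the parent folder in the file name avoids that.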
- Data Upload: Upload your Dino embeddings data to Google Drive and link it to your Google Colab notebook.
- Embedding Extraction: Extract the embedding files from the specified folder in the Colab working directory.
- Data List Creation: The 'data' list contains the file paths of '.npy' files in the specified directory and subdirectories.
- Data Splitting: Split the data into training and test datasets, using an 80/20 ratio. Fit the KMeans model on the training set.
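A minimal sketch of the split and the initial fit, assuming scikit-learn is used and the embeddings have already been stacked into one array of shape (N, D). The helper name `split_and_fit`, the initial cluster count, and the random seed are hypothetical choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

def split_and_fit(embeddings, n_clusters=3, seed=0):
    """Split embeddings 80/20 and fit KMeans on the training part only."""
    train, test = train_test_split(embeddings, test_size=0.2, random_state=seed)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(train)
    return kmeans, train, test
```

Fitting on the training part only keeps the test embeddings unseen, so distances measured on them later reflect how well the clusters generalize.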
- PCA Visualization: Visualize clustering results using Principal Component Analysis (PCA) based on the training set.
- Optimal Cluster Count: Determine the optimal number of clusters (k) using the elbow method.
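The elbow method fits KMeans for a range of k values and looks for the point where the within-cluster sum of squares (inertia) stops dropping sharply. The sketch below assumes scikit-learn. The elbow is usually read off a plot; the `pick_elbow` heuristic (largest second difference of the inertia curve) is an assumption added for illustration, and both helper names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_inertias(train, k_values, seed=0):
    """Fit KMeans for each k and return the inertia (within-cluster sum of squares)."""
    return [KMeans(n_clusters=k, n_init=10, random_state=seed).fit(train).inertia_
            for k in k_values]

def pick_elbow(k_values, inertias):
    """Heuristic (assumed): return the k where the inertia curve bends most sharply.

    Expects evenly spaced k_values with at least three entries.
    """
    curvature = np.diff(inertias, n=2)               # one value per interior k
    return k_values[int(np.argmax(curvature)) + 1]   # +1 maps back to the interior k
```

Plotting `inertias` against `k_values` is still worthwhile, since the heuristic can be misled when the curve has no clear bend.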
- KMeans Re-Run: Re-run the KMeans model on the training set with the optimal k value.
- Label Generation: Apply the trained KMeans model to generate labels for the test dataset.
- Visualization: Plot a bar graph to display the distribution of clusters based on the test labels.
- Average Distance Calculation: Calculate the average distance of samples to their respective cluster centroids.
- Novelty Detection: Identify novel samples based on distances from their cluster centroids.
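The distance and novelty steps can be sketched in NumPy as follows. The exact novelty rule is not stated above, so the threshold used here (mean plus two standard deviations of the training distances) is an assumption, as are the helper names.

```python
import numpy as np

def centroid_distances(samples, centroids, labels):
    """Euclidean distance from each sample to the centroid of its assigned cluster."""
    return np.linalg.norm(samples - centroids[labels], axis=1)

def novelty_mask(train_dist, test_dist, n_std=2.0):
    """Flag test samples that lie unusually far from their centroid.

    Assumed rule: a sample is novel if its distance exceeds
    mean + n_std * std of the training distances.
    """
    threshold = train_dist.mean() + n_std * train_dist.std()
    return test_dist > threshold
```

With a fitted scikit-learn model, `centroids` would be `kmeans.cluster_centers_`, and `labels` would be `kmeans.labels_` for the training set or `kmeans.predict(test)` for the test set. The average distance is then `centroid_distances(...).mean()`.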
- Pairwise Distance Calculation: Calculate pairwise distances between centroids in the KMeans model.
- Average Centroid Distance: Calculate the average of pairwise distances between centroids.
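A short NumPy sketch of the two centroid distance steps, assuming Euclidean distance. The helper names are hypothetical; `centroids` stands for an array of shape (k, D) such as `kmeans.cluster_centers_`.

```python
import numpy as np

def pairwise_centroid_distances(centroids):
    """Return the (k, k) matrix of Euclidean distances between centroids."""
    diff = centroids[:, None, :] - centroids[None, :, :]   # broadcast to (k, k, D)
    return np.linalg.norm(diff, axis=-1)

def average_centroid_distance(centroids):
    """Average distance over all unordered pairs of distinct centroids."""
    dist = pairwise_centroid_distances(centroids)
    upper = np.triu_indices(len(centroids), k=1)           # each pair counted once
    return dist[upper].mean()
```

Taking only the upper triangle excludes the zero diagonal and avoids counting each pair twice, so the average reflects the typical separation between clusters.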
- Test Sample Assignment: Assign new test samples to the nearest cluster centroids.
- Surprise Detection: Identify surprising samples based on distances from cluster centroids.
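The assignment and surprise steps might be sketched as below. The surprise rule is not spelled out above, so the one used here is an assumption: a sample counts as surprising when it lies farther from its nearest centroid than the average distance between centroids. The helper names are hypothetical.

```python
import numpy as np

def assign_to_nearest(samples, centroids):
    """Return the index of the nearest centroid for each sample and the distance to it."""
    dist = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=-1)  # (N, k)
    labels = dist.argmin(axis=1)
    return labels, dist[np.arange(len(samples)), labels]

def surprise_mask(nearest_dist, avg_centroid_dist):
    """Assumed rule: surprising if farther from every centroid than clusters are from each other."""
    return nearest_dist > avg_centroid_dist
```

Under this rule, a surprising sample is one that does not sit close to any existing cluster, as opposed to a novel sample, which is merely far from the centroid of the cluster it was assigned to.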
Together, these steps provide image embedding, clustering, and novelty detection using the Dino V2 model combined with KMeans clustering and surprise detection metrics.