LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
Augmented-Waste-Classifier-SigLIP2 is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify images into waste categories using the SiglipForImageClassification architecture.
Facial-Emotion-Detection-SigLIP2 is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify facial images into emotion categories using the SiglipForImageClassification architecture.
Age-Classification-SigLIP2 is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to predict the age group of a person from an image using the SiglipForImageClassification architecture.
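Checkpoints like this one share a common inference recipe: load the fine-tuned weights with SiglipForImageClassification and the matching image processor, take the argmax over the logits, and map the index back to a label. A minimal sketch with Hugging Face transformers (the repo ID below is a placeholder standing in for whichever checkpoint you want to run):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, SiglipForImageClassification

# Placeholder repo ID: replace with the actual Hub checkpoint you want to run.
model_id = "your-org/Age-Classification-SigLIP2"

processor = AutoImageProcessor.from_pretrained(model_id)
model = SiglipForImageClassification.from_pretrained(model_id)
model.eval()

# Preprocess a single image and run a forward pass without gradients.
image = Image.open("photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring class index back to its label string.
predicted_id = logits.argmax(-1).item()
print(model.config.id2label[predicted_id])
```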
A SigLIP2-based image classification model designed to classify handwritten digits (0-9).
Fashion-Mnist-SigLIP2 is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify images into Fashion-MNIST categories using the SiglipForImageClassification architecture.
Traffic-Density-Classification is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify images into traffic density categories using the SiglipForImageClassification architecture.
Multisource-121-DomainNet is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify images into 121 domain categories using the SiglipForImageClassification architecture.
Gender-Classifier-Mini is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify images based on gender using the SiglipForImageClassification architecture.
Deepfake vs Real is a dataset designed for image classification, distinguishing between deepfake and real images.
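A dataset like this can be pulled straight from the Hub with the datasets library for fine-tuning or evaluation. A small sketch, assuming the dataset is hosted on the Hugging Face Hub with an image column and an integer label column (the repo ID and split name are placeholders):

```python
from datasets import load_dataset

# Placeholder repo ID and split; substitute the actual Hub dataset.
dataset = load_dataset("your-org/deepfake-vs-real", split="train")

# Each example is assumed to hold a PIL image and an integer label (deepfake vs. real).
example = dataset[0]
print(example["image"].size, example["label"])
```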
Clipart-126-DomainNet is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify clipart images into 126 domain categories using the SiglipForImageClassification architecture.
Human-vs-NonHuman-Detection is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify images as either human or non-human using the SiglipForImageClassification architecture.
Bird-Species-Classifier-526 is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify bird images into 526 species categories using the SiglipForImageClassification architecture.
Painting-126-DomainNet is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify paintings into 126 domain categories using the SiglipForImageClassification architecture.
Fire-Detection-Siglip2 is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to detect fire, smoke, or normal conditions using the SiglipForImageClassification architecture.
Hand-Gesture-2-Robot is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to recognize hand gestures and map them to specific robot commands using the SiglipForImageClassification architecture.
Gym-Workout-Classifier-SigLIP2 is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify images into gym workout exercise categories using the SiglipForImageClassification architecture.
Alphabet-Sign-Language-Detection is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify images into sign language alphabet categories using the SiglipForImageClassification architecture.
SAT-Landforms-Classifier is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify satellite images into different landform categories using the SiglipForImageClassification architecture.
Food-101-93M is a fine-tuned image classification model built on top of google/siglip2-base-patch16-224 using the SiglipForImageClassification architecture. It is trained to classify food images into one of 101 popular dishes, derived from the Food-101 dataset.
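For quick experiments, a checkpoint like this can also be driven through the transformers image-classification pipeline, which handles preprocessing and label mapping internally. A short sketch (again, the repo ID is a placeholder, not a confirmed Hub name):

```python
from transformers import pipeline

# Placeholder repo ID: point this at the actual Food-101 checkpoint on the Hub.
classifier = pipeline("image-classification", model="your-org/Food-101-93M")

# Prints the top three predicted dishes with their confidence scores.
for prediction in classifier("meal.jpg", top_k=3):
    print(f"{prediction['label']}: {prediction['score']:.3f}")
```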