This repository is the official PyTorch implementation of the TMLR 2026 paper: Let's Roll a BiFTA: Bi-refinement for Fine-grained Text-visual Alignment in Vision-Language Models
Abstract: Recent research has shown that aligning fine-grained text descriptions with localized image patches can significantly improve the zero-shot performance of pre-trained vision-language models (e.g., CLIP). However, we find that both fine-grained text descriptions and localized image patches often contain redundant information, making text-visual alignment less effective. In this paper, we tackle this issue from two perspectives: view refinement and description refinement, termed Bi-refinement for Fine-grained Text-visual Alignment (BiFTA). View refinement removes redundant image patches with high Intersection over Union (IoU) ratios, resulting in more distinctive visual samples. Description refinement removes redundant text descriptions with high pairwise cosine similarity, ensuring greater diversity among the remaining descriptions. BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP, justifying the necessity of removing redundant information in text-visual alignment.
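The two refinement steps described above can be sketched as greedy filters. The snippet below is a minimal illustration, not the repository's actual implementation: the box format (x1, y1, x2, y2), the thresholds, and the use of raw NumPy embeddings are all assumptions.

```python
import numpy as np

def iou(a, b):
    # Intersection over Union of two boxes given as (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def refine_views(boxes, iou_thresh=0.5):
    # View refinement: greedily keep a patch only if its IoU with every
    # already-kept patch stays below the threshold (hypothetical value).
    kept = []
    for box in boxes:
        if all(iou(box, k) < iou_thresh for k in kept):
            kept.append(box)
    return kept

def refine_descriptions(embeddings, sim_thresh=0.9):
    # Description refinement: greedily keep a description embedding only if
    # its cosine similarity to every kept one stays below the threshold.
    kept = []
    for e in embeddings:
        e = e / np.linalg.norm(e)
        if all(float(e @ k) < sim_thresh for k in kept):
            kept.append(e)
    return kept
```

In both cases, pruning near-duplicates leaves a smaller but more diverse set of views and descriptions for alignment.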
Requirements:
- Python 3.12.3
- CUDA 12.2.0
- PyTorch 2.3.0
Installation:
conda create -n bifta python=3.12
conda activate bifta
pip install -r requirements.txt

Modify `data_path` in the config files under `configs/` to point to your data location.
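The config file layout is not shown here; a hypothetical example of the `data_path` entry (file name and any other keys are assumptions):

```yaml
# configs/imagenet.yaml (hypothetical file name)
data_path: /path/to/your/datasets  # point this at your local data root
```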
Supported datasets:
- ImageNet
- ImageNet-V2
- CUB-200-2011
- Oxford Pets
- DTD
- Food-101
- Places365
Run:
python main.py --dataset [dataset] --seed [seed] --model_size [model_size]

This repo builds upon: