This repository is the official PyTorch implementation of the TMLR 2026 paper: Let's Roll a BiFTA: Bi-refinement for Fine-grained Text-visual Alignment in Vision-Language Models
Abstract: Recent research has shown that aligning fine-grained text descriptions with localized image patches can significantly improve the zero-shot performance of pre-trained vision-language models (e.g., CLIP). However, we find that both fine-grained text descriptions and localized image patches often contain redundant information, making text-visual alignment less effective. In this paper, we tackle this issue from two perspectives: view refinement and description refinement, termed Bi-refinement for Fine-grained Text-visual Alignment (BiFTA). View refinement removes redundant image patches with high Intersection over Union (IoU) ratios, resulting in more distinctive visual samples. Description refinement removes redundant text descriptions with high pairwise cosine similarity, ensuring greater diversity among the remaining descriptions. BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP, justifying the necessity of removing redundant information in text-visual alignment.
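The two refinement steps described above can be sketched as greedy filters. The snippet below is a minimal illustration, not the repository's actual implementation: the box format (x1, y1, x2, y2), the thresholds, and the use of raw NumPy embeddings are all assumptions.

```python
import numpy as np

def iou(a, b):
    # Intersection over Union of two boxes given as (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def refine_views(boxes, iou_thresh=0.5):
    # View refinement: greedily keep a patch only if its IoU with every
    # already-kept patch stays below the threshold (hypothetical value).
    kept = []
    for box in boxes:
        if all(iou(box, k) < iou_thresh for k in kept):
            kept.append(box)
    return kept

def refine_descriptions(embeddings, sim_thresh=0.9):
    # Description refinement: greedily keep a description embedding only if
    # its cosine similarity to every kept one stays below the threshold.
    kept = []
    for e in embeddings:
        e = e / np.linalg.norm(e)
        if all(float(e @ k) < sim_thresh for k in kept):
            kept.append(e)
    return kept
```

In both cases, pruning near-duplicates leaves a smaller but more diverse set of views and descriptions for alignment.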
Requirements:
- Python 3.12.3
- CUDA 12.2.0
- PyTorch 2.3.0
Installation:
conda create -n bifta python=3.12
conda activate bifta
pip install -r requirements.txt

Modify `data_path` in the config files under `configs/` to point to your data location.
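The config file layout is not shown here; a hypothetical example of the `data_path` entry (file name and any other keys are assumptions):

```yaml
# configs/imagenet.yaml (hypothetical file name)
data_path: /path/to/your/datasets  # point this at your local data root
```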
Supported datasets:
- ImageNet
- ImageNet-V2
- CUB-200-2011
- Oxford Pets
- DTD
- Food-101
- Places365
Run:
python main.py --dataset [dataset] --seed [seed] --model_size [model_size]

This repo builds upon: