This repository is the official implementation of FG-CLIP and FG-CLIP 2, a new generation of text-image cross-modal models with strong fine-grained understanding. FG-CLIP 2 supports both Chinese and English, and across 29 datasets and 8 diverse task types it outperforms strong baselines including SigLIP 2 and MetaCLIP 2, achieving state-of-the-art results in both languages.
FG-CLIP 2: A Bilingual Fine-grained Vision-language Alignment Model
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)
FG-CLIP: Fine-Grained Visual and Textual Alignment (code branch: v1.0)
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)
- 🚀 [2025/10/14] We released the FG-CLIP 2 code and model weights.
- 🚀 [2025/10/14] We released the paper FG-CLIP 2: A Bilingual Fine-grained Vision-language Alignment Model.
- 🚀 [2025/09/29] We open-sourced the MCP server implementation for FG-CLIP; see FGCLIP-MCP for details.
- 🚀 [2025/07/29] We provide API access to the FG-CLIP 2 base model, which significantly outperforms FG-CLIP; see research.360.cn for details.
- 🚀 [2025/07/09] We created two demos, for fine-grained retrieval and dense feature display respectively.
- 🚀 [2025/05/09] We uploaded the model to 🤗 Hugging Face (https://huggingface.co/qihoo360/fg-clip-large) for easy use!
- 🚀 [2025/05/09] We updated the FG-CLIP GitHub repository; you can now try out our model!
- 🚀 [2025/05/09] We released the paper FG-CLIP: Fine-Grained Visual and Textual Alignment.
- 🚀 [2025/05/02] FG-CLIP was accepted by ICML'25.
Our method adopts a two-stage hierarchical learning framework that progressively strengthens vision-language alignment, moving from global semantics to fine-grained detail.
Stage 1: Global Semantic Alignment
We start from large-scale image-text pairs in which each pair carries a short caption (a concise scene-level description) and a long caption (rich contextual detail). Training on this bilingual corpus yields strong global alignment and lays a solid foundation for cross-modal understanding in both English and Chinese.
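As a rough illustration of this stage (not the official implementation), the sketch below computes a symmetric CLIP-style contrastive loss for one batch of paired global embeddings; applying it to both the short- and long-caption embeddings of each image is an assumption made here for illustration.

import torch
import torch.nn.functional as F

def global_contrastive_loss(image_emb, text_emb, logit_scale):
    # image_emb, text_emb: (N, D) L2-normalized global embeddings of matched image-text pairs.
    logits = logit_scale * image_emb @ text_emb.T                       # (N, N) similarity matrix
    targets = torch.arange(image_emb.size(0), device=image_emb.device)  # diagonal pairs are positives
    # Symmetric InfoNCE: image-to-text plus text-to-image cross-entropy.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))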
Stage 2: Fine-grained Vision-Language Learning
Building on the globally aligned representations, we introduce region-level supervision and several fine-grained objectives to strengthen local correspondence (a sketch of the region-text term follows the list below). Specifically:
- Fine-grained visual learning: region-text alignment between region features extracted via RoIAlign and phrase-level captions.
- Fine-grained textual learning: hard negative samples generated by attribute perturbation, used to distinguish subtle textual differences.
- Cross-modal ranking loss with global threshold synchronization: a ranking loss with a dynamic margin, in which a globally synchronized threshold enables stable hard-negative mining.
- Textual intra-modal contrastive loss: contrastive learning within a single language to separate semantically close but distinct region descriptions.
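The region-text alignment term in the first bullet can be pictured with the following minimal sketch, which assumes region features are pooled from the visual feature map with RoIAlign and contrasted against phrase embeddings of the same dimension; the shapes, coordinate convention, and symmetric InfoNCE form are illustrative assumptions, not the official implementation.

import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def region_text_alignment_loss(feature_map, boxes, phrase_emb, logit_scale):
    # feature_map: (B, D, H, W) dense visual features; boxes: (K, 5) RoIs as (batch_idx, x1, y1, x2, y2)
    # in feature-map coordinates; phrase_emb: (K, D) L2-normalized phrase-level caption embeddings.
    region_feat = roi_align(feature_map, boxes, output_size=(1, 1), aligned=True)  # (K, D, 1, 1)
    region_feat = F.normalize(region_feat.flatten(1), dim=-1)                      # (K, D)
    logits = logit_scale * region_feat @ phrase_emb.T                              # (K, K) region-phrase scores
    targets = torch.arange(region_feat.size(0), device=region_feat.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))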
conda create -n FGCLIP2 python=3.10 -y
conda activate FGCLIP2
cd FG-CLIP && pip install -e .
| Model | Vision Encoder | Checkpoint | Demo |
|---|---|---|---|
| FG-CLIP-Base | vit-base-patch16-224 | 🤗 Huggingface | Retrieval & Dense Feature |
| FG-CLIP-Large | vit-large-patch14-336 | 🤗 Huggingface | |
| FG-CLIP2-Base | vit-base-patch16 | 🤗 Huggingface | Retrieval & Dense Feature |
| FG-CLIP2-Large | vit-large-patch16 | 🤗 Huggingface | |
| FG-CLIP2-So400m | vit-so400m-patch16 | 🤗 Huggingface | |
import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    AutoModelForCausalLM,
)

# Path to the local FG-CLIP 2 checkpoint (or the corresponding Hugging Face repo id).
model_root = "fgclip2-base-patch16"
model = AutoModelForCausalLM.from_pretrained(model_root, trust_remote_code=True).cuda()
device = model.device

tokenizer = AutoTokenizer.from_pretrained(model_root)
image_processor = AutoImageProcessor.from_pretrained(model_root)
def determine_max_value(image):
    # Choose a patch budget (max_num_patches) from the number of 16x16 patches the image yields.
    w, h = image.size
    max_val = (w // 16) * (h // 16)
    if max_val > 784:
        return 1024
    elif max_val > 576:
        return 784
    elif max_val > 256:
        return 576
    elif max_val > 128:
        return 256
    else:
        return 128
img_root = "cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image_input = image_processor(images=image, max_num_patches=determine_max_value(image), return_tensors="pt").to(device)
# NOTE: short captions use max_length=64 and walk_type="short" (the default);
#       long captions use max_length=196 and walk_type="long".
captions = [
"一个简约风格的卧室角落,黑色金属衣架上挂着多件米色和白色的衣物,下方架子放着两双浅色鞋子,旁边是一盆绿植,左侧可见一张铺有白色床单和灰色枕头的床。",
"一个简约风格的卧室角落,黑色金属衣架上挂着多件红色和蓝色的衣物,下方架子放着两双黑色高跟鞋,旁边是一盆绿植,左侧可见一张铺有白色床单和灰色枕头的床。",
"一个简约风格的卧室角落,黑色金属衣架上挂着多件米色和白色的衣物,下方架子放着两双运动鞋,旁边是一盆仙人掌,左侧可见一张铺有白色床单和灰色枕头的床。",
"一个繁忙的街头市场,摊位上摆满水果,背景是高楼大厦,人们在喧闹中购物。"
]
captions = [caption.lower() for caption in captions]
caption_input = tokenizer(captions, padding="max_length", max_length=196, truncation=True, return_tensors="pt").to(device)
with torch.no_grad():
    image_feature = model.get_image_features(**image_input)
    text_feature = model.get_text_features(**caption_input, walk_type="long")

# Cosine similarities, scaled and shifted by the learned logit scale and bias.
image_feature = image_feature / image_feature.norm(p=2, dim=-1, keepdim=True)
text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
logits_per_image = image_feature @ text_feature.T
logit_scale, logit_bias = model.logit_scale.to(text_feature.device), model.logit_bias.to(text_feature.device)
logits_per_image = logits_per_image * logit_scale.exp() + logit_bias
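To rank the candidate captions, the scaled-and-shifted logits can be turned into match scores; assuming the sigmoid (SigLIP-style) head implied by logit_scale and logit_bias above, a minimal follow-up looks like this:

# Turn the biased logits into per-caption match probabilities and print a ranking.
probs = torch.sigmoid(logits_per_image)[0]
for caption, prob in sorted(zip(captions, probs.tolist()), key=lambda x: -x[1]):
    print(f"{prob:.4f}  {caption}")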
import math
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

def resize_short_edge(image, target_size=2048):
    # Assumed helper (not shown in the original snippet): scale the image so its
    # shorter edge equals target_size while keeping the aspect ratio.
    w, h = image.size
    scale = target_size / min(w, h)
    return image.resize((round(w * scale), round(h * scale)), Image.BICUBIC)

img_root = "cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = resize_short_edge(image, target_size=2048)
image_input = image_processor(images=image, max_num_patches=16384, return_tensors="pt").to(device)
captions = ["电脑","黑猫","窗户","window","white cat","book"]
with torch.no_grad():
    dense_image_feature = model.get_image_dense_feature(**image_input)

    # Keep only the tokens that correspond to real image patches (drop padding tokens).
    spatial_values = image_input["spatial_shapes"][0]
    real_h = spatial_values[0].item()
    real_w = spatial_values[1].item()
    real_pixel_tokens_num = real_w * real_h
    dense_image_feature = dense_image_feature[0][:real_pixel_tokens_num]

    captions = [caption.lower() for caption in captions]
    caption_input = tokenizer(captions, padding="max_length", max_length=64, truncation=True, return_tensors="pt").to(device)
    text_feature = model.get_text_features(**caption_input, walk_type="box")

    # Cosine similarity between every patch token and every caption.
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
    dense_image_feature = dense_image_feature / dense_image_feature.norm(p=2, dim=-1, keepdim=True)
    similarity = dense_image_feature @ text_feature.T
    similarity = similarity.cpu()
num_classes = len(captions)
cols = 3
rows = (num_classes + cols - 1) // cols
aspect_ratio = real_w / real_h
fig_width_inch = 3 * cols
fig_height_inch = fig_width_inch / aspect_ratio * rows / cols
fig, axes = plt.subplots(rows, cols, figsize=(fig_width_inch, fig_height_inch))
fig.subplots_adjust(wspace=0.01, hspace=0.01)
if num_classes == 1:
    axes = [axes]
else:
    axes = axes.flatten()

# One heatmap per caption: reshape the per-patch similarities back to the patch grid.
for cls_index in range(num_classes):
    similarity_map = similarity[:, cls_index].cpu().numpy()
    show_image = similarity_map.reshape((real_h, real_w))
    ax = axes[cls_index]
    ax.imshow(show_image, cmap='viridis', aspect='equal')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.axis('off')

# Hide any unused subplots.
for idx in range(num_classes, len(axes)):
    axes[idx].axis('off')

savename = "FGCLIP2_dfcolor_cat_all_2K.png"
plt.savefig(savename, dpi=150, bbox_inches='tight', pad_inches=0.05)
plt.close()
We provide code for Stage 2 training on the 🤗 FineHARD dataset. FineHARD contains 12 million images, 40 million bounding boxes with fine-grained region captions, and 10 million hard negative samples.
For data preparation, please refer to Data: FineHARD.
Our training and inference code is built entirely on Hugging Face's transformers library, making it easy to use and reproduce. Training scripts are provided in the scripts directory.
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Our training scripts support ZeRO-2, TF32 acceleration, and bf16 precision (note that fp16 may cause NaN gradients). If your environment does not meet these requirements, disable TF32 and launch with torchrun instead of DeepSpeed.
bash scripts/train/stage2_fgclip2.sh
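If you adapt the launch to your own environment, the TF32 setting mentioned above maps to standard PyTorch switches; the snippet below is a generic illustration rather than part of the official scripts (bf16 is normally selected through the training arguments of the script itself).

import torch

# Enable TF32 matmuls/convolutions on Ampere or newer GPUs;
# set both flags to False if your hardware or driver does not support TF32.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True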
- Download share-captioner_coco_lcs_sam_1246k_1107.json from https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/share-captioner_coco_lcs_sam_1246k_1107.json
- Download the COCO Captions annotations from https://github.com/tylin/coco-caption and place them in data/coco/annotations/
- Download COCO from https://cocodataset.org/dataset and place it in data/coco
- The DCI captions come from https://github.com/facebookresearch/DCI; place them in data/densely_captioned_images
- Download the ImageNet-1K validation set and place it in data/IN1K_val
- Download ImageNet-V2 from https://opendatalab.com/OpenDataLab/ImageNetV2/tree/main and place it in data/imagenetv2-matched-frequency-format-val
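After these steps, the evaluation data directory is expected to look roughly as follows (a sketch assembled from the paths above; the exact files inside each folder depend on the individual downloads):

data/
├── coco/
│   └── annotations/                             # COCO Captions annotation files
├── densely_captioned_images/                    # DCI captions
├── IN1K_val/                                    # ImageNet-1K validation images
└── imagenetv2-matched-frequency-format-val/     # ImageNet-V2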
bash scripts/eval/eval.sh
We are recruiting academic interns in multimodal learning. If you are interested, please send your resume to [email protected].
If you find FG-CLIP 2 helpful in your research or applications, please cite it using the following BibTeX:
@article{xie2025fg2,
title={FG-CLIP 2: A Bilingual Fine-grained Vision-language Alignment Model},
author={Xie, Chunyu and Wang, Bin and Kong, Fanjing and Li, Jincheng and Liang, Dawei and Ao, Ji and Leng, Dawei and Yin, Yuhui},
journal={arXiv preprint arXiv:2510.10921},
year={2025}
}
@article{xie2025fg,
title={FG-CLIP: Fine-Grained Visual and Textual Alignment},
author={Xie, Chunyu and Wang, Bin and Kong, Fanjing and Li, Jincheng and Liang, Dawei and Zhang, Gengshen and Leng, Dawei and Yin, Yuhui},
journal={arXiv preprint arXiv:2505.05071},
year={2025}
}