IAA: Inner-Adaptor Architecture

This repository is the official implementation of IAA: Inner-Adaptor Architecture.

IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities
Bin Wang*, Chunyu Xie*, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)

We propose an MLLM based on the Inner-Adaptor Architecture (IAA). IAA demonstrates that training with a frozen language model can surpass models with fine-tuned LLMs on both multimodal comprehension and visual grounding tasks. Moreover, at deployment our approach provides multiple workflows, which preserves the NLP proficiency of the language model. With a single download, the model can be fine-tuned to meet various task specifications. Enjoy a seamless experience with our IAA model.

🔥 News

[2024/08/29] We released IAA to the Hugging Face community! 🤗
[2024/08/29] We have updated the IAA GitHub repository, and you can now test our models!
[2024/08/26] We released the IAA: Inner-Adaptor Architecture paper.

Install

conda create -n IAA python=3.10 -y
conda activate IAA
bash deploy.sh

Model Performance

Main Results on General Multimodal Benchmarks.

Results on Visual Grounding Benchmarks.

Comparison on text-only question answering.

Quick Start 🤗

First, pull our model:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from PIL import Image

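checkpoint = "qihoo360/iaa-14-hf"

# Load the model in half precision on the GPU; trust_remote_code is needed
# because the checkpoint ships its own modeling code.
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map='cuda', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

# Load the vision tower and move it to the same device and dtype as the model.
vision_tower = model.get_vision_tower()
vision_tower.load_model()
vision_tower.to(device="cuda", dtype=torch.float16)
image_processor = vision_tower.image_processor

# Reuse the EOS token as the padding token.
tokenizer.pad_token = tokenizer.eos_token

# Stop generation at the end-of-turn token.
terminators = [
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]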

Multimodal Workflow: task_type="MM"

image = Image.open("readpanda.jpg").convert('RGB')
query = "What animal is in the picture?"

inputs = model.build_conversation_input_ids(tokenizer, query=query, image=image, image_processor=image_processor)

input_ids = inputs["input_ids"].to(device='cuda', non_blocking=True)
images = inputs["image"].to(dtype=torch.float16, device='cuda', non_blocking=True)

output_ids = model.generate(
    input_ids,
    task_type="MM",
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)

Grounding Workflow: task_type="G"

image = Image.open("COCO_train2014_000000014502.jpg").convert('RGB')
query = "Please provide the bounding box coordinate of the region this sentence describes: dude with black shirt says circa."

inputs = model.build_conversation_input_ids(tokenizer, query=query, image=image, image_processor=image_processor)

input_ids = inputs["input_ids"].to(device='cuda', non_blocking=True)
images = inputs["image"].to(dtype=torch.float16, device='cuda', non_blocking=True)

output_ids = model.generate(
    input_ids,
    task_type="G",
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)
input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
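
The grounding workflow prints the model's answer as plain text. The snippet below is a minimal sketch for turning such a string into pixel coordinates. It assumes, without confirmation from this README, that the answer contains four numbers in the order x1, y1, x2, y2, normalized to the range [0, 1]. `parse_bbox` is a hypothetical helper and is not part of the IAA repository.

```python
import re


def parse_bbox(text, image_width, image_height):
    """Extract the first four numbers from `text` and scale them to pixels.

    Assumption: the numbers are x1, y1, x2, y2 normalized to [0, 1].
    Returns None when fewer than four numbers are found.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    if len(numbers) < 4:
        return None
    x1, y1, x2, y2 = (float(n) for n in numbers[:4])
    return (
        round(x1 * image_width),
        round(y1 * image_height),
        round(x2 * image_width),
        round(y2 * image_height),
    )
```

If the model's actual output uses a different convention (for example absolute pixels or a 0 to 1000 scale), only the scaling step needs to change.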

Text-only Workflow: task_type="Text"

query = "What is the approximate weight of an adult red panda?"
inputs = model.build_conversation_input_ids(tokenizer, query=query)

input_ids = inputs["input_ids"].to(device='cuda', non_blocking=True)
images = None


output_ids = model.generate(
    input_ids,
    task_type="Text",
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
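
The three workflows above differ only in `task_type` and in whether an image tensor is passed. The following is a minimal sketch that gathers the shared generation arguments in one place. `build_generate_kwargs` is a hypothetical helper, and the rule that "Text" takes no image while "MM" and "G" require one is inferred from the examples above, not from the model's code.

```python
VALID_TASK_TYPES = ("MM", "G", "Text")


def build_generate_kwargs(task_type, images, terminators, max_new_tokens=512):
    """Return the keyword arguments used by the workflow examples above."""
    if task_type not in VALID_TASK_TYPES:
        raise ValueError(f"task_type must be one of {VALID_TASK_TYPES}, got {task_type!r}")
    # Assumption drawn from the examples: text-only runs pass images=None.
    if task_type == "Text" and images is not None:
        raise ValueError("the Text workflow does not take images")
    if task_type != "Text" and images is None:
        raise ValueError(f"the {task_type} workflow needs an image tensor")
    return {
        "task_type": task_type,
        "images": images,
        "do_sample": False,  # greedy decoding, as in the examples
        "eos_token_id": terminators,
        "num_beams": 1,
        "max_new_tokens": max_new_tokens,
        "use_cache": True,
    }
```

With the model loaded as shown earlier, a call would look like `model.generate(input_ids, **build_generate_kwargs("MM", images, terminators))`.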

CLI Inference

Chat about images using IAA without a Gradio interface.

name="qihoo360/iaa-14-hf"
python -m iaa.eval.infer \
    --model-path $name \
    --image-path testimg/readpanda.jpg \
    --task_type MM

name="qihoo360/iaa-14-hf"

python -m iaa.eval.infer_interleave \
    --model-path $name \
    --image-path testimg/COCO_train2014_000000014502.jpg
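
To run the first command over several images from a script, the argument list can be assembled in Python. This is a minimal sketch that uses only the flags shown above; `build_infer_command` is a hypothetical helper, not something the repository provides.

```python
import sys


def build_infer_command(model_path, image_path, task_type="MM"):
    """Build the argument list for `python -m iaa.eval.infer`."""
    return [
        sys.executable, "-m", "iaa.eval.infer",
        "--model-path", model_path,
        "--image-path", image_path,
        "--task_type", task_type,
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, presumably once per image.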

Evaluation

First, download the MME images from the following link to ./MME/MME_Benchmark_release_version: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation

bash scripts/mme.sh

For RefCOCO testing, download the data by following the instructions at https://github.com/lichengunc/refer

bash scripts/refcoco.sh
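
RefCOCO-style grounding is commonly scored by counting a prediction as correct when its intersection over union (IoU) with the ground-truth box is at least 0.5. Whether scripts/refcoco.sh uses exactly this rule is an assumption; the sketch below only illustrates the metric, with boxes given as (x1, y1, x2, y2) and hypothetical function names.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(predictions, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the target meets the threshold."""
    if not predictions:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(predictions, targets))
    return hits / len(predictions)
```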

We Are Hiring

We are seeking academic interns in the Multimodal field. If interested, please send your resume to [email protected].

Citation

If you find IAA useful for your research and applications, please cite using this BibTeX:

@article{Wang2024IAA,
  title={IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities},
  author={Bin Wang and Chunyu Xie and Dawei Leng and Yuhui Yin},
  journal={arXiv preprint arXiv:2408.12902},
  year={2024},
}

License

This project uses certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. The content of this project itself is licensed under the Apache License 2.0.

Related Projects

This work wouldn't be possible without the incredible open-source code of these projects. Huge thanks!
