Auto model & pipeline for image-text-to-image-text models #32926

leloykun · 2024-08-22T01:38:44Z

Feature request

This is a tracker issue for work on interleaved in-and-out image-text generation.

There are now at least 5 open-source models that can do interleaved image-text generation, and many more are expected to be released. Thus, it would now be practical & useful for us to (1) add native support for such models and (2) standardize the logic flow of data through processors and pipelines as done in #31911 and #32472.

Model	Github	Notes	PR
Anole	https://github.com/GAIR-NLP/anole	-	#32013
Chameleon	https://github.com/facebookresearch/chameleon	-	#32013
Llava-NeXT-Interleaved	https://github.com/LLaVA-VL/LLaVA-NeXT	-	-
Lumina-mGPT	https://github.com/Alpha-VLLM/Lumina-mGPT	-	-
Show-o	https://github.com/showlab/Show-o	-	-
Transfusion	-	Not open-source (yet, perhaps)	-
XGen-MM	https://github.com/salesforce/LAVIS/tree/xgen-mm	The paper & the GitHub repo don't demonstrate interleaved image-text generation yet, but the model was trained on such datasets & its architecture is well suited for it	-

Initial work for Chameleon & Anole can be found in #32013.

Notes:

We explicitly exclude models that can only do text-only generation or image-only generation. We also exclude models that can do image-text generation but not in an interleaved manner.
As I've demonstrated in my repo, explicitly implementing the Finite State Machine (FSM) for switching between text-generation and image-generation modes, as done in Chameleon's repo, is not necessary. Implicitly implementing the FSM with Logits Processors suffices, although more work is needed to find the most efficient implementation.
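To illustrate the implicit-FSM idea in the note above, here is a minimal sketch. It is not the code from #32013 or from my repo, and it does not use the `transformers` API directly; it only mirrors the shape of a logits processor, which receives the tokens generated so far plus the next-token scores and returns modified scores. All token ids, the vocabulary size, and the fixed image length are made-up values for illustration. The generation mode is never stored anywhere: it is re-derived from the token history on every step, and disallowed tokens are masked to `-inf`.

```python
import math

# Hypothetical token ids, chosen only for illustration; real models define their own.
BOI, EOI = 1, 2                    # begin-of-image / end-of-image markers
IMAGE_TOKENS = set(range(10, 14))  # stand-in for image codebook ids
TEXT_TOKENS = set(range(20, 24))   # stand-in for text vocabulary ids
IMAGE_SEQ_LEN = 3                  # image tokens per image (assumed fixed)
VOCAB_SIZE = 24


def allowed_next_tokens(generated):
    """Infer the current mode from the token history and return the allowed ids."""
    last_boi = max((i for i, t in enumerate(generated) if t == BOI), default=-1)
    last_eoi = max((i for i, t in enumerate(generated) if t == EOI), default=-1)
    if last_boi > last_eoi:
        # An image block is open: count the image tokens emitted since BOI.
        n_image = len(generated) - last_boi - 1
        if n_image < IMAGE_SEQ_LEN:
            return IMAGE_TOKENS
        return {EOI}  # image is complete, so force the closing marker
    # Text mode: any text token, or BOI to start a new image.
    return TEXT_TOKENS | {BOI}


def mask_scores(generated, scores):
    """Logits-processor-style hook: set disallowed token scores to -inf."""
    allowed = allowed_next_tokens(generated)
    return [s if i in allowed else -math.inf for i, s in enumerate(scores)]


if __name__ == "__main__":
    scores = [0.0] * VOCAB_SIZE
    for history in ([20, 21], [20, BOI], [20, BOI, 10, 11, 12]):
        masked = mask_scores(history, scores)
        print(history, [i for i, s in enumerate(masked) if s > -math.inf])
```

Because the mode is recomputed from the full history at each step, this version does redundant work; caching the state between steps is one possible optimization, which is the kind of efficiency question the note leaves open.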

TODOs:

Motivation

To make benchmarking and evaluating models on interleaved image-text tasks saner
To continue work on Multimodal In-and-Out, Interleaved Structured Generation: https://github.com/leloykun/mmsg

Your contribution

I've already started work on Chameleon & Anole here: #32013

But I'm currently blocked by (1) not having enough time due to other responsibilities and (2) not having enough compute resources.

Any help would be appreciated!

zucchini-nlp · 2024-08-22T07:16:15Z

FYI @NielsRogge and @merveenoyan, you've recently been discussing tags for these kinds of models on the Hub

leloykun added the Feature request label Aug 22, 2024

zucchini-nlp mentioned this issue Sep 8, 2024

Support Unified Multimodal Model #33368

Open
