Auto model & pipeline for image-text-to-image-text models #32926

Open
2 of 14 tasks
leloykun opened this issue Aug 22, 2024 · 1 comment
Labels
Feature request Request for a new feature

Comments

@leloykun
Contributor

leloykun commented Aug 22, 2024

Feature request

This is a tracker issue for work on interleaved in-and-out image-text generation.

There are now at least five open-source models that can do interleaved image-text generation, and many more are expected to be released. It would therefore be practical and useful for us to (1) add native support for such models and (2) standardize the flow of data through processors and pipelines, as done in #31911 and #32472.

| Model | Github | Notes | PR |
| --- | --- | --- | --- |
| Anole | https://github.com/GAIR-NLP/anole | - | #32013 |
| Chameleon | https://github.com/facebookresearch/chameleon | - | #32013 |
| Llava-NeXT-Interleaved | https://github.com/LLaVA-VL/LLaVA-NeXT | - | - |
| Lumina-mGPT | https://github.com/Alpha-VLLM/Lumina-mGPT | - | - |
| Show-o | https://github.com/showlab/Show-o | - | - |
| Transfusion | - | Not open-source (yet, perhaps) | - |
| XGen-MM | https://github.com/salesforce/LAVIS/tree/xgen-mm | The paper & the GitHub repo don't actually demonstrate interleaved image-text generation yet, but the model was trained on such datasets & its architecture(s) is perfectly suited for it | - |

For reference, initial work on Chameleon & Anole can be found in #32013.

Notes:

  • We explicitly exclude models that can only do text-only generation or image-only generation. We also exclude models that can do image-text generation but not in an interleaved manner.
  • As I've demonstrated in my repo, explicitly implementing the Finite State Machine (FSM) for switching between text-generation and image-generation modes, as done in Chameleon's repo, is not necessary. Implementing the FSM implicitly with Logits Processors suffices, although more work is needed to find the most efficient implementation.
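
To illustrate the second note, here is a minimal, dependency-free sketch of an "implicit" FSM: the mode (text vs. image) is never stored, it is re-derived from the tokens generated so far, and disallowed tokens are masked to `-inf`. Everything here is an assumption made for illustration: the vocabulary layout, the `BOI`/`EOI` marker ids, `IMAGE_SEQ_LEN`, and the function names are hypothetical and are not taken from Chameleon, Anole, or #32013. A real logits processor in `transformers` would operate on batched `torch` tensors rather than Python lists.

```python
import math

# Hypothetical vocabulary layout, for illustration only.
VOCAB_SIZE = 16
BOI, EOI = 1, 2                        # begin-of-image / end-of-image markers
TEXT_TOKENS = {0} | set(range(3, 8))   # ordinary text tokens
IMAGE_TOKENS = set(range(8, 16))       # discrete image codes
IMAGE_SEQ_LEN = 4                      # image codes per image (assumed fixed)


def allowed_next_tokens(generated):
    """Derive the current mode from the generated ids and return the allowed next ids."""
    last_boi = -1  # index of the most recent unclosed BOI, or -1 if in text mode
    for i, token in enumerate(generated):
        if token == BOI:
            last_boi = i
        elif token == EOI:
            last_boi = -1

    if last_boi == -1:
        # Text mode: emit text, or open a new image.
        return TEXT_TOKENS | {BOI}

    n_image_tokens = len(generated) - last_boi - 1
    if n_image_tokens < IMAGE_SEQ_LEN:
        # Image mode: only image codes until the image is complete.
        return IMAGE_TOKENS
    # Image complete: force the closing marker.
    return {EOI}


def mask_logits(generated, logits):
    """Logits-processor-style hook: set disallowed token scores to -inf."""
    allowed = allowed_next_tokens(generated)
    return [score if i in allowed else -math.inf for i, score in enumerate(logits)]
```

Calling `mask_logits` at every decoding step is enough to keep generation inside valid text and image spans, with no separate state object to maintain. The cost is re-scanning the sequence each step, which is one place where a more efficient implementation would help.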

TODOs:

Motivation

  1. To make benchmarking and evaluating models on interleaved image-text tasks saner
  2. To continue work on Multimodal In-and-Out, Interleaved Structured Generation: https://github.com/leloykun/mmsg

Your contribution

I've already started work on Chameleon & Anole here: #32013

But I'm currently blocked by (1) not having enough time due to other responsibilities and (2) not having enough compute resources.

Any help would be appreciated!

leloykun added the "Feature request" label Aug 22, 2024
@zucchini-nlp
Member

FYI @NielsRogge and @merveenoyan, you've recently been discussing tags for these kinds of models on the hub
