Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core][Frontend][Doc] Initial support for LLaVA-NeXT and GPT-4V Chat Completions API #3978

Closed

Conversation

DarkLight1337
Copy link
Member

@DarkLight1337 DarkLight1337 commented Apr 10, 2024

To combat scope creep, this PR has been split into smaller ones.

The branch associated with this PR has been frozen (except for critical fixes). Once all dependencies have been merged, I will compare this branch against the merged (main) branch to verify that I didn't miss any changes.

- Refactor `OpenAIServingChat` and add function for loading image
- Move `pillow` dev dependency to common
- Add example chat template for LLaVA model
- Add general guide for using VLMs
- Add LLavA to list of supported models
- Move `ServerRunner` to common file
@DarkLight1337 DarkLight1337 changed the title [Doc][Frontend] Extexnd OpenAI-compatible server to support GPT-4V Chat Completions API [Doc][Frontend] Support GPT-4V Chat Completions API Apr 10, 2024
@ywang96 ywang96 self-assigned this Apr 10, 2024
@DarkLight1337 DarkLight1337 changed the title [Core][Frontend][Doc] Support image processing for VLMs and GPT-4V Chat Completions API [Core][Frontend][Doc] Improved support for VLMs and add GPT-4V Chat Completions API Apr 18, 2024
@DarkLight1337 DarkLight1337 changed the title [Core][Frontend][Doc] Improved support for VLMs and add GPT-4V Chat Completions API [Core][Frontend][Doc] Initial support for LLaVA-NeXT and add GPT-4V Chat Completions API Apr 18, 2024
@DarkLight1337
Copy link
Member Author

DarkLight1337 commented Apr 18, 2024

I have just added support for LLaVA-NeXT, with one big caveat: the size of the input image is fixed, otherwise the feature size (i.e. number of <image> tokens to duplicate) would vary depending on the runtime input. This prevents us from taking full advantage of the extra resolution. Still, this provides us access to a 34b model which should improve over their 7b and 13b LLaVA-1.5 models.

@DarkLight1337 DarkLight1337 changed the title [Core][Frontend][Doc] Initial support for LLaVA-NeXT and add GPT-4V Chat Completions API [Core][Frontend][Doc] Initial support for LLaVA-NeXT and GPT-4V Chat Completions API Apr 18, 2024
@DarkLight1337 DarkLight1337 force-pushed the openai-vision-api branch 5 times, most recently from f66a08f to 72eb712 Compare April 19, 2024 05:01
@DarkLight1337
Copy link
Member Author

DarkLight1337 commented Apr 19, 2024

These force pushes consolidate the fixes to the LLaVA test and example code.

- Note that we now load the images directly instead of from `.pt` files
@DarkLight1337
Copy link
Member Author

DarkLight1337 commented Apr 19, 2024

@ywang96 I think that this PR is suffering from scope creep. Perhaps I should break apart the changes into smaller segments to facilitate the conversation in #4194? I could split the changes as follows, with each item being its own PR:

  1. VLM backend
    a. Refactor MultiModalData to support image processing; refactor LLaVA-1.5 accordingly. ([Core] Support image processor #4197)
    b. Introduce LLaVA-NexT along with the refactored LLaVA-1.5 ([Model] Initial support for LLaVA-NeXT #4199) [depends on 1(a)]
  2. OpenAI API server
    a. Refactor OpenAPI backend ([Frontend] Refactor prompt processing #4028)
    b. Add GPT-4V support and provide LLaVA chat template ([Frontend] Support GPT-4V Chat Completions API #4200) [depends on 1(a) 2(a)]

Edit: Added links to the child PRs.

@ywang96
Copy link
Member

ywang96 commented Apr 19, 2024

@ywang96 I think that this PR is suffering from scope creep. Perhaps I should break apart the changes into smaller segments to facilitate the conversation in #4194? I could split the changes as follows (listed in the form of a dependency tree):

  1. VLM backend
    a. Refactor MultiModalData to support image processing.
    b. Introduce LLaVA-NexT along with the refactored LLaVA-1.5
  2. OpenAI API server
    a. Refactor OpenAPI backend (i.e. Support VLM model and GPT4V API #2058)
    b. Add GPT-4V support [also depends on 1(a)]

I agree - I think OpenAI API server will be a good starting point since the interface should agree with OpenAI protocol anyways, and I'm sorry that this PR suffered :/

One suggestion I have is for a big change like this - it's probably good to have a series of PRs anyways. Take a look at Speculative decoding or Chunked Prefill - those are great examples.

@DarkLight1337
Copy link
Member Author

DarkLight1337 commented Apr 19, 2024

I have created the child PRs.

- These changes are propagated to the child PRs
@DarkLight1337
Copy link
Member Author

All of the child PRs have been completed, so I'm closing this now.

@DarkLight1337 DarkLight1337 deleted the openai-vision-api branch June 10, 2024 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants