
Adding mplugdocowl #31792

Open · wants to merge 98 commits into main

Conversation

danaaubakirova

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

danaaubakirova and others added 30 commits May 27, 2024 09:35
…owl.py


fix: removed cos, sin cached

Co-authored-by: Pablo Montalvo <[email protected]>
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jp1924
Contributor

jp1924 commented Jul 23, 2024

Hi there, @zucchini-nlp @danaaubakirova

Would it be okay if I also participate in this PR?
I have experience with writing the Processor and code for the previous version of DocOwl, UReader, so I believe I can be of help.

Additionally, when coding the Processor and Model, how about referring to Llava-NEXT?
Like DocOwl, Llava-NEXT divides images into patches to handle resolution, so you might find solutions to various issues you've encountered while implementing DocOwl on HuggingFace.

Thank you.

@danaaubakirova
Author

Hello @jp1924,

Thank you for reaching out and for your suggestions. This PR is almost complete. However, I look forward to collaborating with you next time.

Best,

@jp1924
Contributor

jp1924 commented Jul 25, 2024

Thank you for considering it! I hope we have the opportunity to collaborate together in the future!

@molbap molbap self-requested a review July 29, 2024 07:51

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ArthurZucker
Collaborator

cc @zucchini-nlp not sure what the state of this PR is, were you waiting for @molbap's review? (he is off for 2 more weeks, I think)

@danaaubakirova
Author

@ArthurZucker Hello! Yes, I completed this around 1-2 months ago, notified @molbap, and was waiting for his review.

@ArthurZucker
Collaborator

Hey! @zucchini-nlp will have another look before I do the final review! 🤗 (Pablo is 🌴 off for a while)

@zucchini-nlp
Member

@danaaubakirova I added the expansion logic in processors, so now we don't need the merge_inputs method in the modeling code. I didn't review the PR, but here are some general comments from what I saw while adding the code.

  • I added cache_position in one Module, but we need to pass it further into the LM so it is actually used to update the cache. I can help with that one if you are low on bandwidth :)
  • The modality_indicators can be prepared in the processor logic, and we can call it token_type_ids. It is very similar to what we did in CogVLM and somewhat to Paligemma. It would be a lot cleaner that way.
  • I am not sure the model is expected to work when no image is passed, because the processing assigns None to pixels and later tries to get keys from the None dict. Can you check, please, as the current logic I implemented also doesn't expect no-image inputs?
  • I changed the order of inputs for the processor to be image, text rather than text, image, but we still need to use the new standardized ProcessorKwargs as noted by @molbap earlier.
  • Lastly, we need to add GenerationTesterMixin, because the latest VLMs should have no problem enabling it. See Llava Onevision: add model #32673 for reference.
  • Overall, the docstrings and formatting need a cleanup before review, e.g. there are two MPLUGDOCOWL_INPUTS_DOCSTRING definitions with different casing.

Let me know if you need any help, or if the PR is ready for review after the above comments are addressed 🤗
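The token_type_ids suggestion above can be sketched roughly as follows; this is a hypothetical, minimal plain-Python version (the real processor works on tensors), where `image_token_id` stands in for whatever placeholder id the actual config uses:

```python
def build_token_type_ids(input_ids, image_token_id):
    """Mark each position as image (1) or text (0).

    Hypothetical sketch of preparing modality indicators in the
    processor rather than in the modeling code; `image_token_id` is an
    assumed placeholder, not the model's real config value.
    """
    return [[int(tok == image_token_id) for tok in seq] for seq in input_ids]

# 99 stands in for the image placeholder token id
print(build_token_type_ids([[5, 7, 99, 99, 8]], image_token_id=99))
# → [[0, 0, 1, 1, 0]]
```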

@molbap
Contributor

molbap commented Sep 18, 2024

Hey @danaaubakirova @zucchini-nlp! I can answer on a (very) few points here

  • cache_position passing should be doable yes. Decoder is llama-like.
  • modality_indicators to token_type_ids, I'm not sure it's a 1-to-1 mapping, might be tricky to do.
  • 90% sure the model always expects an image. It's a document analysis model first and foremost, not a text-only conversational model.
  • processors kwargs uniformization is doable. I can help a bit on that.
  • for the generation tester mixin, got to go test by test and see what fails. Not sure what is robust right now.
  • I can help on docstring formatting.

Overall this looks very close to being done, let's push it over the finish line!

@zucchini-nlp
Member

for the generation tester mixin, got to go test by test and see what fails. Not sure what is robust right now.

We're almost there, current tests should work in text-only cases, for example when LLaVA generates from text and no images. For multimodality I opened a PR just today, so I hope it will be there soon (#33533).

LMK if some things start failing; it's always better to know why something fails, so we can see if we need fixes on the generation side :)

6 participants