
Adding mplugdocowl #31792

Open · wants to merge 98 commits into main

Conversation

danaaubakirova

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

danaaubakirova and others added 30 commits May 27, 2024 09:35
…owl.py


fix: removed cos, sin cached

Co-authored-by: Pablo Montalvo <[email protected]>
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jp1924
Contributor

jp1924 commented Jul 23, 2024

Hi there, @zucchini-nlp @danaaubakirova

Would it be okay if I also participate in this PR?
I have experience with writing the Processor and code for the previous version of DocOwl, UReader, so I believe I can be of help.

Additionally, when coding the Processor and Model, how about referring to Llava-NEXT?
Like DocOwl, Llava-NEXT divides images into patches to handle resolution, so you might find solutions to various issues you've encountered while implementing DocOwl on HuggingFace.

Thank you.

@danaaubakirova
Author

Hello @jp1924,

Thank you for reaching out and for your suggestions. This PR is almost complete. However, I look forward to collaborating with you next time.

Best,

@jp1924
Contributor

jp1924 commented Jul 25, 2024

Thank you for considering it! I hope we have the opportunity to collaborate together in the future!

@molbap molbap self-requested a review July 29, 2024 07:51

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ArthurZucker
Collaborator

cc @zucchini-nlp not sure what the state of this PR is, were you waiting for @molbap's review? (he is off for 2 more weeks, I think)

@danaaubakirova
Author

@ArthurZucker Hello! Yes, I completed this around 1-2 months ago, notified @molbap, and was waiting for his review.

@ArthurZucker
Collaborator

Hey! @zucchini-nlp will have another look before I do the final review! 🤗 (Pablo is 🌴 off for a while)

@zucchini-nlp
Member

@danaaubakirova I added the expansion logic in processors, so now we don't need the merge_inputs method in the modeling code. I didn't review the PR, but here are some general comments from what I saw while adding the code.

  • I added cache_position in one Module, but we need to pass it further into the LM so it is actually used to update the cache. I can help with that one if you are low on bandwidth :)
  • The modality_indicators can be prepared in the processor logic, and we can call it token_type_ids. It is very similar to what we did in CogVLM and somewhat to Paligemma. It would be a lot cleaner that way.
  • I am not sure the model is expected to work when no image is passed, because the processing assigns None to pixels and later tries to get keys from the None dict. Can you check, please, as the current logic I implemented also doesn't expect no-image inputs?
  • I changed the order of inputs for the processor to be image, text rather than text, image, but we still need to use the new standardized ProcessorKwargs as noted by @molbap earlier.
  • Lastly, we need to add GenerationTesterMixin, because the latest VLMs should have no problem enabling it. See Llava Onevision: add model #32673 for reference.
  • Overall, the docstrings and formatting need a cleanup before review, e.g. there are two MPLUGDOCOWL_INPUTS_DOCSTRING definitions with different casing.

Let me know if you need any help, or if the PR is ready for review after the above comments are addressed 🤗
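The token_type_ids suggestion above can be sketched roughly as follows; this is a hypothetical, minimal plain-Python version (the real processor works on tensors), where `image_token_id` stands in for whatever placeholder id the actual config uses:

```python
def build_token_type_ids(input_ids, image_token_id):
    """Mark each position as image (1) or text (0).

    Hypothetical sketch of preparing modality indicators in the
    processor rather than in the modeling code; `image_token_id` is an
    assumed placeholder, not the model's real config value.
    """
    return [[int(tok == image_token_id) for tok in seq] for seq in input_ids]

# 99 stands in for the image placeholder token id
print(build_token_type_ids([[5, 7, 99, 99, 8]], image_token_id=99))
# → [[0, 0, 1, 1, 0]]
```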

@molbap
Contributor

molbap commented Sep 18, 2024

Hey @danaaubakirova @zucchini-nlp! I can answer on a (very) few points here

  • cache_position passing should be doable yes. Decoder is llama-like.
  • modality_indicators to token_type_ids, I'm not sure it's a 1-to-1 mapping, might be tricky to do.
  • 90% sure the model always expects an image. It's a document analysis model first and foremost, not a text-only conversational model.
  • processors kwargs uniformization is doable. I can help a bit on that.
  • for the generation tester mixin, got to go test by test and see what fails. Not sure what is robust right now.
  • I can help on docstring formatting.

Overall this looks very close to being done, let's push it over the finish line!

@zucchini-nlp
Member

for the generation tester mixin, got to go test by test and see what fails. Not sure what is robust right now.

We're almost there, current tests should work in text-only cases, for example when LLaVA generates from text and no images. For multimodality I opened a PR just today, so I hope it will be there soon (#33533).

LMK if some things start failing; it's always better to know why something fails, so we can see if we need fixes on the generation side :)

6 participants