Adding mplugdocowl #31792

Open — wants to merge 98 commits into base: main

Commits (98)
b311e5e
feat: adding mplugdocowl
danaaubakirova May 27, 2024
aa0ec04
feat: added separate file for the mPLUGDocOwl language model
danaaubakirova May 27, 2024
cc7e9b3
feat: added vision encoder for mplugdocowl
danaaubakirova May 27, 2024
204daba
fix: changed the attention mechanism in clip vision, renamed to MPLUG…
danaaubakirova May 28, 2024
6e144e5
feat: added hreducer and new things in config, changed vision embeddi…
danaaubakirova May 28, 2024
9f94d2c
fix: converted hreducer module related tensors to contiguous
danaaubakirova May 29, 2024
19ffc83
feat: added shape adaptive module
danaaubakirova May 31, 2024
85dce8d
feat: added new image_processing script
danaaubakirova Jun 3, 2024
0f5fb87
Update src/transformers/models/mplugdocowl/image_processing_mplugdoco…
danaaubakirova Jun 4, 2024
53aca6d
fix: small fix
danaaubakirova Jun 4, 2024
cb25b05
Merge branch 'mplugdocowl' of github.com:danaaubakirova/transformers …
danaaubakirova Jun 4, 2024
1debae3
feat: added the additional keys to the output of the data
danaaubakirova Jun 4, 2024
66b849d
feat: made major modifications to image_processing script. added the …
danaaubakirova Jun 6, 2024
1716668
feat: refactored shape_adaptive_cropping function and resolved the is…
danaaubakirova Jun 10, 2024
452ebf5
feat: testing forward
danaaubakirova Jun 11, 2024
1e7f386
feat: corrected image tag
danaaubakirova Jun 12, 2024
8577f35
fix: attention mask handling is fixed, .forward works
danaaubakirova Jun 13, 2024
f546fbc
feat: updates in vision architecture
danaaubakirova Jun 18, 2024
edc358d
Update src/transformers/models/mplugdocowl/language_modeling_mplugdoc…
danaaubakirova Jun 19, 2024
9003d59
fix: renaming the model
danaaubakirova Jun 19, 2024
9f688d9
grand fix: fixed hreducer, the firstgenerated token is correct. forw…
danaaubakirova Jun 21, 2024
30c8a2b
fix: need to fix prepare_inputs_for_generation()
danaaubakirova Jun 24, 2024
5483f82
fix: fixed prepare_inputs_for_generation()
danaaubakirova Jun 24, 2024
413ddad
Merge branch 'main' into mplugdocowl
danaaubakirova Jun 25, 2024
7546063
testing phase
danaaubakirova Jun 25, 2024
e3cc222
removed copied from ..
danaaubakirova Jun 25, 2024
4f4f219
small fixes
danaaubakirova Jun 25, 2024
661bd75
removed some things from the config
danaaubakirova Jun 26, 2024
8aded38
small fixes
danaaubakirova Jun 27, 2024
19e0a35
update
danaaubakirova Jun 27, 2024
8300463
small fix
danaaubakirova Jun 27, 2024
f0c87d8
Update tests/models/mplugdocowl/test_modeling_mplugdocowl.py
danaaubakirova Jun 27, 2024
b75b2b9
Update src/transformers/models/mplugdocowl/modeling_mplugdocowl.py
danaaubakirova Jun 27, 2024
2aae5ca
Update tests/models/mplugdocowl/test_modeling_mplugdocowl.py
danaaubakirova Jun 27, 2024
105b5e1
Update tests/models/mplugdocowl/test_modeling_mplugdocowl.py
danaaubakirova Jun 27, 2024
7a2f434
Update tests/models/mplugdocowl/test_modeling_mplugdocowl.py
danaaubakirova Jun 27, 2024
205e345
Update tests/models/mplugdocowl/test_modeling_mplugdocowl.py
danaaubakirova Jun 27, 2024
0f5ba22
Update src/transformers/models/mplugdocowl/processing_mplugdocowl.py
danaaubakirova Jun 27, 2024
c0e241a
Update src/transformers/models/mplugdocowl/processing_mplugdocowl.py
danaaubakirova Jun 27, 2024
1555e04
Update src/transformers/models/mplugdocowl/processing_mplugdocowl.py
danaaubakirova Jun 27, 2024
219d866
Update src/transformers/models/mplugdocowl/image_processing_mplugdoco…
danaaubakirova Jun 27, 2024
4600f75
Update src/transformers/models/mplugdocowl/convert_mplugdocowl_weight…
danaaubakirova Jun 27, 2024
cb55d49
Update src/transformers/models/mplugdocowl/language_modeling_mplugdoc…
danaaubakirova Jun 27, 2024
c4c711c
model card is updated. tips to be added
danaaubakirova Jun 28, 2024
3007178
fix
danaaubakirova Jun 28, 2024
cdcf2f6
added documentation,updated rotary embedding function, added ModelTest
danaaubakirova Jun 28, 2024
cc7681f
updated
danaaubakirova Jul 1, 2024
b77e2ba
processing updates for batches
danaaubakirova Jul 4, 2024
6c42032
fixes
danaaubakirova Jul 4, 2024
c4425be
removed 'copied from' for language models
danaaubakirova Jul 4, 2024
f20ea69
check_repo fixes
danaaubakirova Jul 4, 2024
59e34b6
resolving conflicts with main
danaaubakirova Jul 4, 2024
4c63d84
fix
danaaubakirova Jul 4, 2024
1af7e52
update
danaaubakirova Jul 4, 2024
fe70171
Merge branch 'main' into adding_mplugdocowl
danaaubakirova Jul 4, 2024
3237828
resolving conflicts
danaaubakirova Jul 4, 2024
4a67ed2
added mplugdocowl to image_proc_auto
danaaubakirova Jul 4, 2024
4c53f6c
fix
danaaubakirova Jul 4, 2024
a5c28c1
updates to image_processing and tokenizer
danaaubakirova Jul 5, 2024
3e278cc
update
danaaubakirova Jul 8, 2024
1ab8c2a
new
danaaubakirova Jul 9, 2024
9d16fca
generation related changes
danaaubakirova Jul 11, 2024
0ad9b7a
changes to test
danaaubakirova Jul 11, 2024
560602a
fixes
danaaubakirova Jul 11, 2024
6285349
Update src/transformers/models/mplugdocowl/language_modeling_mplugdoc…
danaaubakirova Jul 16, 2024
012a801
Update src/transformers/models/mplugdocowl/language_modeling_mplugdoc…
danaaubakirova Jul 16, 2024
e4d29d6
Update src/transformers/models/mplugdocowl/language_modeling_mplugdoc…
danaaubakirova Jul 16, 2024
fdec794
Update src/transformers/models/mplugdocowl/modeling_mplugdocowl.py
danaaubakirova Jul 16, 2024
229fd31
Update src/transformers/models/mplugdocowl/processing_mplugdocowl.py
danaaubakirova Jul 16, 2024
6dc4776
feedback fixes 1
danaaubakirova Jul 16, 2024
a0ab134
feedback fixes 2
danaaubakirova Jul 16, 2024
e9a4b2b
fixes after testing and running make fixup
danaaubakirova Jul 16, 2024
dd465f8
fixes tests passed
danaaubakirova Jul 17, 2024
83dd273
nit
danaaubakirova Jul 17, 2024
e78c3e3
small fix
danaaubakirova Jul 17, 2024
8c27f9b
Merge branch 'main' into adding_mplugdocowl
danaaubakirova Jul 17, 2024
b10658c
small fix
danaaubakirova Jul 17, 2024
3706879
doc fix
danaaubakirova Jul 17, 2024
91113e3
fixes related to doc
danaaubakirova Jul 17, 2024
b7a61df
nit
danaaubakirova Jul 17, 2024
102f5f6
fixes
danaaubakirova Jul 17, 2024
87c40b3
Update src/transformers/models/mplugdocowl/image_processing_mplugdoco…
danaaubakirova Jul 17, 2024
3aa4635
Update src/transformers/models/mplugdocowl/image_processing_mplugdoco…
danaaubakirova Jul 17, 2024
4b87998
Update src/transformers/models/mplugdocowl/image_processing_mplugdoco…
danaaubakirova Jul 17, 2024
da5411d
Update src/transformers/models/mplugdocowl/image_processing_mplugdoco…
danaaubakirova Jul 17, 2024
cb02ee6
Update src/transformers/models/mplugdocowl/image_processing_mplugdoco…
danaaubakirova Jul 17, 2024
47f552d
fix of the accepted commits.
danaaubakirova Jul 17, 2024
c2837ae
fix
danaaubakirova Jul 17, 2024
6a48b47
update, aded kwargs and support for quantization
danaaubakirova Jul 19, 2024
49acffb
update
danaaubakirova Jul 22, 2024
f2fed0d
Merge branch 'main' into adding_mplugdocowl
danaaubakirova Jul 23, 2024
dba858e
resolving comments, small fixes
danaaubakirova Jul 31, 2024
387beb9
fixup
danaaubakirova Jul 31, 2024
96d5c6e
Merge branch 'main' into adding_mplugdocowl
danaaubakirova Jul 31, 2024
7f0a993
Merge branch 'main' into adding_mplugdocowl
danaaubakirova Aug 1, 2024
389d049
copies fix
danaaubakirova Aug 1, 2024
8b5451a
doc fix
danaaubakirova Aug 1, 2024
cddfbdf
add expansion logic in processors
zucchini-nlp Sep 3, 2024
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -818,6 +818,8 @@
title: MatCha
- local: model_doc/mgp-str
title: MGP-STR
- local: model_doc/mplugdocowl
title: mPLUGDocOwl
- local: model_doc/nougat
title: Nougat
- local: model_doc/oneformer
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -214,6 +214,7 @@ Flax), PyTorch, and/or TensorFlow.
| [MobileNetV2](model_doc/mobilenet_v2) | ✅ | ❌ | ❌ |
| [MobileViT](model_doc/mobilevit) | ✅ | ✅ | ❌ |
| [MobileViTV2](model_doc/mobilevitv2) | ✅ | ❌ | ❌ |
| [mPLUGDocOwl](model_doc/mplugdocowl) | ✅ | ❌ | ❌ |
| [MPNet](model_doc/mpnet) | ✅ | ✅ | ❌ |
| [MPT](model_doc/mpt) | ✅ | ❌ | ❌ |
| [MRA](model_doc/mra) | ✅ | ❌ | ❌ |
75 changes: 75 additions & 0 deletions docs/source/en/model_doc/mplugdocowl.md
@@ -0,0 +1,75 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# mPLUG-DocOwl1.5

## Overview

The mPLUG-DocOwl1.5 model was proposed in [mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding](https://arxiv.org/pdf/2403.12895) by Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou.

mPLUG-DocOwl1.5 is a multimodal model designed for text-rich images. It features the H-Reducer vision-to-text module, which preserves spatial relationships and efficiently processes high-resolution document images by merging horizontally adjacent visual features.
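The horizontal-merging idea behind H-Reducer can be illustrated with a small, dependency-free sketch. This is a toy stand-in for the paper's strided 1×4 convolution (approximated here by concatenating neighbouring patch features), not the PR's actual implementation:

```python
def h_reduce(patch_grid, merge=4):
    """Merge `merge` horizontally adjacent patch features in each row.

    patch_grid: list of rows, each row a list of feature vectors (lists).
    Returns a grid with the same number of rows but W // merge columns,
    where each merged cell concatenates `merge` horizontal neighbours.
    Any remainder columns that do not fill a full group are dropped in
    this toy version.
    """
    reduced = []
    for row in patch_grid:
        reduced.append([
            sum((row[i + k] for k in range(merge)), [])  # concatenate neighbours
            for i in range(0, len(row) - len(row) % merge, merge)
        ])
    return reduced

# a 2x8 grid of 3-dim patch features becomes a 2x2 grid of 12-dim features
grid = [[[float(r), float(c), 1.0] for c in range(8)] for r in range(2)]
out = h_reduce(grid)
```

The row structure is untouched while the sequence length shrinks by the merge factor, which is what lets the LLM see high-resolution layouts with far fewer visual tokens.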

The model employs Unified Structure Learning with structure-aware parsing tasks and multi-grained text localization tasks, teaching it to parse text using line feeds, spaces, and extended Markdown syntax, which enhances the model's ability to correlate text with specific positions in the image.

DocOwl 1.5 undergoes a two-stage training process: Unified Structure Learning followed by Multi-task Tuning among Downstream Tasks. The high-quality DocReason25K dataset boosts reasoning abilities, allowing DocOwl 1.5-Chat to balance concise answers and detailed explanations.

The abstract from the paper is the following:

*Structure information is critical for understanding the semantics of text-rich images, such as documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for Visual Document Understanding are equipped with text recognition ability but lack general structure understanding abilities for text-rich document images. In this work, we emphasize the importance of structure information in Visual Document Understanding and propose the Unified Structure Learning to boost the performance of MLLMs. Our Unified Structure Learning comprises structure-aware parsing tasks and multi-grained text localization tasks across 5 domains: document, webpage, table, chart, and natural image. To better encode structure information, we design a simple and effective vision-to-text module H-Reducer, which can not only maintain the layout information but also reduce the length of visual features by merging horizontal adjacent patches through convolution, enabling the LLM to understand high-resolution images more efficiently. Furthermore, by constructing structure-aware text sequences and multi-grained pairs of texts and bounding boxes for publicly available text-rich images, we build a comprehensive training set DocStruct4M to support structure learning. Finally, we construct a small but high-quality reasoning tuning dataset DocReason25K to trigger the detailed explanation ability in the document domain. Our model DocOwl 1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks, improving the SOTA performance of MLLMs with a 7B LLM by more than 10 points in 5/10 benchmarks.*

Tips:

- DocOwl-Chat: for more accurate and stable generation, set `do_sample=False`. This checkpoint performs better than DocOwl-Omni on most samples.
- DocOwl-Omni: for optimal performance, use `do_sample=True` and `top_p=0.7`, as recommended in the original code.

This model was contributed by [danaaubakirova](https://huggingface.co/danaaubakirova).
The original code can be found [here](https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5).
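The commit history above also adds a shape-adaptive cropping module to the image processor. Its core selection step — choosing, from a fixed set of grid anchors, the crop layout whose aspect ratio best matches the input image — can be sketched roughly as follows (the anchor set and the log-ratio scoring below are illustrative assumptions, not the PR's exact code):

```python
import math

def pick_grid(width, height,
              anchors=((1, 1), (1, 2), (2, 1), (2, 2), (1, 4), (4, 1), (2, 4), (4, 2), (3, 3))):
    """Pick the (rows, cols) crop grid whose aspect ratio best matches the image.

    Each anchor is scored by the distance, in log space, between the image
    aspect ratio (width / height) and the grid aspect ratio (cols / rows);
    the anchor closest in shape to the image wins.
    """
    img_ratio = width / height
    return min(anchors, key=lambda rc: abs(math.log(img_ratio) - math.log(rc[1] / rc[0])))

# a wide, banner-like page favours a single row of crops
print(pick_grid(1600, 400))  # -> (1, 4)
```

Each selected cell is then cropped and encoded separately, which keeps a fixed per-crop resolution while covering documents of very different shapes.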


## MPLUGDocOwlConfig

[[autodoc]] MPLUGDocOwlConfig

## MPLUGDocOwlImageProcessor
[[autodoc]] MPLUGDocOwlImageProcessor

## MPLUGDocOwlProcessor
[[autodoc]] MPLUGDocOwlProcessor

## MPLUGDocOwlHReducer
[[autodoc]] MPLUGDocOwlHReducer

## MPLUGDocOwlForCausalLM
[[autodoc]] MPLUGDocOwlForCausalLM
- forward

## MPLUGDocOwlLanguageModel
[[autodoc]] MPLUGDocOwlLanguageModel

## MPLUGDocOwlPreTrainedLanguageModel
[[autodoc]] MPLUGDocOwlPreTrainedLanguageModel

## MPLUGDocOwlVisionModel
[[autodoc]] MPLUGDocOwlVisionModel

## MPLUGDocOwlVisionTransformer
[[autodoc]] MPLUGDocOwlVisionTransformer

## MPLUGDocOwlForConditionalGeneration

[[autodoc]] MPLUGDocOwlForConditionalGeneration
- forward
Binary file added examples_multi_col_60204.png
34 changes: 34 additions & 0 deletions src/transformers/__init__.py
@@ -576,6 +576,10 @@
"models.mobilenet_v2": ["MobileNetV2Config"],
"models.mobilevit": ["MobileViTConfig"],
"models.mobilevitv2": ["MobileViTV2Config"],
"models.mplugdocowl": [
"MPLUGDocOwlConfig",
"MPLUGDocOwlProcessor",
],
"models.mpnet": [
"MPNetConfig",
"MPNetTokenizer",
@@ -1170,6 +1174,7 @@
_import_structure["models.mobilenet_v1"].extend(["MobileNetV1FeatureExtractor", "MobileNetV1ImageProcessor"])
_import_structure["models.mobilenet_v2"].extend(["MobileNetV2FeatureExtractor", "MobileNetV2ImageProcessor"])
_import_structure["models.mobilevit"].extend(["MobileViTFeatureExtractor", "MobileViTImageProcessor"])
_import_structure["models.mplugdocowl"].extend(["MPLUGDocOwlImageProcessor"])
_import_structure["models.nougat"].append("NougatImageProcessor")
_import_structure["models.oneformer"].extend(["OneFormerImageProcessor"])
_import_structure["models.owlv2"].append("Owlv2ImageProcessor")
@@ -2667,6 +2672,19 @@
"MobileViTV2PreTrainedModel",
]
)
_import_structure["models.mplugdocowl"].extend(
[
"MPLUGDocOwlAttention",
"MPLUGDocOwlForCausalLM",
"MPLUGDocOwlForConditionalGeneration",
"MPLUGDocOwlHReducer",
"MPLUGDocOwlLanguageModel",
"MPLUGDocOwlPreTrainedLanguageModel",
"MPLUGDocOwlPreTrainedModel",
"MPLUGDocOwlVisionModel",
"MPLUGDocOwlVisionTransformer",
]
)
_import_structure["models.mpnet"].extend(
[
"MPNetForMaskedLM",
@@ -5266,6 +5284,10 @@
from .models.mobilevitv2 import (
MobileViTV2Config,
)
from .models.mplugdocowl import (
MPLUGDocOwlConfig,
MPLUGDocOwlProcessor,
)
from .models.mpnet import (
MPNetConfig,
MPNetTokenizer,
@@ -5895,6 +5917,7 @@
MobileNetV2ImageProcessor,
)
from .models.mobilevit import MobileViTFeatureExtractor, MobileViTImageProcessor
from .models.mplugdocowl import MPLUGDocOwlImageProcessor
from .models.nougat import NougatImageProcessor
from .models.oneformer import OneFormerImageProcessor
from .models.owlv2 import Owlv2ImageProcessor
@@ -7122,6 +7145,17 @@
MobileViTV2Model,
MobileViTV2PreTrainedModel,
)
from .models.mplugdocowl import (
MPLUGDocOwlAttention,
MPLUGDocOwlForCausalLM,
MPLUGDocOwlForConditionalGeneration,
MPLUGDocOwlHReducer,
MPLUGDocOwlLanguageModel,
MPLUGDocOwlPreTrainedLanguageModel,
MPLUGDocOwlPreTrainedModel,
MPLUGDocOwlVisionModel,
MPLUGDocOwlVisionTransformer,
)
from .models.mpnet import (
MPNetForMaskedLM,
MPNetForMultipleChoice,
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -152,6 +152,7 @@
mobilenet_v2,
mobilevit,
mobilevitv2,
mplugdocowl,
mpnet,
mpt,
mra,
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -169,6 +169,7 @@
("mobilenet_v2", "MobileNetV2Config"),
("mobilevit", "MobileViTConfig"),
("mobilevitv2", "MobileViTV2Config"),
("mplugdocowl", "MPLUGDocOwlConfig"),
("mpnet", "MPNetConfig"),
("mpt", "MptConfig"),
("mra", "MraConfig"),
@@ -461,6 +462,7 @@
("mobilenet_v2", "MobileNetV2"),
("mobilevit", "MobileViT"),
("mobilevitv2", "MobileViTV2"),
("mplugdocowl", "mPLUGDocOwl"),
("mpnet", "MPNet"),
("mpt", "MPT"),
("mra", "MRA"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/image_processing_auto.py
@@ -106,6 +106,7 @@
("mobilenet_v2", ("MobileNetV2ImageProcessor",)),
("mobilevit", ("MobileViTImageProcessor",)),
("mobilevitv2", ("MobileViTImageProcessor",)),
("mplugdocowl", ("MPLUGDocOwlImageProcessor",)),
("nat", ("ViTImageProcessor", "ViTImageProcessorFast")),
("nougat", ("NougatImageProcessor",)),
("oneformer", ("OneFormerImageProcessor",)),
2 changes: 2 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -312,6 +312,7 @@
("mega", "MegaForMaskedLM"),
("megatron-bert", "MegatronBertForPreTraining"),
("mobilebert", "MobileBertForPreTraining"),
("mplugdocowl", "MPLUGDocOwlForConditionalGeneration"),
("mpnet", "MPNetForMaskedLM"),
("mpt", "MptForCausalLM"),
("mra", "MraForMaskedLM"),
@@ -711,6 +712,7 @@
("llava", "LlavaForConditionalGeneration"),
("llava-next-video", "LlavaNextVideoForConditionalGeneration"),
("llava_next", "LlavaNextForConditionalGeneration"),
("mplugdocowl", "MPLUGDocOwlForConditionalGeneration"),
("paligemma", "PaliGemmaForConditionalGeneration"),
("pix2struct", "Pix2StructForConditionalGeneration"),
("video_llava", "VideoLlavaForConditionalGeneration"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -76,6 +76,7 @@
("markuplm", "MarkupLMProcessor"),
("mctct", "MCTCTProcessor"),
("mgp-str", "MgpstrProcessor"),
("mplugdocowl", "MPLUGDocOwlProcessor"),
("oneformer", "OneFormerProcessor"),
("owlv2", "Owlv2Processor"),
("owlvit", "OwlViTProcessor"),
107 changes: 107 additions & 0 deletions src/transformers/models/mplugdocowl/__init__.py
@@ -0,0 +1,107 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available


_import_structure = {
"configuration_mplugdocowl": ["MPLUGDocOwlConfig"],
"modeling_mplugdocowl": [
"MPLUGDocOwlAttention",
"MPLUGDocOwlForCausalLM",
"MPLUGDocOwlForConditionalGeneration",
"MPLUGDocOwlHReducer",
"MPLUGDocOwlLanguageModel",
"MPLUGDocOwlPreTrainedLanguageModel",
"MPLUGDocOwlPreTrainedModel",
"MPLUGDocOwlVisionModel",
"MPLUGDocOwlVisionTransformer",
],
"processing_mplugdocowl": ["MPLUGDocOwlProcessor"],
}

try:
if not is_vision_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["image_processing_mplugdocowl"] = ["MPLUGDocOwlImageProcessor"]

try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_mplugdocowl"] = [
"MPLUGDocOwlAttention",
"MPLUGDocOwlForCausalLM",
"MPLUGDocOwlForConditionalGeneration",
"MPLUGDocOwlHReducer",
"MPLUGDocOwlLanguageModel",
"MPLUGDocOwlPreTrainedLanguageModel",
"MPLUGDocOwlPreTrainedModel",
"MPLUGDocOwlVisionModel",
"MPLUGDocOwlVisionTransformer",
]


if TYPE_CHECKING:
from .configuration_mplugdocowl import MPLUGDocOwlConfig
from .modeling_mplugdocowl import (
MPLUGDocOwlAttention,
MPLUGDocOwlForCausalLM,
MPLUGDocOwlForConditionalGeneration,
MPLUGDocOwlHReducer,
MPLUGDocOwlLanguageModel,
MPLUGDocOwlPreTrainedLanguageModel,
MPLUGDocOwlPreTrainedModel,
MPLUGDocOwlVisionModel,
MPLUGDocOwlVisionTransformer,
)
from .processing_mplugdocowl import MPLUGDocOwlProcessor

try:
if not is_vision_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .image_processing_mplugdocowl import MPLUGDocOwlImageProcessor

try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_mplugdocowl import (
MPLUGDocOwlAttention,
MPLUGDocOwlForCausalLM,
MPLUGDocOwlForConditionalGeneration,
MPLUGDocOwlHReducer,
MPLUGDocOwlLanguageModel,
MPLUGDocOwlPreTrainedLanguageModel,
MPLUGDocOwlPreTrainedModel,
MPLUGDocOwlVisionModel,
MPLUGDocOwlVisionTransformer,
)


else:
import sys

sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)