Add Idefics 3! #32473

Merged
merged 53 commits into from
Sep 25, 2024
842a28d
Add Idefics 3!
andimarafioti Aug 6, 2024
afce007
fixes to make both pipelines identical
andimarafioti Aug 7, 2024
3e3b31d
fix for quantized models
andimarafioti Aug 8, 2024
9c8ffc4
First pass at the review
andimarafioti Aug 8, 2024
7e3d7a6
remove vocab size from the main config (it's still in the text_config)
andimarafioti Aug 8, 2024
dd99bca
hot fix for merve
andimarafioti Aug 8, 2024
ddac9ec
Apply suggestions from code review
andimarafioti Aug 9, 2024
188bb76
re-add model_type for text_config
andimarafioti Aug 9, 2024
43fb214
remove support for old_cache
andimarafioti Aug 9, 2024
c9e0d85
remove hidden_size from main config
andimarafioti Aug 9, 2024
1b2b89c
rename idefics3 HF repo
andimarafioti Aug 9, 2024
6ff766f
few changes suggested in the PR
andimarafioti Aug 12, 2024
11c2e1a
fix to input_data_format computation
andimarafioti Aug 12, 2024
c1048ed
remove overwrite of _autoset_attn_implementation following @zucchini-…
andimarafioti Aug 12, 2024
a163564
improve example
andimarafioti Aug 12, 2024
6f0a479
few improvements from amy's review
andimarafioti Aug 12, 2024
8361fce
big change to enable processing input images as numpy arrays
andimarafioti Aug 12, 2024
32970d0
Changes to the code to uniformize processor kwargs
andimarafioti Aug 13, 2024
c504f00
image processing tests
andimarafioti Aug 13, 2024
a914e41
image processing tests fixes and some bugs they discovered
andimarafioti Aug 13, 2024
6722d13
addressed review comments from Yoni
andimarafioti Aug 13, 2024
0533eda
fix modeling tests
andimarafioti Aug 13, 2024
b034091
remove special tokens that are not special
andimarafioti Aug 15, 2024
47fb7ce
fixes tests
andimarafioti Aug 15, 2024
4032a6f
skip failing tests - they also fail for idefics2
andimarafioti Aug 21, 2024
757e834
added paper and readded the tests with multi gpu, who knows
andimarafioti Aug 27, 2024
7797279
Update docs/source/en/model_doc/idefics3.md
andimarafioti Aug 30, 2024
b478124
Apply suggestions from code review
andimarafioti Aug 30, 2024
ada6219
review amy until image_processing_idefics3
andimarafioti Aug 30, 2024
164fbe8
last comments from Amy
andimarafioti Sep 2, 2024
000c8ea
review amy
andimarafioti Sep 6, 2024
4d02e0c
Update src/transformers/models/idefics3/image_processing_idefics3.py
andimarafioti Sep 4, 2024
3bf03c2
Update src/transformers/models/idefics3/modeling_idefics3.py
andimarafioti Sep 4, 2024
57bfd51
Update docs/source/en/model_doc/idefics3.md
andimarafioti Sep 6, 2024
63b1d7f
doc improvement - amy review
andimarafioti Sep 6, 2024
6325fbc
fix runtime error during fine-tuning
andimarafioti Sep 10, 2024
76b8892
amy's review
andimarafioti Sep 16, 2024
9a20306
Update src/transformers/models/idefics3/image_processing_idefics3.py
andimarafioti Sep 16, 2024
3129920
Update src/transformers/models/idefics3/image_processing_idefics3.py
andimarafioti Sep 16, 2024
e1a10b3
Update src/transformers/models/idefics3/modeling_idefics3.py
andimarafioti Sep 16, 2024
4c3756f
ruff
andimarafioti Sep 16, 2024
fbaf07e
amy's comment on the order
andimarafioti Sep 16, 2024
87fa179
ruff ruff
andimarafioti Sep 17, 2024
23d4cf8
fix copies
andimarafioti Sep 17, 2024
9e925b9
square images when they are not splitted
andimarafioti Sep 17, 2024
215b636
ruff :(
andimarafioti Sep 17, 2024
2967974
Update src/transformers/models/idefics3/image_processing_idefics3.py
andimarafioti Sep 18, 2024
ee041bf
Update tests/models/idefics3/test_processing_idefics3.py
andimarafioti Sep 18, 2024
4aad266
fix small bug introduced in refactor
andimarafioti Sep 18, 2024
f1ae8ae
amy's image processing changes
andimarafioti Sep 19, 2024
39d88b2
fixes peft tests and ruff
andimarafioti Sep 19, 2024
383f0db
modify to_pil_image from transformers. and review from emanuele.
andimarafioti Sep 23, 2024
682b82b
add modified to_pil_image
andimarafioti Sep 23, 2024
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -820,6 +820,8 @@
title: IDEFICS
- local: model_doc/idefics2
title: Idefics2
- local: model_doc/idefics3
title: Idefics3
- local: model_doc/instructblip
title: InstructBLIP
- local: model_doc/instructblipvideo
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -168,6 +168,7 @@ Flax), PyTorch, and/or TensorFlow.
| [I-BERT](model_doc/ibert) | ✅ | ❌ | ❌ |
| [IDEFICS](model_doc/idefics) | ✅ | ✅ | ❌ |
| [Idefics2](model_doc/idefics2) | ✅ | ❌ | ❌ |
| [Idefics3](model_doc/idefics3) | ✅ | ❌ | ❌ |
| [ImageGPT](model_doc/imagegpt) | ✅ | ❌ | ❌ |
| [Informer](model_doc/informer) | ✅ | ❌ | ❌ |
| [InstructBLIP](model_doc/instructblip) | ✅ | ❌ | ❌ |
73 changes: 73 additions & 0 deletions docs/source/en/model_doc/idefics3.md
@@ -0,0 +1,73 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Idefics3

## Overview

The Idefics3 model was proposed in [Building and better understanding vision-language models: insights and future directions](https://huggingface.co/papers/2408.12637) by Hugo Laurençon, Andrés Marafioti, Victor Sanh, and Léo Tronchon.

Idefics3 is an adaptation of the Idefics2 model with three main differences:

- It uses Llama3 for the text model.
- It uses an updated processing logic for the images.
- It removes the perceiver.

The abstract from the paper is the following:

*The field of vision-language models (VLMs), which take images and texts as inputs and output texts, is rapidly evolving and has yet to reach consensus on several key aspects of the development pipeline, including data, architecture, and training methods. This paper can be seen as a tutorial for building a VLM. We begin by providing a comprehensive overview of the current state-of-the-art approaches, highlighting the strengths and weaknesses of each, addressing the major challenges in the field, and suggesting promising research directions for underexplored areas. We then walk through the practical steps to build Idefics3-8B, a powerful VLM that significantly outperforms its predecessor Idefics2-8B, while being trained efficiently, exclusively on open datasets, and using a straightforward pipeline. These steps include the creation of Docmatix, a dataset for improving document understanding capabilities, which is 240 times larger than previously available datasets. We release the model along with the datasets created for its training.*

## Usage tips

Input images are processed either by upsampling (if resizing is enabled) or at their original resolution. The resizing behavior depends on two parameters: `do_resize` and `size`.

If `do_resize` is set to `True`, the model resizes images so that the longest edge is 4*364 pixels by default.
The default resizing behavior can be customized by passing a dictionary to the `size` parameter. For example, `{"longest_edge": 4 * 364}` is the default, but you can change it to a different value if needed.

Here’s how to control resizing and set a custom size:
```python
from transformers import Idefics3ImageProcessor
image_processor = Idefics3ImageProcessor(do_resize=True, size={"longest_edge": 2 * 364}, max_image_size=364)
```

Additionally, the `max_image_size` parameter, which controls the size of each square patch the image is decomposed into, is set to 364 by default but can be adjusted as needed. After resizing (if applicable), the image processor decomposes the images into square patches based on the `max_image_size` parameter.
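
The resizing and patch arithmetic described above can be sketched in plain Python. This is only an illustration of the numbers involved, assuming aspect-ratio-preserving scaling and a ceiling when counting patches; `resized_shape` and `patch_grid` are hypothetical helpers, not functions of `Idefics3ImageProcessor`, and the real processor may round, pad, or add extra images differently.

```python
import math


def resized_shape(height, width, longest_edge=4 * 364):
    """Scale (height, width) so the longer side equals `longest_edge`, keeping the aspect ratio."""
    scale = longest_edge / max(height, width)
    return round(height * scale), round(width * scale)


def patch_grid(height, width, patch_size=364):
    """Number of rows and columns of square patches needed to cover the image."""
    return math.ceil(height / patch_size), math.ceil(width / patch_size)


# A 1000x500 image with the default settings quoted in the text.
height, width = resized_shape(1000, 500)
rows, cols = patch_grid(height, width)
print((height, width), (rows, cols))
```

Under these assumptions, halving `longest_edge` to `2 * 364` halves each side of the grid and so cuts the number of patches by roughly a factor of four, which is the main lever for trading resolution against sequence length.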

This model was contributed by [amyeroberts](https://huggingface.co/amyeroberts) and [andimarafioti](https://huggingface.co/andito).


## Idefics3Config

[[autodoc]] Idefics3Config


## Idefics3Model

[[autodoc]] Idefics3Model
- forward

## Idefics3ForConditionalGeneration

[[autodoc]] Idefics3ForConditionalGeneration
- forward


## Idefics3ImageProcessor
[[autodoc]] Idefics3ImageProcessor
- preprocess


## Idefics3Processor
[[autodoc]] Idefics3Processor
- __call__
1 change: 1 addition & 0 deletions docs/source/en/perf_infer_gpu_one.md
@@ -53,6 +53,7 @@ FlashAttention-2 is currently supported for the following architectures:
* [GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj#transformers.GPTJModel)
* [Granite](https://huggingface.co/docs/transformers/model_doc/granite#transformers.GraniteModel)
* [Idefics2](https://huggingface.co/docs/transformers/model_doc/idefics2#transformers.Idefics2Model)
* [Idefics3](https://huggingface.co/docs/transformers/model_doc/idefics3#transformers.Idefics3Model)
* [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon#transformers.FalconModel)
* [JetMoe](https://huggingface.co/docs/transformers/model_doc/jetmoe#transformers.JetMoeModel)
* [Jamba](https://huggingface.co/docs/transformers/model_doc/jamba#transformers.JambaModel)
18 changes: 18 additions & 0 deletions src/transformers/__init__.py
@@ -480,6 +480,7 @@
"models.ibert": ["IBertConfig"],
"models.idefics": ["IdeficsConfig"],
"models.idefics2": ["Idefics2Config"],
"models.idefics3": ["Idefics3Config"],
"models.imagegpt": ["ImageGPTConfig"],
"models.informer": ["InformerConfig"],
"models.instructblip": [
@@ -1180,6 +1181,7 @@
_import_structure["models.grounding_dino"].extend(["GroundingDinoImageProcessor"])
_import_structure["models.idefics"].extend(["IdeficsImageProcessor"])
_import_structure["models.idefics2"].extend(["Idefics2ImageProcessor"])
_import_structure["models.idefics3"].extend(["Idefics3ImageProcessor"])
_import_structure["models.imagegpt"].extend(["ImageGPTFeatureExtractor", "ImageGPTImageProcessor"])
_import_structure["models.instructblipvideo"].extend(["InstructBlipVideoImageProcessor"])
_import_structure["models.layoutlmv2"].extend(["LayoutLMv2FeatureExtractor", "LayoutLMv2ImageProcessor"])
@@ -2401,6 +2403,14 @@
"Idefics2Processor",
]
)
_import_structure["models.idefics3"].extend(
[
"Idefics3ForConditionalGeneration",
"Idefics3Model",
"Idefics3PreTrainedModel",
"Idefics3Processor",
]
)
_import_structure["models.imagegpt"].extend(
[
"ImageGPTForCausalImageModeling",
@@ -5247,6 +5257,7 @@
IdeficsConfig,
)
from .models.idefics2 import Idefics2Config
from .models.idefics3 import Idefics3Config
from .models.imagegpt import ImageGPTConfig
from .models.informer import InformerConfig
from .models.instructblip import (
@@ -5983,6 +5994,7 @@
from .models.grounding_dino import GroundingDinoImageProcessor
from .models.idefics import IdeficsImageProcessor
from .models.idefics2 import Idefics2ImageProcessor
from .models.idefics3 import Idefics3ImageProcessor
from .models.imagegpt import ImageGPTFeatureExtractor, ImageGPTImageProcessor
from .models.instructblipvideo import InstructBlipVideoImageProcessor
from .models.layoutlmv2 import (
@@ -7011,6 +7023,12 @@
Idefics2PreTrainedModel,
Idefics2Processor,
)
from .models.idefics3 import (
Idefics3ForConditionalGeneration,
Idefics3Model,
Idefics3PreTrainedModel,
Idefics3Processor,
)
from .models.imagegpt import (
ImageGPTForCausalImageModeling,
ImageGPTForImageClassification,
5 changes: 4 additions & 1 deletion src/transformers/image_transforms.py
@@ -162,6 +162,7 @@ def _rescale_for_pil_conversion(image):
def to_pil_image(
image: Union[np.ndarray, "PIL.Image.Image", "torch.Tensor", "tf.Tensor", "jnp.ndarray"],
do_rescale: Optional[bool] = None,
image_mode: Optional[str] = None,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
) -> "PIL.Image.Image":
"""
@@ -175,6 +176,8 @@ def to_pil_image(
Whether or not to apply the scaling factor (to make pixel values integers between 0 and 255). Will default
to `True` if the image type is a floating type and casting to `int` would result in a loss of precision,
and `False` otherwise.
image_mode (`str`, *optional*):
The mode to use for the PIL image. If unset, will use the default mode for the input image type.
input_data_format (`ChannelDimension`, *optional*):
The channel dimension format of the input image. If unset, will use the inferred format from the input.

@@ -207,7 +210,7 @@ def to_pil_image(
image = rescale(image, 255)

image = image.astype(np.uint8)
- return PIL.Image.fromarray(image)
+ return PIL.Image.fromarray(image, mode=image_mode)


# Logic adapted from torchvision resizing logic: https://github.com/pytorch/vision/blob/511924c1ced4ce0461197e5caa64ce5b9e558aab/torchvision/transforms/functional.py#L366
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -114,6 +114,7 @@
ibert,
idefics,
idefics2,
idefics3,
imagegpt,
informer,
instructblip,
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -131,6 +131,7 @@
("ibert", "IBertConfig"),
("idefics", "IdeficsConfig"),
("idefics2", "Idefics2Config"),
("idefics3", "Idefics3Config"),
("imagegpt", "ImageGPTConfig"),
("informer", "InformerConfig"),
("instructblip", "InstructBlipConfig"),
@@ -425,6 +426,7 @@
("ibert", "I-BERT"),
("idefics", "IDEFICS"),
("idefics2", "Idefics2"),
("idefics3", "Idefics3"),
("imagegpt", "ImageGPT"),
("informer", "Informer"),
("instructblip", "InstructBLIP"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/image_processing_auto.py
@@ -89,6 +89,7 @@
("hiera", ("BitImageProcessor",)),
("idefics", ("IdeficsImageProcessor",)),
("idefics2", ("Idefics2ImageProcessor",)),
("idefics3", ("Idefics3ImageProcessor",)),
("imagegpt", ("ImageGPTImageProcessor",)),
("instructblip", ("BlipImageProcessor",)),
("instructblipvideo", ("InstructBlipVideoImageProcessor",)),
3 changes: 3 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -128,6 +128,7 @@
("ibert", "IBertModel"),
("idefics", "IdeficsModel"),
("idefics2", "Idefics2Model"),
("idefics3", "Idefics3Model"),
("imagegpt", "ImageGPTModel"),
("informer", "InformerModel"),
("jamba", "JambaModel"),
@@ -311,6 +312,7 @@
("ibert", "IBertForMaskedLM"),
("idefics", "IdeficsForVisionText2Text"),
("idefics2", "Idefics2ForConditionalGeneration"),
("idefics3", "Idefics3ForConditionalGeneration"),
("layoutlm", "LayoutLMForMaskedLM"),
("llava", "LlavaForConditionalGeneration"),
("llava_next", "LlavaNextForConditionalGeneration"),
@@ -725,6 +727,7 @@
("chameleon", "ChameleonForConditionalGeneration"),
("git", "GitForCausalLM"),
("idefics2", "Idefics2ForConditionalGeneration"),
("idefics3", "Idefics3ForConditionalGeneration"),
("instructblip", "InstructBlipForConditionalGeneration"),
("instructblipvideo", "InstructBlipVideoForConditionalGeneration"),
("kosmos-2", "Kosmos2ForConditionalGeneration"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -65,6 +65,7 @@
("hubert", "Wav2Vec2Processor"),
("idefics", "IdeficsProcessor"),
("idefics2", "Idefics2Processor"),
("idefics3", "Idefics3Processor"),
("instructblip", "InstructBlipProcessor"),
("instructblipvideo", "InstructBlipVideoProcessor"),
("kosmos-2", "Kosmos2Processor"),
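
The auto-class registrations in this PR all share one shape: a `model_type` string keyed to a class name. A minimal sketch of that lookup, assuming plain `OrderedDict` tables and a hypothetical `resolve_class_name` helper (neither is the actual transformers code, which resolves names to classes lazily):

```python
from collections import OrderedDict

# Entries copied from the diff; the real mappings contain many more models.
CONFIG_MAPPING_NAMES = OrderedDict(
    [
        ("idefics2", "Idefics2Config"),
        ("idefics3", "Idefics3Config"),
    ]
)
PROCESSOR_MAPPING_NAMES = OrderedDict(
    [
        ("idefics2", "Idefics2Processor"),
        ("idefics3", "Idefics3Processor"),
    ]
)


def resolve_class_name(model_type, mapping):
    """Hypothetical helper: return the class name registered for `model_type`."""
    try:
        return mapping[model_type]
    except KeyError:
        raise ValueError(f"Unrecognized model type: {model_type!r}") from None


print(resolve_class_name("idefics3", CONFIG_MAPPING_NAMES))
print(resolve_class_name("idefics3", PROCESSOR_MAPPING_NAMES))
```

Keeping names as strings rather than imported classes is what lets a table list every model without importing any of them up front.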
1 change: 1 addition & 0 deletions src/transformers/models/auto/tokenization_auto.py
@@ -219,6 +219,7 @@
("ibert", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)),
("idefics", (None, "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("idefics2", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("idefics3", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("instructblip", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("instructblipvideo", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
(
2 changes: 1 addition & 1 deletion src/transformers/models/idefics2/modeling_idefics2.py
@@ -1097,7 +1097,7 @@ class Idefics2PreTrainedModel(PreTrainedModel):

def _init_weights(self, module):
std = (
- self.config.text_config.initializer_range
+ self.config.initializer_range
if hasattr(self.config, "initializer_range")
else self.config.text_config.initializer_range
)
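
Before this one-line change, both branches of the conditional read `text_config.initializer_range`, so the `hasattr` check had no effect. The corrected fallback can be sketched on its own, using `SimpleNamespace` objects as stand-ins for real config classes and a hypothetical `pick_init_std` helper:

```python
from types import SimpleNamespace


def pick_init_std(config):
    """Prefer a top-level `initializer_range`; otherwise fall back to the text config's value."""
    if hasattr(config, "initializer_range"):
        return config.initializer_range
    return config.text_config.initializer_range


# Stand-in configs (assumed values, for illustration only).
top_level = SimpleNamespace(
    initializer_range=0.01,
    text_config=SimpleNamespace(initializer_range=0.02),
)
nested_only = SimpleNamespace(text_config=SimpleNamespace(initializer_range=0.02))

print(pick_init_std(top_level), pick_init_std(nested_only))
```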
72 changes: 72 additions & 0 deletions src/transformers/models/idefics3/__init__.py
@@ -0,0 +1,72 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available


_import_structure = {"configuration_idefics3": ["Idefics3Config"]}


try:
    if not is_vision_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["image_processing_idefics3"] = ["Idefics3ImageProcessor"]


try:
    if not is_torch_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["modeling_idefics3"] = [
        "Idefics3ForConditionalGeneration",
        "Idefics3PreTrainedModel",
        "Idefics3Model",
    ]
    _import_structure["processing_idefics3"] = ["Idefics3Processor"]

if TYPE_CHECKING:
    from .configuration_idefics3 import Idefics3Config

    try:
        if not is_vision_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .image_processing_idefics3 import Idefics3ImageProcessor

    try:
        if not is_torch_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .modeling_idefics3 import (
            Idefics3ForConditionalGeneration,
            Idefics3Model,
            Idefics3PreTrainedModel,
        )
        from .processing_idefics3 import Idefics3Processor


else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
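
The `_import_structure` dictionary above maps each submodule to the names it exports, and the module is then swapped for a lazy proxy so that heavy submodules load only on first use. A simplified sketch of that idea follows. It assumes a hypothetical `LazyModule` class built on `types.ModuleType` and uses standard-library modules in place of the package's relative submodules; it is not the implementation of transformers' `_LazyModule`.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Hypothetical, simplified lazy module: attributes are imported on first access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Invert {"submodule": ["Attr", ...]} into {"Attr": "submodule"}.
        self._attr_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }

    def __getattr__(self, attr):
        # Python calls this only when normal attribute lookup fails.
        if attr not in self._attr_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(self._attr_to_module[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups bypass __getattr__
        return value


# Standard-library modules stand in for submodules such as "modeling_idefics3".
lazy = LazyModule("lazy_demo", {"json": ["dumps", "loads"], "math": ["sqrt"]})
print(lazy.sqrt(9.0))
```

Nothing is imported until `lazy.sqrt` is touched; after that the resolved object is cached on the module, so repeated access costs no more than an ordinary attribute lookup.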