From 5df7f32de496154c39b9a6d0b68ed089f8404916 Mon Sep 17 00:00:00 2001
From: Sungmin Oh
Date: Fri, 20 Sep 2024 16:32:34 +0900
Subject: [PATCH 1/2] docs: ko: model_doc/paligemma.md

---
 docs/source/ko/_toctree.yml           |  2 +
 docs/source/ko/model_doc/paligemma.md | 78 +++++++++++++++++++++++++++
 2 files changed, 80 insertions(+)
 create mode 100644 docs/source/ko/model_doc/paligemma.md

diff --git a/docs/source/ko/_toctree.yml b/docs/source/ko/_toctree.yml
index eafd389994ad52..1757af18168a4a 100644
--- a/docs/source/ko/_toctree.yml
+++ b/docs/source/ko/_toctree.yml
@@ -705,6 +705,8 @@
         title: (번역중) OneFormer
       - local: in_translation
         title: (번역중) OWL-ViT
+      - local: model_doc/paligemma
+        title: PaliGemma
       - local: in_translation
         title: (번역중) Perceiver
       - local: in_translation
diff --git a/docs/source/ko/model_doc/paligemma.md b/docs/source/ko/model_doc/paligemma.md
new file mode 100644
index 00000000000000..db68d53300d80b
--- /dev/null
+++ b/docs/source/ko/model_doc/paligemma.md
@@ -0,0 +1,78 @@
+
+
+# PaliGemma[[paligemma]]
+
+## Overview[[overview]]
+
+The PaliGemma model was proposed in [PaliGemma – Google's Cutting-Edge Open Vision Language Model](https://huggingface.co/blog/paligemma) by Google. It is a 3B vision-language model composed by a [SigLIP](siglip) vision encoder and a [Gemma](gemma) language decoder linked by a multimodal linear projection. It cuts an image into a fixed number of VIT tokens and prepends it to an optional prompt. One particularity is that the model uses full block attention on all the image tokens plus the input text tokens. It comes in 3 resolutions, 224x224, 448x448 and 896x896 with 3 base models, with 55 fine-tuned versions for different tasks, and 2 mix models.
+
+
+
+ PaliGemma architecture. Taken from the blog post.
+
+This model was contributed by [Molbap](https://huggingface.co/Molbap).
+
+## Usage tips[[usage-tips]]
+
+Inference with PaliGemma can be performed as follows:
+
+```python
+from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
+
+model_id = "google/paligemma-3b-mix-224"
+model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
+processor = AutoProcessor.from_pretrained(model_id)
+
+prompt = "What is on the flower?"
+image_file = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg?download=true"
+raw_image = Image.open(requests.get(image_file, stream=True).raw)
+inputs = processor(raw_image, prompt, return_tensors="pt")
+output = model.generate(**inputs, max_new_tokens=20)
+
+print(processor.decode(output[0], skip_special_tokens=True)[len(prompt):])
+```
+
+- PaliGemma is not meant for conversational use, and it works best when fine-tuning to a specific use case. Some downstream tasks on which PaliGemma can be fine-tuned include image captioning, visual question answering (VQA), object detection, referring expression segmentation and document understanding.
+- One can use `PaliGemmaProcessor` to prepare images, text and optional labels for the model. When fine-tuning a PaliGemma model, the `suffix` argument can be passed to the processor which creates the `labels` for the model:
+
+```python
+prompt = "What is on the flower?"
+answer = "a bee"
+inputs = processor(images=raw_image, text=prompt, suffix=answer, return_tensors="pt")
+```
+
+## Resources[[resources]]
+
+A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with PaliGemma. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
+
+- A blog post introducing all the features of PaliGemma can be found [here](https://huggingface.co/blog/paligemma).
+- Demo notebooks on how to fine-tune PaliGemma for VQA with the Trainer API along with inference can be found [here](https://github.com/huggingface/notebooks/tree/main/examples/paligemma).
+- Demo notebooks on how to fine-tune PaliGemma on a custom dataset (receipt image -> JSON) along with inference can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/PaliGemma). 🌎
+
+## PaliGemmaConfig[[transformers.PaliGemmaConfig]]
+
+[[autodoc]] PaliGemmaConfig
+
+## PaliGemmaProcessor[[transformers.PaliGemmaProcessor]]
+
+[[autodoc]] PaliGemmaProcessor
+
+## PaliGemmaForConditionalGeneration[[transformers.PaliGemmaForConditionalGeneration]]
+
+[[autodoc]] PaliGemmaForConditionalGeneration
+    - forward

From 00a7d1d0f7fd47838e9a823b3cc6458996dfd29e Mon Sep 17 00:00:00 2001
From: Sungmin Oh
Date: Fri, 20 Sep 2024 16:59:39 +0900
Subject: [PATCH 2/2] feat: nmt draft

---
 docs/source/ko/model_doc/paligemma.md | 28 ++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/docs/source/ko/model_doc/paligemma.md b/docs/source/ko/model_doc/paligemma.md
index db68d53300d80b..4358dc970b8e08 100644
--- a/docs/source/ko/model_doc/paligemma.md
+++ b/docs/source/ko/model_doc/paligemma.md
@@ -16,20 +16,22 @@ rendered properly in your Markdown viewer.

 # PaliGemma[[paligemma]]

-## Overview[[overview]]
+## 오버뷰[[overview]]

-The PaliGemma model was proposed in [PaliGemma – Google's Cutting-Edge Open Vision Language Model](https://huggingface.co/blog/paligemma) by Google. It is a 3B vision-language model composed by a [SigLIP](siglip) vision encoder and a [Gemma](gemma) language decoder linked by a multimodal linear projection. It cuts an image into a fixed number of VIT tokens and prepends it to an optional prompt. One particularity is that the model uses full block attention on all the image tokens plus the input text tokens. It comes in 3 resolutions, 224x224, 448x448 and 896x896 with 3 base models, with 55 fine-tuned versions for different tasks, and 2 mix models.
+PaliGemma 모델은 구글이 [PaliGemma – Google의 최첨단 오픈 비전 언어 모델](https://huggingface.co/blog/paligemma)에서 제안했습니다. PaliGemma는 [SigLIP](siglip) 비전 인코더와 [Gemma](gemma) 언어 디코더로 구성된 3B 규모의 비전-언어 모델로, 두 구성 요소가 멀티모달 선형 투영으로 연결되어 있습니다. 이 모델은 이미지를 고정된 수의 ViT 토큰으로 분할하고 이를 선택적 프롬프트 앞에 추가하며, 모든 이미지 토큰과 입력 텍스트 토큰에 대해 전체 블록 어텐션을 사용하는 특징을 가지고 있습니다.
+
+PaliGemma는 224x224, 448x448, 896x896의 3가지 해상도로 제공되며, 3개의 기본 모델과 55개의 다양한 작업에 대해 미세 조정된 버전, 그리고 2개의 혼합 모델이 있습니다.

 drawing

- PaliGemma architecture. Taken from the blog post.
+ PaliGemma 아키텍처. 블로그 포스트에서 발췌.

-This model was contributed by [Molbap](https://huggingface.co/Molbap).
+이 모델은 [Molbap](https://huggingface.co/Molbap)이 기여했습니다.

-## Usage tips[[usage-tips]]
+## 사용 팁[[usage-tips]]

-Inference with PaliGemma can be performed as follows:
+PaliGemma의 추론은 다음처럼 수행됩니다:

 ```python
 from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
@@ -47,8 +49,8 @@ output = model.generate(**inputs, max_new_tokens=20)
 print(processor.decode(output[0], skip_special_tokens=True)[len(prompt):])
 ```

-- PaliGemma is not meant for conversational use, and it works best when fine-tuning to a specific use case. Some downstream tasks on which PaliGemma can be fine-tuned include image captioning, visual question answering (VQA), object detection, referring expression segmentation and document understanding.
-- One can use `PaliGemmaProcessor` to prepare images, text and optional labels for the model. When fine-tuning a PaliGemma model, the `suffix` argument can be passed to the processor which creates the `labels` for the model:
+- PaliGemma는 대화용으로 설계되지 않았으며, 특정 사용 사례에 대해 미세 조정할 때 가장 잘 작동합니다. PaliGemma를 미세 조정할 수 있는 몇 가지 하위 작업에는 이미지 캡셔닝, 시각적 질문 답변(VQA), 오브젝트 디텍션, 참조 표현 분할 및 문서 이해가 포함됩니다.
+- `PaliGemmaProcessor`를 사용하여 모델에 필요한 이미지, 텍스트 및 선택적 레이블을 준비할 수 있습니다. PaliGemma 모델을 미세 조정할 때는, 프로세서에 `suffix` 인자를 전달하여 다음처럼 모델의 `labels`를 생성할 수 있습니다:

 ```python
 prompt = "What is on the flower?"
@@ -56,13 +58,13 @@ answer = "a bee"
 inputs = processor(images=raw_image, text=prompt, suffix=answer, return_tensors="pt")
 ```

-## Resources[[resources]]
+## 리소스[[resources]]

-A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with PaliGemma. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
+PaliGemma를 시작하는 데 도움이 되는 Hugging Face 공식 자료와 커뮤니티(🌎로 표시됨) 자료 목록입니다. 여기에 포함될 자료를 제출하고 싶으시다면 PR(Pull Request)을 열어주세요. 리뷰해 드리겠습니다! 자료는 기존 자료를 복제하는 대신 새로운 내용을 담고 있어야 합니다.

-- A blog post introducing all the features of PaliGemma can be found [here](https://huggingface.co/blog/paligemma).
-- Demo notebooks on how to fine-tune PaliGemma for VQA with the Trainer API along with inference can be found [here](https://github.com/huggingface/notebooks/tree/main/examples/paligemma).
-- Demo notebooks on how to fine-tune PaliGemma on a custom dataset (receipt image -> JSON) along with inference can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/PaliGemma). 🌎
+- PaliGemma의 모든 기능을 소개하는 블로그 포스트는 [이곳](https://huggingface.co/blog/paligemma)에서 찾을 수 있습니다.
+- Trainer API를 사용하여 VQA(Visual Question Answering)를 위해 PaliGemma를 미세 조정하는 방법과 추론에 대한 데모 노트북은 [이곳](https://github.com/huggingface/notebooks/tree/main/examples/paligemma)에서 찾을 수 있습니다.
+- 사용자 정의 데이터셋(영수증 이미지 -> JSON)에 대해 PaliGemma를 미세 조정하는 방법과 추론에 대한 데모 노트북은 [이곳](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/PaliGemma)에서 찾을 수 있습니다. 🌎

 ## PaliGemmaConfig[[transformers.PaliGemmaConfig]]