
Adding new zero-shot examples #32483

Draft · wants to merge 72 commits into base: main
Conversation

SangbumChoi (Contributor)
What does this PR do?

Fixes #32459

The following two PRs should be merged before merging this example:
#31828
#31964

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

EduardoPach and others added 30 commits June 25, 2024 13:54
@SangbumChoi SangbumChoi marked this pull request as ready for review August 8, 2024 08:10
SangbumChoi (Contributor, Author) commented Aug 8, 2024

@qubvel Hi Pavel, this PR is ready for review. However, there is one problem with using the evaluation process in Trainer.

In order to use the Trainer from transformers, we have to pass input_ids (from the tokenizer output) to the post-processing pipeline. However, when include_inputs_for_metrics is set in TrainingArguments, the Trainer retrieves the inputs via the model's main_input_name attribute, which for this model is pixel_values. This is hard-coded, so I cannot get input_ids as the input. Also cc @muellerzr.

main_input_name = getattr(self.model, "main_input_name", "input_ids")

There are three ways to address this:

  1. Add an additional argument to set the name.
  2. Change the argument order of forward() (not preferred).
  3. Change it to main_input_name = getattr(self.model, "input_ids", "input_ids").

WDYT?

qubvel (Member) commented Aug 8, 2024

Hi @SangbumChoi, thanks for working on this!

Is there any standard batch size that Hugging Face suggests? (In my experience the maximum batch size was 2 with 24 GB.) The reason I am asking is that, unlike standard DETR, GroundingDino cannot handle batch size 8 on a normal single GPU.

I think it is fine to have batch size 2 if the model can be trained with such a batch size. In the README and a comment we can mention how much memory one would need to run the script, and also advise using gradient accumulation, lower precision, and a multi-GPU setup (if it is supported).

qubvel (Member) commented Aug 8, 2024

However, there is one problem with using the evaluation process in Trainer.
In order to use the Trainer from transformers, we have to pass input_ids (from the tokenizer output) to the post-processing pipeline. However, when include_inputs_for_metrics is set in TrainingArguments, the Trainer retrieves the inputs via the model's main_input_name, which for this model is pixel_values. This is hard-coded, so I cannot get input_ids as the input.

We can probably move these lines

main_input_name = getattr(self.model, "main_input_name", "input_ids")
inputs_decode = self._prepare_input(inputs[main_input_name]) if args.include_inputs_for_metrics else None

to a separate method of Trainer, for example _get_inputs(...). Then we will be able to override it in a new ZeroShotTrainer class in the script to return the required inputs.
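The suggested refactor could look roughly like the sketch below. Trainer here is a minimal stand-in so the pattern is runnable without transformers, and the names ZeroShotTrainer and _get_inputs are assumptions taken from the discussion, not existing transformers API:

```python
# Minimal sketch of the proposed refactor: the two hard-coded lines move into
# an overridable hook. This Trainer is a stub, not the transformers Trainer.
class Trainer:
    def __init__(self, model):
        self.model = model

    def _prepare_input(self, data):
        # The real Trainer moves tensors to the right device here.
        return data

    def _get_inputs(self, inputs, include_inputs_for_metrics):
        if not include_inputs_for_metrics:
            return None
        # Current behavior: always use the model's main input, which is
        # pixel_values for detection models.
        main_input_name = getattr(self.model, "main_input_name", "input_ids")
        return self._prepare_input(inputs[main_input_name])


class ZeroShotTrainer(Trainer):
    def _get_inputs(self, inputs, include_inputs_for_metrics):
        if not include_inputs_for_metrics:
            return None
        # Zero-shot detection post-processing needs the tokenized text
        # prompt, not pixel_values.
        return self._prepare_input(inputs["input_ids"])
```

With a model whose main_input_name is "pixel_values", the base hook returns pixel_values while the override returns input_ids, which is exactly what compute_metrics would need here.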

muellerzr (Contributor)

agreed re: get_inputs

SangbumChoi (Contributor, Author)

Let me work on Pavel's suggestion and ping again. Thanks for the direction!

@SangbumChoi SangbumChoi marked this pull request as draft August 8, 2024 13:54
SangbumChoi (Contributor, Author)

@qubvel Hi, requesting a first review. Below are the changes from the original object_detection.py:

  1. Instead of a wandb logging graph, I added a table of GPU consumption (since there is no other architecture to compare against).
  2. Also, as you might know, calculating mAP from zero-shot output is not accurate. I have made a workaround: if the output text is directly in label2id, it is matched; otherwise it is treated as background (class 0). (We can discuss this if it seems inappropriate.) See def convert_zero_shot_to_coco_format.
  3. Added ZeroShotTrainer to override Trainer with def _get_input_by_name.
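The workaround in point 2 could be sketched as follows. convert_zero_shot_to_coco_format is the script's function name, but this body is an assumption reconstructed from the description above, not the actual implementation:

```python
# Hypothetical sketch: map the free-text labels produced by zero-shot
# post-processing back to COCO-style class ids. Any predicted text that is
# not an exact key of label2id is treated as background (id 0).
def convert_zero_shot_to_coco_format(post_processed_output, label2id):
    for result in post_processed_output:
        class_ids = []
        for text_label in result["labels"]:
            # Exact string match against the known label set; everything
            # else falls back to the background id.
            class_ids.append(label2id.get(text_label, 0))
        result["labels"] = class_ids
    return post_processed_output
```

For example, with label2id = {"cat": 1, "dog": 2}, a predicted label list ["cat", "giraffe", "dog"] would become [1, 0, 2].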

Also, kudos to @EduardoPach for backbone_freeze and text_backbone_freeze, which make the output more stable :)

Before finetune: (screenshot, 2024-08-09 2:15:04 PM)
After finetune: (screenshot, 2024-08-09 2:15:15 PM)

@SangbumChoi SangbumChoi marked this pull request as ready for review August 9, 2024 05:59
qubvel (Member) left a comment

Hi @SangbumChoi, thanks for iterating! Great work!

I left a comment regarding Trainer; I also don't know if it deserves a separate PR. Overall it looks very good to me, but I haven't done an in-depth review yet (let's wait for the loss PR to be merged). For these examples, tests have to be added, plus information in the README (see files in this directory). Also, feel free to add resources to the Grounding DINO docs. Btw, did you try to fine-tune OwlV2?

src/transformers/trainer.py (review comment, outdated, resolved)
SangbumChoi (Contributor, Author) commented Aug 9, 2024

@qubvel Sounds good. I can wait until the loss PR gets merged.

Btw, did you try to fine-tune OwlV2?

I wasn't aware that we could fine-tune OwlV2; let me look into it while waiting for the loss PR!

@SangbumChoi SangbumChoi marked this pull request as draft August 11, 2024 10:50
post_processed_output = processor.post_process_grounded_object_detection(
    output, input_ids, box_threshold=box_threshold, text_threshold=text_threshold, target_sizes=target_sizes
)
post_processed_output = convert_zero_shot_to_coco_format(post_processed_output, label2id)

@SangbumChoi Not sure if this will still work if random_text_prompt=True, because the class_labels from targets will not necessarily follow label2id.
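The concern can be illustrated with a toy example. random_text_prompt and the shuffling behavior are assumptions about how the script builds its text prompts, not verified details:

```python
# Toy illustration of the reviewer's concern: if the class names are
# randomly shuffled when the text prompt is built, indices derived from the
# prompt order no longer agree with the fixed label2id mapping.
label2id = {"cat": 0, "dog": 1, "bird": 2}

# With random_text_prompt=True the prompt might be built as "dog. bird. cat."
shuffled_prompt_order = ["dog", "bird", "cat"]

# A class_labels entry assigned from the shuffled prompt position would
# give "dog" index 0, while label2id["dog"] is 1 -- a silent mismatch.
index_in_prompt = shuffled_prompt_order.index("dog")
assert index_in_prompt != label2id["dog"]
```

So any mapping back to label2id would need to use the per-sample prompt order, not the global label table.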

Successfully merging this pull request may close these issues.

Zero-shot finetuning examples