
Fix: some bugs when text input reaches max_tokens of language_model #11669

Open
wants to merge 2 commits into dev-3.x
Conversation


@jiangtann jiangtann commented Apr 26, 2024

Fix 1: RandomSamplingNegPos forgets to remove gt_ignore_flags

In https://github.com/open-mmlab/mmdetection/blob/dev-3.x/mmdet/datasets/transforms/formatting.py#L109, valid_idx is computed as valid_idx = np.where(results['gt_ignore_flags'] == 0)[0] and then used to index the valid boxes out of gt_bboxes.

In https://github.com/open-mmlab/mmdetection/blob/dev-3.x/mmdet/datasets/transforms/text_transformers.py#L62, if the total token count of the positive labels exceeds 256, a random subset of gt_bboxes and gt_labels is dropped, but gt_ignore_flags is not updated accordingly. As a result, indexing gt_bboxes with valid_idx in PackDetInputs raises an index-out-of-bounds error.

This PR modifies RandomSamplingNegPos to handle gt_ignore_flags the same way RandomCrop does, as sketched below.
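A minimal sketch of the idea (field names follow mmdetection's results dict; the helper name and `keep_idxs` are illustrative, not the PR's exact diff):

```python
import numpy as np

def _filter_kept_boxes(results: dict, keep_idxs: np.ndarray) -> dict:
    """Drop sampled-out boxes while keeping every per-box field aligned."""
    results['gt_bboxes'] = results['gt_bboxes'][keep_idxs]
    results['gt_bboxes_labels'] = results['gt_bboxes_labels'][keep_idxs]
    # The missing piece: PackDetInputs later computes
    # valid_idx = np.where(results['gt_ignore_flags'] == 0)[0],
    # so the flags must shrink together with the boxes, exactly as
    # RandomCrop does, or valid_idx indexes out of bounds.
    results['gt_ignore_flags'] = results['gt_ignore_flags'][keep_idxs]
    return results
```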

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@jiangtann jiangtann changed the title from "Fix: RandomSamplingNegPos forget to remove gt_ignore_flags" to "Fix: some bugs when text input reaches max_tokens of language_model" on May 27, 2024
@jiangtann jiangtann (Author) commented May 27, 2024

Fix 2: adding special tokens results in token overflow

After RandomSamplingNegPos, tokenizing the text with self.tokenizer.tokenize can yield up to 256 tokens.

This happens because, at https://github.com/open-mmlab/mmdetection/blob/main/mmdet/datasets/transforms/text_transformers.py#L50, the loop does not break when a positive_label brings the length to exactly 256 tokens; that label is kept. But when the text is later processed with self.language_model.forward or self.language_model.tokenizer.__call__, the special tokens [CLS] and [SEP] are added at the beginning and end by default, so the text can reach 258 tokens.

When this passes through BERT, the 258-token list is truncated because max_tokens is set, and the final '.' (token id 1012) and '[SEP]' (token id 102) are lost. As a result, in https://github.com/open-mmlab/mmdetection/blob/main/mmdet/models/language_models/bert.py#L59, the attention_mask and position_ids of the last class are computed incorrectly: its position_ids are all zeros, i.e. the tokens belonging to the last class no longer attend to each other. A toy reproduction is sketched below.
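A toy reproduction of the truncation (a sketch: the budget is shrunk from 256 to 4 for readability, the max_tokens cut is modeled as a hard slice as described above, and the token ids in the comments assume bert-base-uncased):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

ids = tokenizer('cat. dog.')['input_ids']
print(ids)  # [101, 4937, 1012, 3899, 1012, 102] -> [CLS] cat . dog . [SEP]

# Model the max_tokens cut as a hard slice two tokens short of the full length.
max_tokens = len(ids) - 2
print(ids[:max_tokens])  # [101, 4937, 1012, 3899]: the final '.' and [SEP]
                         # are gone, so the last class ('dog') is never
                         # closed by a separator and its sub-sentence
                         # attention_mask / position_ids come out wrong.
```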

There are two ways to fix this. One is to cap the text at 254 tokens in RandomSamplingNegPos, so that adding the two special tokens can no longer overflow.

The other, which this PR adopts, is to not add the special tokens at all. Since use_sub_sentence_represent is used, adding or omitting the special tokens has no effect on the final text embedding, and omitting them frees up two tokens. My change to bert.py keeps adding the special tokens by default, so existing behavior stays compatible. The idea is sketched below.
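A sketch of the adopted fix using the Hugging Face tokenizer's add_special_tokens flag (the caption text is illustrative; the actual change lives in bert.py and RandomSamplingNegPos):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = 'cat. dog. person.'  # illustrative caption built from class names

# Default behaviour: [CLS]/[SEP] are added, so the caption body may only
# occupy max_tokens - 2 positions before overflowing.
with_special = tokenizer(text)['input_ids']

# The PR's approach: skip the special tokens so the caption itself can use
# the full max_tokens budget. With use_sub_sentence_represent the resulting
# text embedding is unchanged.
without_special = tokenizer(text, add_special_tokens=False)['input_ids']

assert len(with_special) == len(without_special) + 2
```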
