
Ask about the image size #332

Open
wfz666 opened this issue Sep 25, 2024 · 1 comment
wfz666 commented Sep 25, 2024

There is a question that has been bothering me for a long time. Why does the Segment Anything v1 code set a fixed input image size of 1024×1024? Since it is transformer-based, shouldn't it support images of any size?


heyoeyo commented Sep 26, 2024

The main limitations on input sizing for vision transformers come from the patch embedding step and the positional encodings applied to the image tokens. Both SAMv1 & v2 use learned position encodings, which exist only for a single input size, which is probably why the models are hard-coded to that size.
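As a minimal sketch of why that's a problem (hypothetical layer names and dimensions, not the actual SAM code), a learned embedding table has one entry per patch position, so the token grid of any other input size simply doesn't line up with it:

```python
import torch
import torch.nn as nn

# Hypothetical ViT-style front end: patchify with a strided conv, then add a
# learned positional embedding. Dimensions here are illustrative, not SAM's.
patch_size, embed_dim, img_size = 16, 256, 1024

patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
# One learned vector per patch position: 1024 / 16 = 64 positions per side.
pos_embed = nn.Parameter(
    torch.zeros(1, img_size // patch_size, img_size // patch_size, embed_dim)
)

x = torch.randn(1, 3, 512, 512)              # a non-default input size
tokens = patch_embed(x).permute(0, 2, 3, 1)  # -> (1, 32, 32, 256)
# tokens + pos_embed would raise a shape error: (1, 32, 32, 256) vs.
# (1, 64, 64, 256), so the encoder only works at its training resolution.
```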

That being said, it's common to up/down-scale the position encodings to support different input sizes. The v2 model already does this, but the image segmentation code needs a slight modification to properly support it (see issue #138); the video segmentation supports it without modification (see issue #257). The v1 model can also be modified to support different input sizes, and in fact it seems more robust to changes in input resolution than the v2 model.
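The usual way to do that rescaling is to bilinearly interpolate the learned embedding table to match the new token grid. A sketch, continuing the hypothetical shapes from the snippet above (this is not the exact code used in either repo):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_hw: tuple) -> torch.Tensor:
    """Bilinearly rescale a (1, H, W, C) learned position embedding to new_hw."""
    pe = pos_embed.permute(0, 3, 1, 2)  # (1, C, H, W), as F.interpolate expects
    pe = F.interpolate(pe, size=new_hw, mode="bilinear", align_corners=False)
    return pe.permute(0, 2, 3, 1)       # back to (1, H', W', C)

# Rescale to match the 32x32 token grid of a 512x512 input:
tokens = tokens + resize_pos_embed(pos_embed, (32, 32))
```

Because the embedding was never trained at the interpolated positions this is an approximation, which is one reason accuracy can drift as you move away from the native resolution.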
