A question has been bothering me for a long time: why does the Segment Anything 1 code fix the input image size at 1024*1024? It is based on a transformer, which should support images of any size.
The main limitations on input sizing for vision transformers come from the patch embedding step and the positional encodings applied to the image tokens. Both SAMv1 & v2 use learned position encodings, which exist only for a single input size; that's probably why they're hard-coded for that size.
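Here's a minimal sketch of that pattern (illustrative, not the actual SAM source): the learned position encoding is a `Parameter` whose spatial shape is fixed when the model is built, so the forward pass only works for one token-grid size.

```python
import torch
import torch.nn as nn

class PatchEmbedWithLearnedPosEnc(nn.Module):
    """ViT-style patch embedding with a learned position encoding.
    The pos encoding's spatial shape is fixed at construction, which
    is what ties the model to a single input resolution."""

    def __init__(self, img_size=1024, patch_size=16, embed_dim=768):
        super().__init__()
        grid = img_size // patch_size  # 1024 / 16 = 64 tokens per side
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learned for exactly a (grid x grid) token grid; no other size exists
        self.pos_embed = nn.Parameter(torch.zeros(1, grid, grid, embed_dim))

    def forward(self, x):
        x = self.proj(x).permute(0, 2, 3, 1)  # (B, H/16, W/16, C)
        # This addition fails for any input whose token grid isn't 64x64
        return x + self.pos_embed
```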
That being said, it's common to up/down-scale the position encodings to support different input sizes. The v2 model already does this, but the image segmentation code needs a slight modification to properly support it (see issue #138); the video segmentation code supports it without modifications (see issue #257). The v1 model can also be modified to support different input sizes, and in fact it seems more robust to changes in input resolution than the v2 model.
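For reference, here's a minimal sketch of the usual rescaling workaround, assuming a `(1, H, W, C)` encoding layout like the one above (the function name is illustrative, not from the SAM codebase):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_hw: tuple) -> torch.Tensor:
    """Rescale a learned (1, H, W, C) position encoding to a new
    token-grid size via bicubic interpolation."""
    # (1, H, W, C) -> (1, C, H, W) for F.interpolate, then back
    pe = pos_embed.permute(0, 3, 1, 2)
    pe = F.interpolate(pe, size=new_hw, mode="bicubic", align_corners=False)
    return pe.permute(0, 2, 3, 1)

# e.g. adapt a 64x64 encoding (1024px / 16px patches) to a 512px input (32x32 tokens)
pe_1024 = torch.randn(1, 64, 64, 256)
pe_512 = resize_pos_embed(pe_1024, (32, 32))
print(pe_512.shape)  # torch.Size([1, 32, 32, 256])
```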