A question has been bothering me for a long time: why does the Segment Anything 1 code fix the input image size at 1024*1024? It is based on a transformer, which should support images of any size.
The main limitations on input sizing for vision transformers come from the patch embedding step and the positional encodings applied to the image tokens. Both SAMv1 & v2 use learned position encodings, which exist only for a single input size; that's probably why they're hard-coded for that size.
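Here's a minimal sketch of that pattern (illustrative, not the actual SAM source): the learned position encoding is a `Parameter` whose spatial shape is fixed when the model is built, so the forward pass only works for one token-grid size.

```python
import torch
import torch.nn as nn

class PatchEmbedWithLearnedPosEnc(nn.Module):
    """ViT-style patch embedding with a learned position encoding.
    The pos encoding's spatial shape is fixed at construction, which
    is what ties the model to a single input resolution."""

    def __init__(self, img_size=1024, patch_size=16, embed_dim=768):
        super().__init__()
        grid = img_size // patch_size  # 1024 / 16 = 64 tokens per side
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learned for exactly a (grid x grid) token grid; no other size exists
        self.pos_embed = nn.Parameter(torch.zeros(1, grid, grid, embed_dim))

    def forward(self, x):
        x = self.proj(x).permute(0, 2, 3, 1)  # (B, H/16, W/16, C)
        # This addition fails for any input whose token grid isn't 64x64
        return x + self.pos_embed
```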
That being said, it's common to up/down-scale the position encodings to support different input sizes. The v2 model already does this, but the image segmentation code needs a slight modification to properly support it (see issue #138); the video segmentation code supports it without modifications (see issue #257). The v1 model can also be modified to support different input sizes, and in fact it seems more robust to changes in input resolution than the v2 model.
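For reference, here's a minimal sketch of the usual rescaling workaround, assuming a `(1, H, W, C)` encoding layout like the one above (the function name is illustrative, not from the SAM codebase):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_hw: tuple) -> torch.Tensor:
    """Rescale a learned (1, H, W, C) position encoding to a new
    token-grid size via bicubic interpolation."""
    # (1, H, W, C) -> (1, C, H, W) for F.interpolate, then back
    pe = pos_embed.permute(0, 3, 1, 2)
    pe = F.interpolate(pe, size=new_hw, mode="bicubic", align_corners=False)
    return pe.permute(0, 2, 3, 1)

# e.g. adapt a 64x64 encoding (1024px / 16px patches) to a 512px input (32x32 tokens)
pe_1024 = torch.randn(1, 64, 64, 256)
pe_512 = resize_pos_embed(pe_1024, (32, 32))
print(pe_512.shape)  # torch.Size([1, 32, 32, 256])
```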