Hello guys,
Thanks for this amazing repo, it is very useful for me.
I wanted to ask if there is interest in implementing methods like CLIP for image-language pretraining.
I understand that this might not be your primary focus and that web-scale pretraining might be out of reach. However, the paper https://arxiv.org/abs/2305.08675 shows that one can get relatively high zero-shot accuracy with a compute budget roughly comparable to ImageNet pretraining.
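In case it helps scope the request: the core of CLIP is just a symmetric contrastive (InfoNCE) loss over paired image/text embeddings, roughly along these lines (a minimal sketch for illustration, not taken from any particular implementation):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_embeds: torch.Tensor, text_embeds: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matching image/text embedding pairs."""
    # Normalize so the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarity logits, scaled by the temperature.
    logits = image_embeds @ text_embeds.t() / temperature

    # The matching pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```

Everything else (the image encoder, a text encoder/tokenizer, and an image-caption data loader) is standard plumbing around this objective.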
Hi!
Multi-modal is definitely something we would like to incorporate. There are two main components missing for this: data loading for text, and NLP models/tokenizers. For both, we have to decide how to support them. This was quite easy for vision because data loading is fairly standardized and the models are in torchvision. For text, the landscape is more diverse, so we'll have to compare the available libraries first. Please let us know if you have any suggestions or input!