Hello guys,
Thanks for this amazing repo, it is very useful for me.
I wanted to ask if there is interest in implementing methods like CLIP for image-language pretraining.
I understand that this might not be your primary focus and that web-scale pretraining might be out of reach. However, the paper https://arxiv.org/abs/2305.08675 shows that one can get relatively high zero-shot accuracy with a compute budget roughly comparable to ImageNet pretraining.
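In case it helps scope the request: the core of CLIP is just a symmetric contrastive (InfoNCE) loss over paired image/text embeddings, roughly along these lines (a minimal sketch for illustration, not taken from any particular implementation):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_embeds: torch.Tensor, text_embeds: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matching image/text embedding pairs."""
    # Normalize so the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarity logits, scaled by the temperature.
    logits = image_embeds @ text_embeds.t() / temperature

    # The matching pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```

Everything else (the image encoder, a text encoder/tokenizer, and an image-caption data loader) is standard plumbing around this objective.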
Hi!
Multi-modal is definitely something we would like to incorporate. There are two main components missing for this: data loading for text, and NLP models/tokenizers. For both, we have to decide how to support them. This was quite easy for vision because data loading is fairly standardized and the models are in torchvision. For text, the landscape is more diverse, so we'll have to compare the available libraries first. Please let us know if you have any suggestions or input!