Add a subclass of `BaseRetriever` for an HF multi-modal retriever:

- text (SentenceTransformer)
- image (PreTrainedModel?)
- audio (PreTrainedModel?)
- video (PreTrainedModel?)

Closes: #483
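
For reviewers, a minimal sketch of the per-modality dispatch this subclass could use. Everything here is an assumption for illustration: the `BaseRetriever` below is a hypothetical stand-in (the project's real base class and its method signatures may differ), `MultiModalRetriever`, `add`, and `retrieve` are made-up names, and the encoders are plain callables that are assumed to embed every modality into one shared vector space. In practice the text encoder would likely wrap `SentenceTransformer.encode`, and the image, audio, and video encoders would wrap whichever HF models end up being chosen.

```python
import math
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Encoder = Callable[[object], Vector]


class BaseRetriever(ABC):
    """Hypothetical stand-in for the project's base class."""

    @abstractmethod
    def retrieve(self, query: object, modality: str, top_k: int = 3) -> List[Tuple[str, float]]:
        ...


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MultiModalRetriever(BaseRetriever):
    """Routes each item to the encoder registered for its modality."""

    def __init__(self, encoders: Dict[str, Encoder]) -> None:
        # Assumption: one encoder per modality, all producing vectors
        # in the same embedding space so they can be compared directly.
        self.encoders = encoders
        self.index: List[Tuple[str, Vector]] = []

    def _encode(self, item: object, modality: str) -> Vector:
        if modality not in self.encoders:
            raise ValueError(f"no encoder registered for modality {modality!r}")
        return self.encoders[modality](item)

    def add(self, doc_id: str, item: object, modality: str) -> None:
        """Embed an item and store it in the in-memory index."""
        self.index.append((doc_id, self._encode(item, modality)))

    def retrieve(self, query: object, modality: str, top_k: int = 3) -> List[Tuple[str, float]]:
        """Return the top_k (doc_id, score) pairs by cosine similarity."""
        q = self._encode(query, modality)
        scored = [(doc_id, cosine(q, vec)) for doc_id, vec in self.index]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy encoders for demonstration only; real ones would call HF models.
toy_encoders: Dict[str, Encoder] = {
    "text": lambda s: [float(len(s)), 1.0],
    "image": lambda pixels: [float(sum(pixels)), 1.0],
}

retriever = MultiModalRetriever(toy_encoders)
retriever.add("doc-text", "hello", "text")
retriever.add("doc-image", [0, 0], "image")
hits = retriever.retrieve("hello", "text", top_k=1)
```

Keeping the encoders as injected callables means audio and video support can be added by registering another entry, without changing the retriever itself.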