
CDE models missing #1851

Closed · x-tabdeveloping opened this issue Jan 21, 2025 · 17 comments

@x-tabdeveloping (Collaborator)

We should add CDE before we launch the leaderboard.
We've had a PR open on this for months (#1521), but it is going nowhere, so we will probably have to take matters into our own hands.

@x-tabdeveloping (Collaborator, Author)

@jxmorris12 Can I ask you to help me out with this? I'm a bit confused about:

  1. Your training data, and which parts of it overlap with MTEB.
  2. Your implementation. I tried to write an elementary implementation of your model in MTEB, but I'm not entirely sure whether it is functionally equivalent to yours.

We're a bit pressed for time with the leaderboard release, so it would be awesome if I could get your input on this.

This is the implementation I've been messing with:

import random
from typing import Any, Sequence

import numpy as np
from sentence_transformers import SentenceTransformer

from mteb.models.wrapper import Wrapper  # import path may differ between mteb versions


class CDEWrapper(Wrapper):
    def __init__(
        self,
        model_name: str,
        random_state: int = 42,
        **kwargs,
    ) -> None:
        """Wrapper for CDE models.

        Args:
            model_name: The CDE model to load from HuggingFace Hub.
            random_state: Seed for sampling a minicorpus.
            **kwargs: Additional arguments to pass to the wrapper.
        """
        self.model_name = model_name
        self.model = SentenceTransformer(model_name, trust_remote_code=True)
        self.random_state = random_state

    def encode(
        self,
        sentences: Sequence[str],
        **kwargs: Any,
    ) -> np.ndarray:
        """Encodes the given sentences using the encoder.

        Args:
            sentences: The sentences to encode.
            **kwargs: Additional arguments to pass to the encoder.

        Returns:
            The encoded sentences.
        """
        random.seed(self.random_state)
        minicorpus_size = self.model[0].config.transductive_corpus_size
        # Sampling the minicorpus
        if len(sentences) <= minicorpus_size:
            # We need to sample with replacement if the minicorpus needs to be bigger than
            # the number of sentences
            minicorpus = random.choices(sentences, k=minicorpus_size)
        else:
            minicorpus = random.sample(sentences, minicorpus_size)
        # Resetting the seed
        random.seed()
        # First stage: embed the minicorpus to get dataset embeddings;
        # second stage: embed the sentences conditioned on them.
        dataset_embeddings = self.model.encode(minicorpus, prompt_name="document")
        return self.model.encode(
            sentences, dataset_embeddings=dataset_embeddings, **kwargs
        )
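
For reference, a quick usage sketch (the model id and the inputs are just examples):

# Usage sketch only; model id and inputs are illustrative.
wrapper = CDEWrapper("jxm/cde-small-v1")
docs = [
    "Contextual document embeddings condition on a sample of the corpus.",
    "CDE encodes in two stages: dataset embeddings first, then the sentences.",
]
embeddings = wrapper.encode(docs, prompt_name="document")
print(embeddings.shape)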

@x-tabdeveloping (Collaborator, Author)

Thanks in advance for the help!

@jxmorris12

Our training data is exactly the same as BGE and all the newer large models. Here's a link: https://huggingface.co/datasets/cfli/bge-full-data

This implementation is close but not completely correct because you need to sample the minicorpus from the documents in each case. I think this will use a minicorpus of queries to embed queries, instead of using a minicorpus of documents to embed queries. If that makes sense.
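
To make that concrete, here is a rough sketch of what sampling from the documents in both cases could look like (the class and method names are illustrative, not the actual mteb interface):

import random


# Rough sketch only: the class and method names are illustrative,
# not the actual mteb interface.
class CDERetrievalSketch:
    def __init__(self, model, seed: int = 42):
        self.model = model  # a SentenceTransformer-loaded CDE model
        self.seed = seed
        self.dataset_embeddings = None  # always derived from documents

    def _sample_minicorpus(self, documents):
        size = self.model[0].config.transductive_corpus_size
        rng = random.Random(self.seed)
        if len(documents) <= size:
            # Sample with replacement when there are too few documents
            return rng.choices(documents, k=size)
        return rng.sample(documents, size)

    def encode_corpus(self, documents, **kwargs):
        # The minicorpus (and hence dataset_embeddings) comes from the documents ...
        minicorpus = self._sample_minicorpus(documents)
        self.dataset_embeddings = self.model.encode(minicorpus, prompt_name="document")
        return self.model.encode(
            documents,
            prompt_name="document",
            dataset_embeddings=self.dataset_embeddings,
            **kwargs,
        )

    def encode_queries(self, queries, **kwargs):
        # ... and is reused here, so queries are never used to build the minicorpus.
        if self.dataset_embeddings is None:
            raise RuntimeError("encode_corpus must be called before encode_queries")
        return self.model.encode(
            queries,
            prompt_name="query",
            dataset_embeddings=self.dataset_embeddings,
            **kwargs,
        )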

@Samoed (Collaborator) commented Jan 21, 2025

In encode, queries and passages are passed separately, so I think only the prompt name needs to be changed.
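
In other words, something as small as this, where prompt_type stands in for however mteb marks queries vs. passages when calling encode:

# Sketch only: prompt_type is a stand-in keyword, and the dataset embeddings
# are assumed to be precomputed elsewhere.
def encode_with_prompt(model, sentences, dataset_embeddings, prompt_type=None):
    prompt_name = "query" if prompt_type == "query" else "document"
    return model.encode(
        sentences, prompt_name=prompt_name, dataset_embeddings=dataset_embeddings
    )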

@Samoed (Collaborator) commented Jan 21, 2025

Also, during evaluation, did you use task2prefix_short or task2prefix_long (instructions)?

@jxmorris12

> In encode, queries and passages are passed separately, so I think only the prompt name needs to be changed.

Sorry, but won't this encode then try to establish a minicorpus of queries, which is incorrect?

> Also, during evaluation, did you use task2prefix_short or task2prefix_long (instructions)?

task2prefix_short
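
For illustration, applying such a short prefix could look roughly like this (the mapping entries and the joining format are placeholders; the real task2prefix_short lives in the cde repo):

# Placeholder entries and joining format; the real task2prefix_short lives in the cde repo.
task2prefix_short = {
    "NQ": "web search query",
    "HotpotQA": "multi-hop question",
}

def add_short_prefix(texts, task_name):
    prefix = task2prefix_short.get(task_name, "")
    return [f"{prefix}: {text}" if prefix else text for text in texts]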

@Samoed (Collaborator) commented Jan 21, 2025

> Sorry, but won't this encode then try to establish a minicorpus of queries, which is incorrect?

Probably, yes. I think it would be simpler to integrate your model into the v2 branch, as it combines Retrieval and Reranking tasks. Additionally, we could add a method for the wrapper, such as pre_retrieval, that processes the corpus and stores this information, along with a method for deleting the stored information when it's no longer needed.

This approach could also extend to Classification/MultilabelClassification tasks, where embeddings would be created only from the training data. For the rest of the tasks, data could be sampled directly. For STS and PairClassification tasks, we might need to sample from both sentence corpora, but I'm not entirely sure about that.
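
Roughly what I have in mind; nothing like this exists in the wrapper yet, so the hook names and the helper below are just a sketch of the proposal:

import random


def sample_minicorpus(corpus, model, seed=42):
    """Sample a minicorpus of the size the CDE model expects (with replacement if needed)."""
    size = model[0].config.transductive_corpus_size
    rng = random.Random(seed)
    if len(corpus) <= size:
        return rng.choices(corpus, k=size)
    return rng.sample(corpus, size)


class ContextualWrapperSketch:
    """Sketch of the proposed pre-/post-retrieval hooks; nothing here exists in mteb yet."""

    def __init__(self, model):
        self.model = model
        self._dataset_embeddings = None

    def pre_retrieval(self, corpus, **kwargs):
        # Process the corpus once and cache the dataset embeddings so that
        # both corpus and query encoding reuse document-derived context.
        minicorpus = sample_minicorpus(corpus, self.model)
        self._dataset_embeddings = self.model.encode(minicorpus, prompt_name="document")

    def post_retrieval(self, **kwargs):
        # Delete the stored information once the task is finished.
        self._dataset_embeddings = None

    def encode(self, sentences, **kwargs):
        return self.model.encode(
            sentences, dataset_embeddings=self._dataset_embeddings, **kwargs
        )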

@jxmorris12

That sounds good.

@x-tabdeveloping (Collaborator, Author)

Hmm, @Samoed, how about I annotate the model metadata and leave the loader as None for now? Then you guys can add a proper implementation in v2.

@Samoed (Collaborator) commented Jan 22, 2025

Yes, I'll add the implementation.

@x-tabdeveloping (Collaborator, Author)

@jxmorris12 sorry for bombarding you with questions, but I was wondering whether this config file in your repo is for the training data or something else: https://github.com/jxmorris12/cde/blob/main/cde/config/bge.yaml

Also, can you specify which BGE models we are talking about? We have some annotations for BGE training data, but the vanilla models are trained on different data than the bge_m models:

bge_m_training_data = {
    # source: https://arxiv.org/pdf/2402.03216
    "MIRACLRetrieval": ["train"],
    "MIRACLRetrievalHardNegatives": ["train"],
    "MIRACLReranking": ["train"],
    "LeCaRDv2": ["train"],
    "CMedQAv1-reranking": ["train"],
    "CMedQAv2-reranking": ["train"],
    "MrTidyRetrieval": ["train"],
    "T2Reranking": ["train"],
    "MSMARCO": ["train"],
    "MSMARCOHardNegatives": ["train"],
    "NanoMSMARCORetrieval": ["train"],
    "MSMARCO-PL": ["train"],  # translation not trained on
    "NQ": ["train"],
    "NQHardNegatives": ["train"],
    "NanoNQRetrieval": ["train"],
    "NQ-PL": ["train"],  # translation not trained on
    "HotpotQA": ["train"],
    "HotpotQA-PL": ["train"],  # translation not trained on
    "HotpotQAHardNegatives": ["train"],
    # + synthetic data
}

bge_training_data = {
    # source: https://data.baai.ac.cn/details/BAAI-MTP
    "NQ": ["test"],
    "NQHardNegatives": ["test"],
    "AmazonReviewsClassification": [
        "validation",
        "test",
    ],  # assumed from: amazon_reviews_multi
    "MLQARetrieval": [
        "validation",
        "test",
    ],  # assumed from mlqa	(question, context)
    # not in mteb
    # Dataset	Pairs
    # wudao	(title, passage)
    # cmrc2018	(query, context)
    # dureader	(query, context)
    # simclue	(sentence_a, sentence_b)
    # csl	(title, abstract)
    # amazon_reviews_multi	(title, body)
    # wiki_atomic_edits	(base_sentence, edited_sentence)
    # mlqa	(question, context)
    # xlsum	(title, summary) (title, text)
    # "sentence-transformers data": [],  # https://huggingface.co/datasets/sentence-transformers/embedding-training-data # TODO check this further
    # "wikipedia": [],  # title + section title, passage
    # "reddit": [],  # title, body
    # "stackexchange": [],  # (title, upvoted answer) (title+body, upvoted answer)
    # "s2orc": [],  # (title, abstract) (title, citation title) (abstract, citation abstract)
}

bgem3_training_data = {
    # source https://arxiv.org/abs/2402.03216
    "T2Retrieval": ["train"],
    "DuReader": ["train"],
    "MMarcoReranking": ["train"],
    "CMedQAv2-reranking": ["train"],
    "HotpotQA": ["train"],
    "NQ": ["train"],
    "MSMARCO": ["train"],
    "MrTidyRetrieval": ["train"],
    "MIRACLRetrieval": ["train"],
    "CodeSearchNet": ["train"],
    # not in mteb
    # "s2orc"
    # Wikipedia
    # "xP3"
    # "mC4"
    # "CC-News"
    # "MTP"
    # "NLLB"
    # "CCMatrix"
    # TriviaQA
    # COL-IEE
    # PubMedQA
    # SQuAD
    # SimCSE
    # mMARCO-ZH
    # LawGPT
    # NLI-zh2, LeCaRDv2,
    # NLI, MultiLongDoc (their synthetic data)
}

@x-tabdeveloping (Collaborator, Author)

Okay wait, bge-full-data's content is annotated here, right? https://arxiv.org/pdf/2409.15700

@x-tabdeveloping (Collaborator, Author)

Adding model metadata here: #1856

@jxmorris12

Hey – thanks! Yeah that config is correct. I think that the data originally comes from here: Making Text Embedders Few-Shot Learners

@x-tabdeveloping (Collaborator, Author)

I'll close this for now, since the new leaderboard has the model. Let's make sure we add an implementation in the future.

@jxmorris12

I'm confused about what happened here. Did you have any trouble with the implementation? Is there anything I can help with?

@Samoed (Collaborator) commented Jan 27, 2025

No, only the model metadata was added. I'll add the implementation later.
