
CDE models missing #1851

Closed · x-tabdeveloping opened this issue Jan 21, 2025 · 17 comments

@x-tabdeveloping (Collaborator)

We should add CDE before we launch the leaderboard.
We've had a PR open on this for months (#1521), but it is going nowhere, so we will probably have to take matters into our own hands.

@x-tabdeveloping (Collaborator, Author)

@jxmorris12 Can I ask you to help me out with this? I'm a bit confused about:

  1. Your training data, and which parts of it overlap with MTEB.
  2. Your implementation. I tried to write an elementary implementation of your model in MTEB, but I'm not entirely sure whether it is functionally equivalent to yours.

We're a bit pressed for time with the leaderboard release, so it would be awesome if I could get your input on this.

This is the implementation I've been messing with:

import random
from typing import Any, Sequence

import numpy as np
from sentence_transformers import SentenceTransformer

from mteb.models.wrapper import Wrapper  # import path may differ between mteb versions


class CDEWrapper(Wrapper):
    def __init__(
        self,
        model_name: str,
        random_state: int = 42,
        **kwargs,
    ) -> None:
        """Wrapper for CDE models.

        Args:
            model_name: The CDE model to load from HuggingFace Hub.
            random_state: Seed for sampling a minicorpus.
            **kwargs: Additional arguments to pass to the wrapper.
        """
        self.model_name = model_name
        self.model = SentenceTransformer(model_name, trust_remote_code=True)
        self.random_state = random_state

    def encode(
        self,
        sentences: Sequence[str],
        **kwargs: Any,
    ) -> np.ndarray:
        """Encodes the given sentences using the encoder.

        Args:
            sentences: The sentences to encode.
            **kwargs: Additional arguments to pass to the encoder.

        Returns:
            The encoded sentences.
        """
        random.seed(self.random_state)
        minicorpus_size = self.model[0].config.transductive_corpus_size
        # Sampling the minicorpus
        if len(sentences) <= minicorpus_size:
            # We need to sample with replacement if the minicorpus needs to be bigger than
            # the number of sentences
            minicorpus = random.choices(sentences, k=minicorpus_size)
        else:
            minicorpus = random.sample(sentences, minicorpus_size)
        # Resetting the seed
        random.seed()
        # First stage: embed the minicorpus to get dataset embeddings;
        # second stage: embed the sentences conditioned on them.
        dataset_embeddings = self.model.encode(minicorpus, prompt_name="document")
        return self.model.encode(
            sentences, dataset_embeddings=dataset_embeddings, **kwargs
        )
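
For reference, a quick usage sketch (the model id and the inputs are just examples):

# Usage sketch only; model id and inputs are illustrative.
wrapper = CDEWrapper("jxm/cde-small-v1")
docs = [
    "Contextual document embeddings condition on a sample of the corpus.",
    "CDE encodes in two stages: dataset embeddings first, then the sentences.",
]
embeddings = wrapper.encode(docs, prompt_name="document")
print(embeddings.shape)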

@x-tabdeveloping (Collaborator, Author)

Thanks in advance for the help!

@jxmorris12

Our training data is exactly the same as BGE and all the newer large models. Here's a link: https://huggingface.co/datasets/cfli/bge-full-data

This implementation is close but not completely correct because you need to sample the minicorpus from the documents in each case. I think this will use a minicorpus of queries to embed queries, instead of using a minicorpus of documents to embed queries. If that makes sense.
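
To make that concrete, here is a rough sketch of what sampling from the documents in both cases could look like (the class and method names are illustrative, not the actual mteb interface):

import random


# Rough sketch only: the class and method names are illustrative,
# not the actual mteb interface.
class CDERetrievalSketch:
    def __init__(self, model, seed: int = 42):
        self.model = model  # a SentenceTransformer-loaded CDE model
        self.seed = seed
        self.dataset_embeddings = None  # always derived from documents

    def _sample_minicorpus(self, documents):
        size = self.model[0].config.transductive_corpus_size
        rng = random.Random(self.seed)
        if len(documents) <= size:
            # Sample with replacement when there are too few documents
            return rng.choices(documents, k=size)
        return rng.sample(documents, size)

    def encode_corpus(self, documents, **kwargs):
        # The minicorpus (and hence dataset_embeddings) comes from the documents ...
        minicorpus = self._sample_minicorpus(documents)
        self.dataset_embeddings = self.model.encode(minicorpus, prompt_name="document")
        return self.model.encode(
            documents,
            prompt_name="document",
            dataset_embeddings=self.dataset_embeddings,
            **kwargs,
        )

    def encode_queries(self, queries, **kwargs):
        # ... and is reused here, so queries are never used to build the minicorpus.
        if self.dataset_embeddings is None:
            raise RuntimeError("encode_corpus must be called before encode_queries")
        return self.model.encode(
            queries,
            prompt_name="query",
            dataset_embeddings=self.dataset_embeddings,
            **kwargs,
        )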

@Samoed (Collaborator) commented Jan 21, 2025

In encode, queries and passages are passed separately, so I think only the prompt name needs to be changed.
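
In other words, something as small as this, where prompt_type stands in for however mteb marks queries vs. passages when calling encode:

# Sketch only: prompt_type is a stand-in keyword, and the dataset embeddings
# are assumed to be precomputed elsewhere.
def encode_with_prompt(model, sentences, dataset_embeddings, prompt_type=None):
    prompt_name = "query" if prompt_type == "query" else "document"
    return model.encode(
        sentences, prompt_name=prompt_name, dataset_embeddings=dataset_embeddings
    )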

@Samoed (Collaborator) commented Jan 21, 2025

Also, during evaluation, did you use task2prefix_short or task2prefix_long (instructions)?

@jxmorris12

> In encode, queries and passages are passed separately, so I think only the prompt name needs to be changed.

Sorry, but won't this encode then try to establish a minicorpus of queries, which is incorrect?

> Also, during evaluation, did you use task2prefix_short or task2prefix_long (instructions)?

task2prefix_short
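
For illustration, applying such a short prefix could look roughly like this (the mapping entries and the joining format are placeholders; the real task2prefix_short lives in the cde repo):

# Placeholder entries and joining format; the real task2prefix_short lives in the cde repo.
task2prefix_short = {
    "NQ": "web search query",
    "HotpotQA": "multi-hop question",
}

def add_short_prefix(texts, task_name):
    prefix = task2prefix_short.get(task_name, "")
    return [f"{prefix}: {text}" if prefix else text for text in texts]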

@Samoed (Collaborator) commented Jan 21, 2025

> Sorry, but won't this encode then try to establish a minicorpus of queries, which is incorrect?

Probably, yes. I think it would be simpler to integrate your model into the v2 branch, as it combines Retrieval and Reranking tasks. Additionally, we could add a method for the wrapper, such as pre_retrieval, that processes the corpus and stores this information, along with a method for deleting the stored information when it's no longer needed.

This approach could also extend to Classification/MultilabelClassification tasks, where embeddings would be created only from the training data. For the rest of the tasks, data could be sampled directly. For STS and PairClassification tasks, we might need to sample from both sentence corpora, but I'm not entirely sure about that.
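
Roughly what I have in mind; nothing like this exists in the wrapper yet, so the hook names and the helper below are just a sketch of the proposal:

import random


def sample_minicorpus(corpus, model, seed=42):
    """Sample a minicorpus of the size the CDE model expects (with replacement if needed)."""
    size = model[0].config.transductive_corpus_size
    rng = random.Random(seed)
    if len(corpus) <= size:
        return rng.choices(corpus, k=size)
    return rng.sample(corpus, size)


class ContextualWrapperSketch:
    """Sketch of the proposed pre-/post-retrieval hooks; nothing here exists in mteb yet."""

    def __init__(self, model):
        self.model = model
        self._dataset_embeddings = None

    def pre_retrieval(self, corpus, **kwargs):
        # Process the corpus once and cache the dataset embeddings so that
        # both corpus and query encoding reuse document-derived context.
        minicorpus = sample_minicorpus(corpus, self.model)
        self._dataset_embeddings = self.model.encode(minicorpus, prompt_name="document")

    def post_retrieval(self, **kwargs):
        # Delete the stored information once the task is finished.
        self._dataset_embeddings = None

    def encode(self, sentences, **kwargs):
        return self.model.encode(
            sentences, dataset_embeddings=self._dataset_embeddings, **kwargs
        )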

@jxmorris12

That sounds good.

@x-tabdeveloping (Collaborator, Author)

Hmm, @Samoed, how about I annotate the model metadata and leave the loader as None for now? Then you guys can add a proper implementation in v2.

@Samoed (Collaborator) commented Jan 22, 2025

Yes, I'll add the implementation.

@x-tabdeveloping (Collaborator, Author)

@jxmorris12 sorry for bombarding you with questions, but I was wondering whether this config file in your repo is for the training data or something else: https://github.com/jxmorris12/cde/blob/main/cde/config/bge.yaml

Also, can you specify which BGE models we are talking about? We have some annotations for BGE training data, but the vanilla models are trained on different data than the bge_m models:

bge_m_training_data = {
    # source: https://arxiv.org/pdf/2402.03216
    "MIRACLRetrieval": ["train"],
    "MIRACLRetrievalHardNegatives": ["train"],
    "MIRACLReranking": ["train"],
    "LeCaRDv2": ["train"],
    "CMedQAv1-reranking": ["train"],
    "CMedQAv2-reranking": ["train"],
    "MrTidyRetrieval": ["train"],
    "T2Reranking": ["train"],
    "MSMARCO": ["train"],
    "MSMARCOHardNegatives": ["train"],
    "NanoMSMARCORetrieval": ["train"],
    "MSMARCO-PL": ["train"],  # translation not trained on
    "NQ": ["train"],
    "NQHardNegatives": ["train"],
    "NanoNQRetrieval": ["train"],
    "NQ-PL": ["train"],  # translation not trained on
    "HotpotQA": ["train"],
    "HotpotQA-PL": ["train"],  # translation not trained on
    "HotpotQAHardNegatives": ["train"],
    # + synthetic data
}

bge_training_data = {
    # source: https://data.baai.ac.cn/details/BAAI-MTP
    "NQ": ["test"],
    "NQHardNegatives": ["test"],
    "AmazonReviewsClassification": [
        "validation",
        "test",
    ],  # assumed from: amazon_reviews_multi
    "MLQARetrieval": [
        "validation",
        "test",
    ],  # assumed from mlqa	(question, context)
    # not in mteb
    # Dataset	Pairs
    # wudao	(title, passage)
    # cmrc2018	(query, context)
    # dureader	(query, context)
    # simclue	(sentence_a, sentence_b)
    # csl	(title, abstract)
    # amazon_reviews_multi	(title, body)
    # wiki_atomic_edits	(base_sentence, edited_sentence)
    # mlqa	(question, context)
    # xlsum	(title, summary) (title, text)
    # "sentence-transformers data": [],  # https://huggingface.co/datasets/sentence-transformers/embedding-training-data # TODO check this further
    # "wikipedia": [],  # title + section title, passage
    # "reddit": [],  # title, body
    # "stackexchange": [],  # (title, upvoted answer) (title+body, upvoted answer)
    # "s2orc": [],  # (title, abstract) (title, citation title) (abstract, citation abstract)
}

bgem3_training_data = {
    # source https://arxiv.org/abs/2402.03216
    "T2Retrieval": ["train"],
    "DuReader": ["train"],
    "MMarcoReranking": ["train"],
    "CMedQAv2-reranking": ["train"],
    "HotpotQA": ["train"],
    "NQ": ["train"],
    "MSMARCO": ["train"],
    "MrTidyRetrieval": ["train"],
    "MIRACLRetrieval": ["train"],
    "CodeSearchNet": ["train"],
    # not in mteb
    # "s2orc"
    # Wikipedia
    # "xP3"
    # "mC4"
    # "CC-News"
    # "MTP"
    # "NLLB"
    # "CCMatrix"
    # TriviaQA
    # COL-IEE
    # PubMedQA
    # SQuAD
    # SimCSE
    # mMARCO-ZH
    # LawGPT
    # NLI-zh2, LeCaRDv2,
    # NLI, MultiLongDoc (their synthetic data)
}

@x-tabdeveloping (Collaborator, Author)

Okay wait, bge-full-data's content is annotated here, right? https://arxiv.org/pdf/2409.15700

@x-tabdeveloping (Collaborator, Author)

Adding model metadata here: #1856

@jxmorris12

Hey – thanks! Yeah that config is correct. I think that the data originally comes from here: Making Text Embedders Few-Shot Learners

@x-tabdeveloping (Collaborator, Author)

I'll close this for now, since the new leaderboard has the model. Let's make sure we add an implementation in the future.

@jxmorris12

I'm confused about what happened here. Did you have any trouble with the implementation? Is there anything I can help with?

@Samoed (Collaborator) commented Jan 27, 2025

No, only the model metadata was added. I'll add the implementation later.
