CDE models missing #1851
@jxmorris12 Can I ask you to help me out with this?
This is the implementation I've been messing with:

```python
import random
from typing import Any, Sequence

import numpy as np
from sentence_transformers import SentenceTransformer

# `Wrapper` is mteb's model wrapper base class


class CDEWrapper(Wrapper):
    def __init__(
        self,
        model_name: str,
        random_state: int = 42,
        **kwargs,
    ) -> None:
        """Wrapper for CDE models.

        Args:
            model_name: The CDE model to load from the Hugging Face Hub.
            random_state: Seed for sampling a minicorpus.
            **kwargs: Additional arguments to pass to the wrapper.
        """
        self.model_name = model_name
        self.model = SentenceTransformer(model_name, trust_remote_code=True)
        self.random_state = random_state

    def encode(
        self,
        sentences: Sequence[str],
        **kwargs: Any,
    ) -> np.ndarray:
        """Encodes the given sentences using the encoder.

        Args:
            sentences: The sentences to encode.
            **kwargs: Additional arguments to pass to the encoder.

        Returns:
            The encoded sentences.
        """
        random.seed(self.random_state)
        minicorpus_size = self.model[0].config.transductive_corpus_size
        # Sampling the minicorpus
        if len(sentences) <= minicorpus_size:
            # Sample with replacement if the minicorpus needs to be bigger
            # than the number of sentences
            minicorpus = random.choices(sentences, k=minicorpus_size)
        else:
            minicorpus = random.sample(sentences, minicorpus_size)
        # Resetting the global seed
        random.seed()
        dataset_embeddings = self.model.encode(minicorpus, prompt_name="document")
        return self.model.encode(
            sentences, dataset_embeddings=dataset_embeddings, **kwargs
        )
```
Thanks for the help in advance!
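As a side note, the seeding in that snippet can be isolated into a small helper that uses a local `random.Random` instance keyed on `random_state`, so the wrapper never has to mutate and then reset the global RNG. This is just a sketch; `sample_minicorpus` is a hypothetical name, not an mteb API:

```python
import random
from typing import Sequence


def sample_minicorpus(
    sentences: Sequence[str], minicorpus_size: int, random_state: int = 42
) -> list:
    """Deterministically sample a minicorpus without touching the global RNG."""
    rng = random.Random(random_state)  # local RNG; global seed stays untouched
    if len(sentences) <= minicorpus_size:
        # Sample with replacement when the pool is smaller than the minicorpus
        return rng.choices(sentences, k=minicorpus_size)
    return rng.sample(sentences, minicorpus_size)
```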
Our training data is exactly the same as BGE and all the newer large models. Here's a link: https://huggingface.co/datasets/cfli/bge-full-data

This implementation is close but not completely correct, because you need to sample the minicorpus from the documents in each case. I think this will use a minicorpus of queries to embed queries, instead of using a minicorpus of documents to embed queries. If that makes sense.
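One hedged way to apply that correction is to sample and embed the minicorpus once from the document corpus, then reuse those first-stage embeddings for both query and document encoding. The sketch below abstracts the model behind a plain callable so the flow is testable; `DocMinicorpusCDE`, `encode_fn`, and the `dataset_embeddings` keyword are illustrative names mirroring the snippet above, not mteb's actual interface:

```python
import random
from typing import Callable, Sequence


class DocMinicorpusCDE:
    """Sketch: first-stage embeddings always come from documents, never queries."""

    def __init__(self, encode_fn: Callable[..., list], minicorpus_size: int,
                 random_state: int = 42) -> None:
        self.encode_fn = encode_fn
        self.minicorpus_size = minicorpus_size
        self.random_state = random_state
        self.dataset_embeddings = None  # cached first-stage embeddings

    def fit_corpus(self, corpus: Sequence[str]) -> None:
        """Sample the minicorpus from the *document* corpus and embed it once."""
        rng = random.Random(self.random_state)
        if len(corpus) <= self.minicorpus_size:
            minicorpus = rng.choices(corpus, k=self.minicorpus_size)
        else:
            minicorpus = rng.sample(corpus, self.minicorpus_size)
        self.dataset_embeddings = self.encode_fn(minicorpus)

    def encode_queries(self, queries: Sequence[str]) -> list:
        # Queries are conditioned on document embeddings, not on other queries
        return self.encode_fn(queries, dataset_embeddings=self.dataset_embeddings)

    def encode_corpus(self, docs: Sequence[str]) -> list:
        return self.encode_fn(docs, dataset_embeddings=self.dataset_embeddings)
```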
Also, during evaluation, did you use
Sorry, but won't this `encode` then try to establish a minicorpus of queries, which is incorrect?
Probably, yes. I think it would be simpler to integrate your model into the

This approach could also extend to Classification/MultilabelClassification tasks, where embeddings would be created only from the training data. For the rest of the tasks, data could be sampled directly. For STS and PairClassification tasks, we might need to sample from both sentence corpora, but I'm not entirely sure about that.
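For what it's worth, that per-task strategy could be sketched as a single dispatch that decides which text pool the minicorpus is drawn from. Everything here (the function name, the task-type strings) is hypothetical, just to make the proposal concrete:

```python
from typing import Sequence


def minicorpus_pool(
    task_type: str,
    corpus: Sequence[str] = (),
    train_texts: Sequence[str] = (),
    sentences1: Sequence[str] = (),
    sentences2: Sequence[str] = (),
) -> list:
    """Pick the text pool the minicorpus should be sampled from, per task type."""
    if task_type in {"Classification", "MultilabelClassification"}:
        # Embeddings built only from the training data
        return list(train_texts)
    if task_type in {"STS", "PairClassification"}:
        # Sample from both sides of each sentence pair
        return list(sentences1) + list(sentences2)
    if task_type == "Retrieval":
        # Always the documents, never the queries
        return list(corpus)
    # Remaining tasks: sample directly from the data being encoded
    return list(corpus)
```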
That sounds good.
Hmm @Samoed how about I annotate the Metadata, and leave the loader
Yes, I'll add the implementation.
@jxmorris12 sorry for bombarding you with questions, but I was wondering whether these config files in your repo are for training data or something else: https://github.com/jxmorris12/cde/blob/main/cde/config/bge.yaml

Also, can you specify which BGE models we are talking about? We have some annotations for BGE training data, but the vanilla models were trained on something other than the bge_m models:

```python
bge_m_training_data = {
    # source: https://arxiv.org/pdf/2402.03216
    "MIRACLRetrieval": ["train"],
    "MIRACLRetrievalHardNegatives": ["train"],
    "MIRACLReranking": ["train"],
    "LeCaRDv2": ["train"],
    "CMedQAv1-reranking": ["train"],
    "CMedQAv2-reranking": ["train"],
    "MrTidyRetrieval": ["train"],
    "T2Reranking": ["train"],
    "MSMARCO": ["train"],
    "MSMARCOHardNegatives": ["train"],
    "NanoMSMARCORetrieval": ["train"],
    "MSMARCO-PL": ["train"],  # translation not trained on
    "NQ": ["train"],
    "NQHardNegatives": ["train"],
    "NanoNQRetrieval": ["train"],
    "NQ-PL": ["train"],  # translation not trained on
    "HotpotQA": ["train"],
    "HotpotQA-PL": ["train"],  # translation not trained on
    "HotpotQAHardNegatives": ["train"],
    # + synthetic data
}

bge_training_data = {
    # source: https://data.baai.ac.cn/details/BAAI-MTP
    "NQ": ["test"],
    "NQHardNegatives": ["test"],
    "AmazonReviewsClassification": [
        "validation",
        "test",
    ],  # assumed from: amazon_reviews_multi
    "MLQARetrieval": [
        "validation",
        "test",
    ],  # assumed from mlqa (question, context)
    # not in mteb
    # Dataset pairs:
    # wudao (title, passage)
    # cmrc2018 (query, context)
    # dureader (query, context)
    # simclue (sentence_a, sentence_b)
    # csl (title, abstract)
    # amazon_reviews_multi (title, body)
    # wiki_atomic_edits (base_sentence, edited_sentence)
    # mlqa (question, context)
    # xlsum (title, summary) (title, text)
    # "sentence-transformers data": [],  # https://huggingface.co/datasets/sentence-transformers/embedding-training-data  # TODO check this further
    # "wikipedia": [],  # title + section title, passage
    # "reddit": [],  # title, body
    # "stackexchange": [],  # (title, upvoted answer) (title+body, upvoted answer)
    # "s2orc": [],  # (title, abstract) (title, citation title) (abstract, citation abstract)
}

bgem3_training_data = {
    # source: https://arxiv.org/abs/2402.03216
    "T2Retrieval": ["train"],
    "DuReader": ["train"],
    "MMarcoReranking": ["train"],
    "CMedQAv2-reranking": ["train"],
    "HotpotQA": ["train"],
    "NQ": ["train"],
    "MSMARCO": ["train"],
    "MrTidyRetrieval": ["train"],
    "MIRACLRetrieval": ["train"],
    "CodeSearchNet": ["train"],
    # not in mteb
    # "s2orc"
    # Wikipedia
    # "xP3"
    # "mC4"
    # "CC-News"
    # "MTP"
    # "NLLB"
    # "CCMatrix"
    # TriviaQA
    # COLIEE
    # PubMedQA
    # SQuAD
    # SimCSE
    # mMARCO-ZH
    # LawGPT
    # NLI-zh2, LeCaRDv2
    # NLI, MultiLongDoc (their synthetic data)
}
```
Okay wait, bge-full-data's content is annotated here, right? https://arxiv.org/pdf/2409.15700
Adding model metadata here: #1856
Hey – thanks! Yeah, that config is correct. I think the data originally comes from here: Making Text Embedders Few-Shot Learners
I'll close this for now, since the new leaderboard has the model. Let's make sure we add an implementation in the future.
I'm confused about what happened here. Did you have any trouble with the implementation? Is there anything I can help with?
No, only the model metadata was added. I'll add the implementation later.
We should add CDE before we launch the leaderboard.
We've had a PR open on this for months (#1521), but it is going nowhere, so we will probably have to take matters into our own hands.