Missing models on leaderboards [WIP] #1848
I went through the first 200 models, since these are the ones that have a mean on the old leaderboard.

```python
models_missing_from_eng_classic = [
    "BAAI/bge-en-icl",
    "yibinlei/LENS-d8000",
    "yibinlei/LENS-d4000",
    "voyageai/voyage-3-m-exp",
    "Alibaba-NLP/gme-Qwen2-VL-7B-Instruct",
    "llmrails/ember-v1",
    "amazon/Titan-text-embeddings-v2",
    "hkunlp/instructor-large",
    "hkunlp/instructor-xl",
    "hkunlp/instructor-base",
    "sentence-transformers/sentence-t5-xxl",  # all sentence-t5s are missing, really
    "elser-v2",  # from Elasticsearch
    "Hum-Works/lodestone-base-4096-v1",
    # LASER and SONAR from Facebook
    # Loads of sentence-transformers models; we should probably add all of these
    # cde models
]

# This might be useful to have, since it's the same model with fewer layers
distillations = [
    "TaylorAI/bge-micro-v2",
]

# Something was off about all of these: stalled technical reports or data
# releases, or incomplete READMEs filled with TODO tags
shady = [
    "raghavlight/TDTE",
    "tsirif/BinGSE-Meta-Llama-3-8B-Instruct",
    "tsirif/BinGSE-Sheared-LLaMA",
    "w601sxs/b1ade-embed",
    "sam-babayev/sf_model_e5",
]

quant = [
    "yoeven/multilingual-e5-large-instruct-Q5_K_M-GGUF",
    "yoeven/multilingual-e5-large-instruct-Q5_0-GGUF",
    "yoeven/multilingual-e5-large-instruct-Q3_K_S-GGUF",
    "JHJHJHJHJ/multilingual-e5-large-instruct-Q5_K_M-GGUF",
    "parasail-ai/GritLM-7B-vllm",
    "Maxthemacaque/onnx-gte-multilingual-base",
    "BookingCare/multilingual-e5-base-similarity-v1-onnx-quantized",
]

empty_readme = [
    "Labib11/MUG-B-1.6",
    "andersonbcdefg/bge-small-4096",
    "princeton-nlp/sup-simcse-bert-base-uncased",
]

no_model = [
    "twadada/gte_wl",
    "twadada/GTE_wl_mv",
    "twadada/GTE512_sw",
    "twadada/GTE256_sw",
    "twadada/l3_wl",
    "twadada/wl_sw_256",
    "twadada/mv_sw",
    "benayad7/concat-e5-small-bge-small-01",
    "lixsh6/XLM-3B5-embedding",
    "lixsh6/XLM-0B6-embedding",
    "lixsh6/MegatronBert-1B3-embedding",
]

# I might be wrong here, and I'm probably missing a lot; just a few examples
outdated = [
    "text-embedding-004-256",
    "text-embedding-004",
    "jinaai/jina-embedding-b-en-v1",
    "jinaai/jina-embedding-s-en-v1",  # there are probably more of these
    "text-similarity-ada-001",
]

# These are cases where there is an original model, and most of them are just duplicate entries
copies = [
    "BASF-AI/nomic-embed-text-v1",
    "BASF-AI/nomic-embed-text-v1.5",
    "fdehlinger/english-4U-bge-small",
    "aliakseilabanau/bge-small-en",
    "lightonai/modernbert-embed-large",
    "lightonai/modernbert-embed-large-unsupervised",
    "nomic-ai/nomic-embed-text-v1.5-128",  # I know these are not duplicate entries, but do we really need all sizes for variable-size embeddings?
    "nomic-ai/nomic-embed-text-v1.5-256",
    "nomic-ai/nomic-embed-text-v1.5-512",
    "jncraton/multilingual-e5-small-ct2-int8",
]
```
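Hand-maintained lists like these are easy to get wrong: the raw dump had `BAAI/bge-en-icl` listed twice, and a missing comma between two string literals silently concatenates adjacent model names. A quick sanity check can catch both; a minimal sketch (the helper names here are mine, not part of any repo):

```python
def find_duplicates(names: list[str]) -> list[str]:
    """Return entries that appear more than once, in first-seen order."""
    seen: set[str] = set()
    dupes: list[str] = []
    for name in names:
        if name in seen and name not in dupes:
            dupes.append(name)
        seen.add(name)
    return dupes


def find_concatenated(names: list[str]) -> list[str]:
    """Flag entries that look like two 'org/model' ids fused by a missing comma."""
    return [n for n in names if n.count("/") > 1]


# Example: a duplicate and a missing-comma concatenation, as in the raw dump.
sample = [
    "BAAI/bge-en-icl",
    "llmrails/ember-v1",
    "BAAI/bge-en-icl",
    "JHJHJHJHJ/multilingual-e5-large-instruct-Q5_K_M-GGUF" "parasail-ai/GritLM-7B-vllm",
]
print(find_duplicates(sample))    # → ['BAAI/bge-en-icl']
print(find_concatenated(sample))  # flags the fused GGUF/GritLM entry
```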
Here is my personal take on this:
I checked the top 10 models for each leaderboard. They seem to be missing the following scores:
MTEB(eng, classic):
MTEB(chinese):
@Samoed can I ask you to add the missing models (results to the results repo + model meta)? Feel free to add a filler class for "modelnotimplemented" in the loader (otherwise we will never catch up with model releases).