[EVAL REQUEST] jina-embeddings-v3 #77

Closed
1 of 17 tasks
kaisugi opened this issue Sep 19, 2024 · 16 comments

Comments

@kaisugi

kaisugi commented Sep 19, 2024

Basic model information

name: jina-embeddings-v3
type: XLMRoBERTa (+ LoRA Adapter)
size: 559M (572M including the LoRA adapters)
lang: multilingual

Model details

(screenshot of model details)

https://arxiv.org/abs/2409.10173
https://huggingface.co/jinaai/jina-embeddings-v3

Seen/unseen declaration

Among the JMTEB evaluation datasets, please check any dataset whose training split was used for model training, or whose validation set was used for hyperparameter tuning or early stopping.

  • Classification
    • Amazon Review Classification
    • Amazon Counterfactual Classification
    • Massive Intent Classification
    • Massive Scenario Classification
  • Clustering
    • Livedoor News
    • MewsC-16-ja
  • STS
    • JSTS
    • JSICK
  • Pair Classification
    • PAWS-X-ja
  • Retrieval
    • JAQKET
    • Mr.TyDi-ja (The original English version seems to have been used)
    • JaGovFaqs-22k
    • NLP Journal title-abs
    • NLP Journal title-intro
    • NLP Journal abs-intro
  • Reranking
    • Esci
  • No declaration

Evaluation script

Other information

@kaisugi
Author

kaisugi commented Sep 25, 2024

Note: I found on X (Twitter) that one of the authors (@bwanglzu) has already completed the evaluations 😳
https://x.com/bo_wangbo/status/1838919204377911477

@lsz05
Collaborator

lsz05 commented Sep 26, 2024

Thank you for the information!
I tried to run the evaluation of the model yesterday, but didn't succeed. Still debugging now.

@kaisugi
Author

kaisugi commented Sep 26, 2024

Great, looking forward to the official results 😊

@lsz05
Collaborator

lsz05 commented Sep 27, 2024

@kaisugi

I tried the model on the fast datasets, and found that the tasks other than Classification worked better without LoRA than with LoRA. In the Retrieval task, the results were better without prefixes.

The results most similar to https://x.com/bo_wangbo/status/1838919204377911477 are those with no prefixes, and no LoRA except for Classification.

My results are as follows:

  • no prefixes, no LoRA except Classification
{
    "Classification": {
        "amazon_counterfactual_classification": {
            "macro_f1": 0.7949948725329687
        },
        "massive_intent_classification": {
            "macro_f1": 0.7766347542682803
        },
        "massive_scenario_classification": {
            "macro_f1": 0.8982075621284786
        }
    },
    "Retrieval": {
        "jagovfaqs_22k": {
            "ndcg@10": 0.7449944044307708
        },
        "nlp_journal_abs_intro": {
            "ndcg@10": 0.9941946751679634
        },
        "nlp_journal_title_abs": {
            "ndcg@10": 0.9717376985433034
        },
        "nlp_journal_title_intro": {
            "ndcg@10": 0.9609029386920315
        }
    },
    "STS": {
        "jsick": {
            "spearman": 0.8146985042196159
        },
        "jsts": {
            "spearman": 0.8068520872331155
        }
    },
    "Clustering": {
        "livedoor_news": {
            "v_measure_score": 0.5036707354224619
        },
        "mewsc16": {
            "v_measure_score": 0.474391205388421
        }
    },
    "PairClassification": {
        "paws_x_ja": {
            "binary_f1": 0.623716814159292
        }
    }
}
  • no prefixes, with LoRA
{
    "Classification": {
        "amazon_counterfactual_classification": {
            "macro_f1": 0.7949948725329687
        },
        "massive_intent_classification": {
            "macro_f1": 0.7766347542682803
        },
        "massive_scenario_classification": {
            "macro_f1": 0.8982075621284786
        }
    },
    "Retrieval": {
        "jagovfaqs_22k": {
            "ndcg@10": 0.7255870901661032
        },
        "nlp_journal_abs_intro": {
            "ndcg@10": 0.9829431790599418
        },
        "nlp_journal_title_abs": {
            "ndcg@10": 0.9552122947731903
        },
        "nlp_journal_title_intro": {
            "ndcg@10": 0.9324205002364649
        }
    },
    "STS": {
        "jsick": {
            "spearman": 0.7816133481804449
        },
        "jsts": {
            "spearman": 0.8193021839272429
        }
    },
    "Clustering": {
        "livedoor_news": {
            "v_measure_score": 0.5387525923415666
        },
        "mewsc16": {
            "v_measure_score": 0.43532523021586217
        }
    },
    "PairClassification": {
        "paws_x_ja": {
            "binary_f1": 0.623716814159292
        }
    }
}
  • with prefixes, with LoRA
{
    "Classification": {
        "amazon_counterfactual_classification": {
            "macro_f1": 0.7949948725329687
        },
        "massive_intent_classification": {
            "macro_f1": 0.7766347542682803
        },
        "massive_scenario_classification": {
            "macro_f1": 0.8982075621284786
        }
    },
    "Retrieval": {
        "jagovfaqs_22k": {
            "ndcg@10": 0.7157443309160252
        },
        "nlp_journal_abs_intro": {
            "ndcg@10": 0.9849100129100982
        },
        "nlp_journal_title_abs": {
            "ndcg@10": 0.9560377251324601
        },
        "nlp_journal_title_intro": {
            "ndcg@10": 0.9372937234643258
        }
    },
    "STS": {
        "jsick": {
            "spearman": 0.7816133481804449
        },
        "jsts": {
            "spearman": 0.8193021839272429
        }
    },
    "Clustering": {
        "livedoor_news": {
            "v_measure_score": 0.5313213726075848
        },
        "mewsc16": {
            "v_measure_score": 0.43532523021586217
        }
    },
    "PairClassification": {
        "paws_x_ja": {
            "binary_f1": 0.623716814159292
        }
    }
}

LoRA settings (if w/; see the usage sketch after this list):

  • classification: Classification
  • text-matching: STS, PairClassification
  • separation: Clustering, Reranking
  • retrieval.query: Retrieval (when encoding queries)
  • retrieval.passage: Retrieval (when encoding documents)
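
For reference, a minimal sketch of how this adapter mapping translates into encode() calls (assuming jina-embeddings-v3's documented SentenceTransformers usage with trust_remote_code; this is not the actual evaluation code):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)

# Each JMTEB task family selects the corresponding LoRA adapter via `task`.
cls_emb = model.encode(["text"], task="classification")         # Classification
sts_emb = model.encode(["text"], task="text-matching")          # STS, PairClassification
clu_emb = model.encode(["text"], task="separation")             # Clustering, Reranking
qry_emb = model.encode(["query"], task="retrieval.query")       # Retrieval (queries)
doc_emb = model.encode(["document"], task="retrieval.passage")  # Retrieval (documents)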

Prefix settings (if w/):

@kaisugi
Author

kaisugi commented Sep 27, 2024

Thank you so much for your hard work!

@bwanglzu

bwanglzu commented Sep 27, 2024

Hi @lsz05 @courage, I hacked the code a bit to make it work. The things I changed:

  1. src/jmteb/embedders/base.py

In the TextEmbedder class, I added a task parameter to make sure the task is correctly sent to the encode function.

(screenshot)
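
Roughly, the change looks like this (an illustrative sketch, not the exact JMTEB code; the real base class and signature may differ):

# src/jmteb/embedders/base.py (sketch)
from abc import ABC, abstractmethod

import numpy as np


class TextEmbedder(ABC):
    """Base embedder; `task` is threaded through so that task-specific LoRA
    adapters (as in jina-embeddings-v3) can be selected at encode time."""

    @abstractmethod
    def encode(self, text: str | list[str], task: str | None = None) -> np.ndarray:
        raise NotImplementedError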

  2. src/jmteb/embedders/sbert_embedder.py

In the SentenceBertEmbedder class, I changed max_seq_length to 512 since some of the tasks (Mr.TyDi) are too slow, and I added prompt_name and task to the encode function; prompt_name is set to the same value as the task, as defined here: we use 2 instructions for the retrieval adapter.

(screenshot)
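
Sketch of the idea (illustrative only; assuming the encode(..., task=..., prompt_name=...) interface exposed by jina-embeddings-v3's custom SentenceTransformers code, not the actual JMTEB SentenceBertEmbedder):

# src/jmteb/embedders/sbert_embedder.py (sketch)
from sentence_transformers import SentenceTransformer


class SentenceBertEmbedder:
    def __init__(self, model_name_or_path: str, max_seq_length: int = 512) -> None:
        self.model = SentenceTransformer(model_name_or_path, trust_remote_code=True)
        # Cap the sequence length so slow tasks (e.g. Mr.TyDi-ja) finish faster.
        self.model.max_seq_length = max_seq_length

    def encode(self, text, task: str | None = None):
        # `task` selects the LoRA adapter; `prompt_name` selects the matching
        # instruction prefix defined in the model repo (same name as the task).
        kwargs = {"task": task, "prompt_name": task} if task else {}
        return self.model.encode(text, **kwargs)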

  3. Only when running the Retrieval task, I modified /src/jmteb/evaluators/retrieval/evaluator.py to send a different task during indexing and searching:

(screenshot)
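
Conceptually, the change amounts to something like this (hypothetical names; only the idea of a different task per stage is the actual change):

# src/jmteb/evaluators/retrieval/evaluator.py (sketch; hypothetical names)
def embed_for_retrieval(embedder, corpus_texts, query_texts):
    # Use the passage adapter when indexing the corpus ...
    doc_embeddings = embedder.encode(corpus_texts, task="retrieval.passage")
    # ... and the query adapter when encoding search queries.
    query_embeddings = embedder.encode(query_texts, task="retrieval.query")
    return doc_embeddings, query_embeddings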

I admit my code is a bit "dirty" as I only wanted to quickly check the results :) hopefully you understand. If I missed anything in your code base that leads to a different eval result, please let me know :)

@bwanglzu

bwanglzu commented Sep 27, 2024

But I'm also quite surprised (in a good way) that your score is better than what I reported lol :) Maybe there is something wrong in my code, but at least it's not worse. For mewsc16 clustering I noticed my score is higher than yours; this is what I have:

{
    "metric_name": "v_measure_score",
    "metric_value": 0.4966872142615049,
    "details": {
        "optimal_clustering_model_name": "AgglomerativeClustering",
        "val_scores": {
            "MiniBatchKMeans": {
                "v_measure_score": 0.4573582252706992,
                "homogeneity_score": 0.49434785878175236,
                "completeness_score": 0.425518738350574
            },
            "AgglomerativeClustering": {
                "v_measure_score": 0.5159727698724647,
                "homogeneity_score": 0.558382996336062,
                "completeness_score": 0.47955005205722434
            },
            "BisectingKMeans": {
                "v_measure_score": 0.45289840369081835,
                "homogeneity_score": 0.4964330478306176,
                "completeness_score": 0.4163836804409629
            },
            "Birch": {
                "v_measure_score": 0.4943869746128702,
                "homogeneity_score": 0.5396066604339305,
                "completeness_score": 0.4561602021543821
            }
        },
        "test_scores": {
            "AgglomerativeClustering": {
                "v_measure_score": 0.4966872142615049,
                "homogeneity_score": 0.5340024254176485,
                "completeness_score": 0.4642464368511074
            }
        }
    }
}

@lsz05
Collaborator

lsz05 commented Sep 27, 2024

> Hi @lsz05 @courage, I hacked the code a bit to make it work. The things I changed: […]

I think I'm doing the same thing as you in #80

@lsz05
Collaborator

lsz05 commented Sep 27, 2024

> But I'm also quite surprised (in a good way) that your score is better than what I reported lol :) […] for mewsc16 clustering I noticed my score is higher than yours […]

I think I'll have to fix some randomness problems (e.g., fixing the random seed in training so that everything can be exactly reproduced) in Clustering and Classification (where training is conducted). The method that works best on the dev set is the one chosen; in my case Birch worked slightly better on dev but not so well on test, so the test score is not as high as in your eval.
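
For example, pinning random_state on the stochastic clustering models would look roughly like this (a minimal sketch, assuming the scikit-learn models listed in the results; not the actual JMTEB code):

from sklearn.cluster import BisectingKMeans, MiniBatchKMeans

SEED = 42
n_clusters = 12  # hypothetical; in practice, the number of gold clusters in the dataset
candidate_models = {
    "MiniBatchKMeans": MiniBatchKMeans(n_clusters=n_clusters, random_state=SEED),
    "BisectingKMeans": BisectingKMeans(n_clusters=n_clusters, random_state=SEED),
    # AgglomerativeClustering and Birch are deterministic given the data,
    # so they take no random_state.
}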

My result is as follows:

{
    "metric_name": "v_measure_score",
    "metric_value": 0.474391205388421,
    "details": {
        "optimal_clustering_model_name": "Birch",
        "val_scores": {
            "MiniBatchKMeans": {
                "v_measure_score": 0.45751218122353327,
                "homogeneity_score": 0.5000149261766943,
                "completeness_score": 0.42166906571540486
            },
            "AgglomerativeClustering": {
                "v_measure_score": 0.4884748969401506,
                "homogeneity_score": 0.5211802377702618,
                "completeness_score": 0.45963186760591423
            },
            "BisectingKMeans": {
                "v_measure_score": 0.4051884446721869,
                "homogeneity_score": 0.4429226569148086,
                "completeness_score": 0.3733789195189944
            },
            "Birch": {
                "v_measure_score": 0.48868192903235214,
                "homogeneity_score": 0.529365428957467,
                "completeness_score": 0.45380546454681364
            }
        },
        "test_scores": {
            "Birch": {
                "v_measure_score": 0.474391205388421,
                "homogeneity_score": 0.5112647214750645,
                "completeness_score": 0.44247868671235824
            }
        }
    }
}

@bwanglzu

I think your PR looks good, maybe two things:

  1. I'm using model.half() to make it a bit faster.
  2. The sequence length is set to 512 to make it a bit faster.

I'm not sure why using LoRA makes the performance a bit worse than without LoRA (for example, on STS). Using LoRA is always my default choice :)

One small thing to notice is that the prefix is only applied to Retrieval, not to other tasks.
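
For reference, the two speed tricks above amount to something like this (a minimal sketch using the standard sentence-transformers API):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)
model.max_seq_length = 512  # cap input length so slow tasks (e.g. Mr.TyDi-ja) run faster
model = model.half()        # fp16 weights roughly halve encoding time on GPU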

@bwanglzu

BTW, have you considered moving JMTEB to the official MTEB leaderboard? This would greatly simplify your work.

@lsz05
Collaborator

lsz05 commented Oct 2, 2024

@bwanglzu @kaisugi
I have updated the full results in #81. Would you please take a look?

@lsz05
Collaborator

lsz05 commented Oct 2, 2024

> I think your PR looks good, maybe two things: model.half() and a sequence length of 512 to make it a bit faster. […]

I used neither half precision nor a sequence length of 512, and the full evaluation took about half a day.

I examined how half precision affects the evaluation results: the scores don't change significantly, while the time is reduced to less than half. (But as it was a weekend, I didn't use half precision to speed things up.)

I applied your prefixes to Retrieval in the full evaluation, as written in your Hugging Face repo.

@lsz05
Collaborator

lsz05 commented Oct 2, 2024

> BTW, have you considered moving JMTEB to the official MTEB leaderboard? This would greatly simplify your work.

We are considering it, but we're also concerned about some differences (e.g., the usage of the dev set).

Someone has worked on it, but it's not fully finished: embeddings-benchmark/mteb#749

@Samoed

Samoed commented Oct 3, 2024

@lsz05 I'm finishing adding the rest of the datasets in embeddings-benchmark/mteb#1262

@lsz05
Collaborator

lsz05 commented Nov 27, 2024

Let me close this issue with #81. Feel free to reopen if there's anything remaining to be done.

@lsz05 lsz05 closed this as completed Nov 27, 2024