Added SemRel2024

w11wo committed Jul 4, 2024
1 parent 35941ca commit 276222e
Showing 7 changed files with 116 additions and 8 deletions.
28 changes: 27 additions & 1 deletion README.md
@@ -14,10 +14,12 @@ Inspired by [Thai Sentence Vector Benchmark](https://github.com/mrpeerat/Thai-Se

### Semantic Textual Similarity

We believe that a synthetic baseline is better than no baseline. Therefore, we followed the approach taken in the Thai Sentence Vector Benchmark project and translated the [STS-B](https://github.com/facebookresearch/SentEval) dev and test set to Indonesian via the Google Translate API. This dataset will be used to evaluate our models' Spearman correlation scores on the translated test set.
We followed the approach taken in the Thai Sentence Vector Benchmark project and translated the [STS-B](https://github.com/facebookresearch/SentEval) dev and test set to Indonesian via the Google Translate API. This dataset will be used to evaluate our models' Spearman correlation scores on the translated test set.

> You can find the translated dataset on [🤗 HuggingFace Hub](https://huggingface.co/datasets/LazarusNLP/stsb_mt_id).
Further, we similarly evaluate our models on the [SemRel2024](https://huggingface.co/datasets/SemRel/SemRel2024) dataset, which contains human-annotated Indonesian semantic textual relatedness (STR) data. The dataset consists of two splits: `dev` and `test`. We report our models' Spearman correlation scores on both splits.

### Retrieval

To assess our models' retrieval capability, we evaluate them on the Indonesian subsets of the MIRACL and TyDiQA datasets. Both datasets test a model's ability to retrieve relevant documents given a query. We report R@1 (top-1 accuracy), MRR@10, and nDCG@10 to measure performance.
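
For orientation, the following is a small, self-contained sketch of how these three metrics are commonly defined for a single query with binary relevance. It is illustrative only, not the evaluation code used in this repository, and the document IDs are made up.

```python
import math

def rank_metrics(ranked_ids, relevant_ids, k=10):
    """Illustrative R@1, MRR@k, and nDCG@k for one query with binary relevance."""
    relevant = set(relevant_ids)
    # R@1: is the top-ranked document relevant?
    r_at_1 = 1.0 if ranked_ids and ranked_ids[0] in relevant else 0.0
    # MRR@k: reciprocal rank of the first relevant document within the top k
    mrr = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            mrr = 1.0 / rank
            break
    # nDCG@k: discounted gain of relevant hits, normalized by the ideal ranking
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return r_at_1, mrr, ndcg

print(rank_metrics(["d3", "d1", "d7"], {"d1", "d9"}))  # (0.0, 0.5, ~0.39)
```
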
@@ -112,6 +114,30 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 79.72 |
| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 79.44 |

#### SemRel2024: Semantic Textual Relatedness (STR)

| Model | `dev` Spearman's Correlation (%) ↑ | `test` Spearman's Correlation (%) ↑ |
| --------------------------------------------------------------------------------------------------------------------------- | :--------------------------------: | :---------------------------------: |
| [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 30.64 | 36.77 |
| [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 35.95 | 41.73 |
| [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 35.05 | 39.14 |
| [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 33.71 | 37.73 |
| [ConGen-Indo-e5 Small](https://huggingface.co/LazarusNLP/congen-indo-e5-small) | 36.35 | 42.47 |
| [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 41.50 | 43.25 |
| [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | **42.87** | 38.78 |
| [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 41.68 | 40.42 |
| [all-IndoBERT Base-v4](https://huggingface.co/LazarusNLP/all-indobert-base-v4) | 41.38 | 38.05 |
| [all-NusaBERT Base-v4](https://huggingface.co/LazarusNLP/all-nusabert-base-v4) | 42.11 | 41.55 |
| [all-NusaBERT Large-v4](https://huggingface.co/LazarusNLP/all-nusabert-large-v4) | 40.21 | 42.25 |
| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 39.79 | 43.85 |
| [all-Indo-e5 Small-v3](https://huggingface.co/LazarusNLP/all-indo-e5-small-v3) | 40.25 | 42.60 |
| [all-Indo-e5 Small-v4](https://huggingface.co/LazarusNLP/all-indo-e5-small-v4) | 40.20 | 42.90 |
| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 37.22 | 49.35 |
| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 34.56 | 37.51 |
| [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 41.92 | **49.60** |
| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 41.29 | 45.04 |
| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 39.20 | 45.04 |

### Retrieval

#### MIRACL
22 changes: 21 additions & 1 deletion docs/evaluation/evaluation.md
@@ -2,7 +2,7 @@

## Machine Translated STS-B

To the best of our knowledge, there is no official benchmark on Indonesian sentence embeddings. Inspired by [Thai Sentence Vector Benchmark](https://github.com/mrpeerat/Thai-Sentence-Vector-Benchmark), we translated the [STS-B](https://github.com/facebookresearch/SentEval) dev and test set to Indonesian via the Google Translate API. This dataset will be used to evaluate our models' Spearman correlation scores on the translated test set. You can find the translated dataset on [🤗 HuggingFace Hub](https://huggingface.co/datasets/LazarusNLP/stsb_mt_id).
Inspired by [Thai Sentence Vector Benchmark](https://github.com/mrpeerat/Thai-Sentence-Vector-Benchmark), we translated the [STS-B](https://github.com/facebookresearch/SentEval) dev and test set to Indonesian via the Google Translate API. This dataset will be used to evaluate our models' Spearman correlation scores on the translated test set. You can find the translated dataset on [🤗 HuggingFace Hub](https://huggingface.co/datasets/LazarusNLP/stsb_mt_id).

For practical purposes, we used Sentence Transformer's [`EmbeddingSimilarityEvaluator`](https://www.sbert.net/docs/package_reference/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator) to perform inference and evaluate our models.

@@ -19,6 +19,26 @@ python sts/eval_sts.py \
--test-batch-size 32
```
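
Under the hood, this is roughly what the evaluation does. The sketch below is a simplified approximation, not the script itself: the text columns `text_1`/`text_2` follow the script defaults, while the label column name (`correlation`) and its 0–5 scale are assumptions to verify against the dataset card.

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

test_ds = load_dataset("LazarusNLP/stsb_mt_id", split="test")
model = SentenceTransformer("LazarusNLP/congen-indobert-base")

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=test_ds["text_1"],
    sentences2=test_ds["text_2"],
    # Assumed label column; rescaling to [0, 1] mirrors the script and does
    # not affect the rank-based Spearman score
    scores=[s / 5.0 for s in test_ds["correlation"]],
    batch_size=32,
)
print(evaluator(model))  # primary metric: Spearman correlation of cosine similarities
```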

## SemRel2024: Semantic Textual Relatedness (STR)

SemRel2024 is a collection of Semantic Textual Relatedness (STR) datasets for 14 languages, including African and Asian languages. The datasets are composed of sentence pairs, each assigned a relatedness score between 0 (completely unrelated) and 1 (maximally related), covering a wide range of relatedness values. The SemRel2024 dataset was used in SemEval-2024 Shared Task 1, which evaluates the ability of systems to measure the semantic relatedness between two sentences.

We used Sentence Transformer's [`EmbeddingSimilarityEvaluator`](https://www.sbert.net/docs/package_reference/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator) to perform inference and evaluate our models.

### Example

```sh
python sts/eval_sts.py \
--model-name LazarusNLP/congen-indobert-base \
--test-dataset-name SemRel/SemRel2024 \
--test-dataset-config ind \
--test-dataset-split test \
--test-text-column-1 sentence1 \
--test-text-column-2 sentence2 \
--test-label-column label \
--test-batch-size 32
```
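
Note that, unlike STS-B's 0–5 labels, SemRel2024 scores already lie in [0, 1]; the updated `eval_sts.py` in this commit divides labels by the maximum observed value, which handles both scales. As a quick, hypothetical sanity check, an equivalent Spearman score can be computed directly, assuming `scipy` is installed and using the column names from the command above:

```python
from datasets import load_dataset
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

# Load the Indonesian config of SemRel2024 and a model to check
ds = load_dataset("SemRel/SemRel2024", "ind", split="test")
model = SentenceTransformer("LazarusNLP/congen-indobert-base")

# With normalized embeddings, the row-wise dot product is the cosine similarity
emb1 = model.encode(ds["sentence1"], convert_to_tensor=True, normalize_embeddings=True)
emb2 = model.encode(ds["sentence2"], convert_to_tensor=True, normalize_embeddings=True)
cosine = (emb1 * emb2).sum(dim=-1).cpu().numpy()

# Spearman correlation is rank-based, so the label scale does not matter here
corr, _ = spearmanr(cosine, ds["label"])
print(f"Spearman: {corr:.4f}")
```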

## MIRACL (Multilingual Information Retrieval Across a Continuum of Languages)

MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world. We evaluated our models on the Indonesian subset of MIRACL.
28 changes: 27 additions & 1 deletion docs/index.md
@@ -14,10 +14,12 @@ Inspired by [Thai Sentence Vector Benchmark](https://github.com/mrpeerat/Thai-Se

### Semantic Textual Similarity

We believe that a synthetic baseline is better than no baseline. Therefore, we followed the approach taken in the Thai Sentence Vector Benchmark project and translated the [STS-B](https://github.com/facebookresearch/SentEval) dev and test set to Indonesian via the Google Translate API. This dataset will be used to evaluate our models' Spearman correlation scores on the translated test set.
We followed the approach taken in the Thai Sentence Vector Benchmark project and translated the [STS-B](https://github.com/facebookresearch/SentEval) dev and test set to Indonesian via the Google Translate API. This dataset will be used to evaluate our models' Spearman correlation scores on the translated test set.

> You can find the translated dataset on [🤗 HuggingFace Hub](https://huggingface.co/datasets/LazarusNLP/stsb_mt_id).
Further, we similarly evaluate our models on the [SemRel2024](https://huggingface.co/datasets/SemRel/SemRel2024) dataset, which contains human-annotated Indonesian semantic textual relatedness (STR) data. The dataset consists of two splits: `dev` and `test`. We report our models' Spearman correlation scores on both splits.

### Retrieval

To assess our models' retrieval capability, we evaluate them on the Indonesian subsets of the MIRACL and TyDiQA datasets. Both datasets test a model's ability to retrieve relevant documents given a query. We report R@1 (top-1 accuracy), MRR@10, and nDCG@10 to measure performance.
@@ -109,6 +111,30 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 79.72 |
| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 79.44 |

#### SemRel2024: Semantic Textual Relatedness (STR)

| Model | `dev` Spearman's Correlation (%) ↑ | `test` Spearman's Correlation (%) ↑ |
| --------------------------------------------------------------------------------------------------------------------------- | :--------------------------------: | :---------------------------------: |
| [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 30.64 | 36.77 |
| [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 35.95 | 41.73 |
| [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 35.05 | 39.14 |
| [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 33.71 | 37.73 |
| [ConGen-Indo-e5 Small](https://huggingface.co/LazarusNLP/congen-indo-e5-small) | 36.35 | 42.47 |
| [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 41.50 | 43.25 |
| [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | **42.87** | 38.78 |
| [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 41.68 | 40.42 |
| [all-IndoBERT Base-v4](https://huggingface.co/LazarusNLP/all-indobert-base-v4) | 41.38 | 38.05 |
| [all-NusaBERT Base-v4](https://huggingface.co/LazarusNLP/all-nusabert-base-v4) | 42.11 | 41.55 |
| [all-NusaBERT Large-v4](https://huggingface.co/LazarusNLP/all-nusabert-large-v4) | 40.21 | 42.25 |
| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 39.79 | 43.85 |
| [all-Indo-e5 Small-v3](https://huggingface.co/LazarusNLP/all-indo-e5-small-v3) | 40.25 | 42.60 |
| [all-Indo-e5 Small-v4](https://huggingface.co/LazarusNLP/all-indo-e5-small-v4) | 40.20 | 42.90 |
| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 37.22 | 49.35 |
| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 34.56 | 37.51 |
| [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 41.92 | **49.60** |
| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 41.29 | 45.04 |
| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 39.20 | 45.04 |

### Retrieval

#### MIRACL
22 changes: 21 additions & 1 deletion evaluation/README.md
@@ -2,7 +2,7 @@

## Machine Translated STS-B

To the best of our knowledge, there is no official benchmark on Indonesian sentence embeddings. Inspired by [Thai Sentence Vector Benchmark](https://github.com/mrpeerat/Thai-Sentence-Vector-Benchmark), we translated the [STS-B](https://github.com/facebookresearch/SentEval) dev and test set to Indonesian via the Google Translate API. This dataset will be used to evaluate our models' Spearman correlation scores on the translated test set. You can find the translated dataset on [🤗 HuggingFace Hub](https://huggingface.co/datasets/LazarusNLP/stsb_mt_id).
Inspired by [Thai Sentence Vector Benchmark](https://github.com/mrpeerat/Thai-Sentence-Vector-Benchmark), we translated the [STS-B](https://github.com/facebookresearch/SentEval) dev and test set to Indonesian via the Google Translate API. This dataset will be used to evaluate our models' Spearman correlation scores on the translated test set. You can find the translated dataset on [🤗 HuggingFace Hub](https://huggingface.co/datasets/LazarusNLP/stsb_mt_id).

For practical purposes, we used Sentence Transformer's [`EmbeddingSimilarityEvaluator`](https://www.sbert.net/docs/package_reference/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator) to perform inference and evaluate our models.

@@ -19,6 +19,26 @@ python sts/eval_sts.py \
--test-batch-size 32
```

## SemRel2024: Semantic Textual Relatedness (STR)

SemRel2024 is a collection of Semantic Textual Relatedness (STR) datasets for 14 languages, including African and Asian languages. The datasets are composed of sentence pairs, each assigned a relatedness score between 0 (completely unrelated) and 1 (maximally related), covering a wide range of relatedness values. The SemRel2024 dataset was used in SemEval-2024 Shared Task 1, which evaluates the ability of systems to measure the semantic relatedness between two sentences.

We used Sentence Transformer's [`EmbeddingSimilarityEvaluator`](https://www.sbert.net/docs/package_reference/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator) to perform inference and evaluate our models.

### Example

```sh
python sts/eval_sts.py \
--model-name LazarusNLP/congen-indobert-base \
--test-dataset-name SemRel/SemRel2024 \
--test-dataset-config ind \
--test-dataset-split test \
--test-text-column-1 sentence1 \
--test-text-column-2 sentence2 \
--test-label-column label \
--test-batch-size 32
```

## MIRACL (Multilingual Information Retrieval Across a Continuum of Languages)

MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world. We evaluated our models on the Indonesian subset of MIRACL.
14 changes: 14 additions & 0 deletions evaluation/run_evaluation.sh
@@ -76,6 +76,20 @@ python sts/eval_sts.py \
--test-batch-size 32 \
--output-folder sts/results/$model_name

for split in dev test
do
python sts/eval_sts.py \
--model-name $model \
--test-dataset-name SemRel/SemRel2024 \
--test-dataset-config ind \
--test-dataset-split $split \
--test-text-column-1 sentence1 \
--test-text-column-2 sentence2 \
--test-label-column label \
--test-batch-size 32 \
--output-folder sts/results/$model_name
done

###############################
# MTEB TASKS
###############################
6 changes: 4 additions & 2 deletions evaluation/sts/eval_sts.py
@@ -11,6 +11,7 @@
class Args:
model_name: str = "sentence-transformers/distiluse-base-multilingual-cased-v2"
test_dataset_name: str = "LazarusNLP/stsb_mt_id"
test_dataset_config: str = "default"
test_dataset_split: str = "test"
test_text_column_1: str = "text_1"
test_text_column_2: str = "text_2"
@@ -25,12 +26,13 @@ def main(args: Args):
model = SentenceTransformer(args.model_name)

# Load dataset
test_ds = load_dataset(args.test_dataset_name, split=args.test_dataset_split)
test_ds = load_dataset(args.test_dataset_name, args.test_dataset_config, split=args.test_dataset_split)
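# Normalize labels by the observed maximum so scores fall in [0, 1],
# whatever the dataset's label scale (0-5 for STS-B, 0-1 for SemRel2024)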
max_label_value = float(max(test_ds[args.test_label_column]))

test_data = [
InputExample(
texts=[data[args.test_text_column_1], data[args.test_text_column_2]],
label=float(data[args.test_label_column]) / 5.0,
label=float(data[args.test_label_column]) / max_label_value,
)
for data in test_ds
]