Added SemRel2024

w11wo committed Jul 4, 2024
1 parent 35941ca commit 276222e
Showing 7 changed files with 116 additions and 8 deletions.
28 changes: 27 additions & 1 deletion README.md
@@ -14,10 +14,12 @@ Inspired by [Thai Sentence Vector Benchmark](https://github.com/mrpeerat/Thai-Se

### Semantic Textual Similarity

We believe that a synthetic baseline is better than no baseline. Therefore, we followed the approach taken in the Thai Sentence Vector Benchmark project and translated the [STS-B](https://github.com/facebookresearch/SentEval) dev and test set to Indonesian via the Google Translate API. This dataset will be used to evaluate our models' Spearman correlation scores on the translated test set.
We followed the approach taken in the Thai Sentence Vector Benchmark project and translated the [STS-B](https://github.com/facebookresearch/SentEval) dev and test set to Indonesian via the Google Translate API. This dataset will be used to evaluate our models' Spearman correlation scores on the translated test set.

> You can find the translated dataset on [🤗 HuggingFace Hub](https://huggingface.co/datasets/LazarusNLP/stsb_mt_id).
Further, we similarly evaluate our models on the [SemRel2024](https://huggingface.co/datasets/SemRel/SemRel2024) dataset, which contains human-annotated Indonesian semantic textual relatedness (STR) data. The dataset consists of two splits: `dev` and `test`. We report our models' Spearman correlation scores on both splits.

### Retrieval

To assess our models' retrieval capability, we evaluate them on the Indonesian subsets of the MIRACL and TyDiQA datasets. Both datasets test a model's ability to retrieve relevant documents given a query. We report R@1 (top-1 accuracy), MRR@10, and nDCG@10 to measure performance.
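
For orientation, the following is a small, self-contained sketch of how these three metrics are commonly defined for a single query with binary relevance. It is illustrative only, not the evaluation code used in this repository, and the document IDs are made up.

```python
import math

def rank_metrics(ranked_ids, relevant_ids, k=10):
    """Illustrative R@1, MRR@k, and nDCG@k for one query with binary relevance."""
    relevant = set(relevant_ids)
    # R@1: is the top-ranked document relevant?
    r_at_1 = 1.0 if ranked_ids and ranked_ids[0] in relevant else 0.0
    # MRR@k: reciprocal rank of the first relevant document within the top k
    mrr = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            mrr = 1.0 / rank
            break
    # nDCG@k: discounted gain of relevant hits, normalized by the ideal ranking
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return r_at_1, mrr, ndcg

print(rank_metrics(["d3", "d1", "d7"], {"d1", "d9"}))  # (0.0, 0.5, ~0.39)
```
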
@@ -112,6 +114,30 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 79.72 |
| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 79.44 |

#### SemRel2024: Semantic Textual Relatedness (STR)

| Model | `dev` Spearman's Correlation (%) ↑ | `test` Spearman's Correlation (%) ↑ |
| --------------------------------------------------------------------------------------------------------------------------- | :--------------------------------: | :---------------------------------: |
| [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 30.64 | 36.77 |
| [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 35.95 | 41.73 |
| [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 35.05 | 39.14 |
| [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 33.71 | 37.73 |
| [ConGen-Indo-e5 Small](https://huggingface.co/LazarusNLP/congen-indo-e5-small) | 36.35 | 42.47 |
| [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 41.50 | 43.25 |
| [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | **42.87** | 38.78 |
| [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 41.68 | 40.42 |
| [all-IndoBERT Base-v4](https://huggingface.co/LazarusNLP/all-indobert-base-v4) | 41.38 | 38.05 |
| [all-NusaBERT Base-v4](https://huggingface.co/LazarusNLP/all-nusabert-base-v4) | 42.11 | 41.55 |
| [all-NusaBERT Large-v4](https://huggingface.co/LazarusNLP/all-nusabert-large-v4) | 40.21 | 42.25 |
| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 39.79 | 43.85 |
| [all-Indo-e5 Small-v3](https://huggingface.co/LazarusNLP/all-indo-e5-small-v3) | 40.25 | 42.60 |
| [all-Indo-e5 Small-v4](https://huggingface.co/LazarusNLP/all-indo-e5-small-v4) | 40.20 | 42.90 |
| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 37.22 | 49.35 |
| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 34.56 | 37.51 |
| [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 41.92 | **49.60** |
| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 41.29 | 45.04 |
| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 39.20 | 45.04 |

### Retrieval

#### MIRACL
22 changes: 21 additions & 1 deletion docs/evaluation/evaluation.md
@@ -2,7 +2,7 @@

## Machine Translated STS-B

To the best of our knowledge, there is no official benchmark on Indonesian sentence embeddings. Inspired by [Thai Sentence Vector Benchmark](https://github.com/mrpeerat/Thai-Sentence-Vector-Benchmark), we translated the [STS-B](https://github.com/facebookresearch/SentEval) dev and test set to Indonesian via the Google Translate API. This dataset will be used to evaluate our models' Spearman correlation scores on the translated test set. You can find the translated dataset on [🤗 HuggingFace Hub](https://huggingface.co/datasets/LazarusNLP/stsb_mt_id).
Inspired by [Thai Sentence Vector Benchmark](https://github.com/mrpeerat/Thai-Sentence-Vector-Benchmark), we translated the [STS-B](https://github.com/facebookresearch/SentEval) dev and test set to Indonesian via the Google Translate API. This dataset will be used to evaluate our models' Spearman correlation scores on the translated test set. You can find the translated dataset on [🤗 HuggingFace Hub](https://huggingface.co/datasets/LazarusNLP/stsb_mt_id).

For practical purposes, we used Sentence Transformer's [`EmbeddingSimilarityEvaluator`](https://www.sbert.net/docs/package_reference/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator) to perform inference and evaluate our models.

@@ -19,6 +19,26 @@ python sts/eval_sts.py \
--test-batch-size 32
```
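
Under the hood, this is roughly what the evaluation does. The sketch below is a simplified approximation, not the script itself: the text columns `text_1`/`text_2` follow the script defaults, while the label column name (`correlation`) and its 0–5 scale are assumptions to verify against the dataset card.

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

test_ds = load_dataset("LazarusNLP/stsb_mt_id", split="test")
model = SentenceTransformer("LazarusNLP/congen-indobert-base")

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=test_ds["text_1"],
    sentences2=test_ds["text_2"],
    # Assumed label column; rescaling to [0, 1] mirrors the script and does
    # not affect the rank-based Spearman score
    scores=[s / 5.0 for s in test_ds["correlation"]],
    batch_size=32,
)
print(evaluator(model))  # primary metric: Spearman correlation of cosine similarities
```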

## SemRel2024: Semantic Textual Relatedness (STR)

SemRel2024 is a collection of Semantic Textual Relatedness (STR) datasets for 14 languages, including African and Asian languages. The datasets are composed of sentence pairs, each assigned a relatedness score between 0 (completely unrelated) and 1 (maximally related), covering a wide range of relatedness values. The SemRel2024 dataset was used in SemEval-2024 Shared Task 1, which evaluates the ability of systems to measure the semantic relatedness between two sentences.

We used Sentence Transformer's [`EmbeddingSimilarityEvaluator`](https://www.sbert.net/docs/package_reference/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator) to perform inference and evaluate our models.

### Example

```sh
python sts/eval_sts.py \
--model-name LazarusNLP/congen-indobert-base \
--test-dataset-name SemRel/SemRel2024 \
--test-dataset-config ind \
--test-dataset-split test \
--test-text-column-1 sentence1 \
--test-text-column-2 sentence2 \
--test-label-column label \
--test-batch-size 32
```
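
Note that, unlike STS-B's 0–5 labels, SemRel2024 scores already lie in [0, 1]; the updated `eval_sts.py` in this commit divides labels by the maximum observed value, which handles both scales. As a quick, hypothetical sanity check, an equivalent Spearman score can be computed directly, assuming `scipy` is installed and using the column names from the command above:

```python
from datasets import load_dataset
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

# Load the Indonesian config of SemRel2024 and a model to check
ds = load_dataset("SemRel/SemRel2024", "ind", split="test")
model = SentenceTransformer("LazarusNLP/congen-indobert-base")

# With normalized embeddings, the row-wise dot product is the cosine similarity
emb1 = model.encode(ds["sentence1"], convert_to_tensor=True, normalize_embeddings=True)
emb2 = model.encode(ds["sentence2"], convert_to_tensor=True, normalize_embeddings=True)
cosine = (emb1 * emb2).sum(dim=-1).cpu().numpy()

# Spearman correlation is rank-based, so the label scale does not matter here
corr, _ = spearmanr(cosine, ds["label"])
print(f"Spearman: {corr:.4f}")
```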

## MIRACL (Multilingual Information Retrieval Across a Continuum of Languages)

MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world. We evaluated our models on the Indonesian subset of MIRACL.
28 changes: 27 additions & 1 deletion docs/index.md
@@ -14,10 +14,12 @@ Inspired by [Thai Sentence Vector Benchmark](https://github.com/mrpeerat/Thai-Se

### Semantic Textual Similarity

We believe that a synthetic baseline is better than no baseline. Therefore, we followed the approach taken in the Thai Sentence Vector Benchmark project and translated the [STS-B](https://github.com/facebookresearch/SentEval) dev and test set to Indonesian via the Google Translate API. This dataset will be used to evaluate our models' Spearman correlation scores on the translated test set.
We followed the approach taken in the Thai Sentence Vector Benchmark project and translated the [STS-B](https://github.com/facebookresearch/SentEval) dev and test set to Indonesian via the Google Translate API. This dataset will be used to evaluate our models' Spearman correlation scores on the translated test set.

> You can find the translated dataset on [🤗 HuggingFace Hub](https://huggingface.co/datasets/LazarusNLP/stsb_mt_id).
Further, we similarly evaluate our models on the [SemRel2024](https://huggingface.co/datasets/SemRel/SemRel2024) dataset, which contains human-annotated Indonesian semantic textual relatedness (STR) data. The dataset consists of two splits: `dev` and `test`. We report our models' Spearman correlation scores on both splits.

### Retrieval

To assess our models' retrieval capability, we evaluate them on the Indonesian subsets of the MIRACL and TyDiQA datasets. Both datasets test a model's ability to retrieve relevant documents given a query. We report R@1 (top-1 accuracy), MRR@10, and nDCG@10 to measure performance.
@@ -109,6 +111,30 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 79.72 |
| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 79.44 |

#### SemRel2024: Semantic Textual Relatedness (STR)

| Model | `dev` Spearman's Correlation (%) ↑ | `test` Spearman's Correlation (%) ↑ |
| --------------------------------------------------------------------------------------------------------------------------- | :--------------------------------: | :---------------------------------: |
| [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 30.64 | 36.77 |
| [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 35.95 | 41.73 |
| [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 35.05 | 39.14 |
| [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 33.71 | 37.73 |
| [ConGen-Indo-e5 Small](https://huggingface.co/LazarusNLP/congen-indo-e5-small) | 36.35 | 42.47 |
| [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 41.50 | 43.25 |
| [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | **42.87** | 38.78 |
| [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 41.68 | 40.42 |
| [all-IndoBERT Base-v4](https://huggingface.co/LazarusNLP/all-indobert-base-v4) | 41.38 | 38.05 |
| [all-NusaBERT Base-v4](https://huggingface.co/LazarusNLP/all-nusabert-base-v4) | 42.11 | 41.55 |
| [all-NusaBERT Large-v4](https://huggingface.co/LazarusNLP/all-nusabert-large-v4) | 40.21 | 42.25 |
| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 39.79 | 43.85 |
| [all-Indo-e5 Small-v3](https://huggingface.co/LazarusNLP/all-indo-e5-small-v3) | 40.25 | 42.60 |
| [all-Indo-e5 Small-v4](https://huggingface.co/LazarusNLP/all-indo-e5-small-v4) | 40.20 | 42.90 |
| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 37.22 | 49.35 |
| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 34.56 | 37.51 |
| [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 41.92 | **49.60** |
| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 41.29 | 45.04 |
| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 39.20 | 45.04 |

### Retrieval

#### MIRACL
22 changes: 21 additions & 1 deletion evaluation/README.md
@@ -2,7 +2,7 @@

## Machine Translated STS-B

To the best of our knowledge, there is no official benchmark on Indonesian sentence embeddings. Inspired by [Thai Sentence Vector Benchmark](https://github.com/mrpeerat/Thai-Sentence-Vector-Benchmark), we translated the [STS-B](https://github.com/facebookresearch/SentEval) dev and test set to Indonesian via the Google Translate API. This dataset will be used to evaluate our models' Spearman correlation scores on the translated test set. You can find the translated dataset on [🤗 HuggingFace Hub](https://huggingface.co/datasets/LazarusNLP/stsb_mt_id).
Inspired by [Thai Sentence Vector Benchmark](https://github.com/mrpeerat/Thai-Sentence-Vector-Benchmark), we translated the [STS-B](https://github.com/facebookresearch/SentEval) dev and test set to Indonesian via the Google Translate API. This dataset will be used to evaluate our models' Spearman correlation scores on the translated test set. You can find the translated dataset on [🤗 HuggingFace Hub](https://huggingface.co/datasets/LazarusNLP/stsb_mt_id).

For practical purposes, we used Sentence Transformer's [`EmbeddingSimilarityEvaluator`](https://www.sbert.net/docs/package_reference/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator) to perform inference and evaluate our models.

@@ -19,6 +19,26 @@ python sts/eval_sts.py \
--test-batch-size 32
```

## SemRel2024: Semantic Textual Relatedness (STR)

SemRel2024 is a collection of Semantic Textual Relatedness (STR) datasets for 14 languages, including African and Asian languages. The datasets are composed of sentence pairs, each assigned a relatedness score between 0 (completely unrelated) and 1 (maximally related), covering a wide range of relatedness values. The SemRel2024 dataset was used in SemEval-2024 Shared Task 1, which evaluates the ability of systems to measure the semantic relatedness between two sentences.

We used Sentence Transformer's [`EmbeddingSimilarityEvaluator`](https://www.sbert.net/docs/package_reference/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator) to perform inference and evaluate our models.

### Example

```sh
python sts/eval_sts.py \
--model-name LazarusNLP/congen-indobert-base \
--test-dataset-name SemRel/SemRel2024 \
--test-dataset-config ind \
--test-dataset-split test \
--test-text-column-1 sentence1 \
--test-text-column-2 sentence2 \
--test-label-column label \
--test-batch-size 32
```

## MIRACL (Multilingual Information Retrieval Across a Continuum of Languages)

MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world. We evaluated our models on the Indonesian subset of MIRACL.
14 changes: 14 additions & 0 deletions evaluation/run_evaluation.sh
@@ -76,6 +76,20 @@ python sts/eval_sts.py \
--test-batch-size 32 \
--output-folder sts/results/$model_name

for split in dev test
do
python sts/eval_sts.py \
--model-name $model \
--test-dataset-name SemRel/SemRel2024 \
--test-dataset-config ind \
--test-dataset-split $split \
--test-text-column-1 sentence1 \
--test-text-column-2 sentence2 \
--test-label-column label \
--test-batch-size 32 \
--output-folder sts/results/$model_name
done

###############################
# MTEB TASKS
###############################
6 changes: 4 additions & 2 deletions evaluation/sts/eval_sts.py
@@ -11,6 +11,7 @@
class Args:
model_name: str = "sentence-transformers/distiluse-base-multilingual-cased-v2"
test_dataset_name: str = "LazarusNLP/stsb_mt_id"
test_dataset_config: str = "default"
test_dataset_split: str = "test"
test_text_column_1: str = "text_1"
test_text_column_2: str = "text_2"
@@ -25,12 +26,13 @@ def main(args: Args):
model = SentenceTransformer(args.model_name)

# Load dataset
test_ds = load_dataset(args.test_dataset_name, split=args.test_dataset_split)
test_ds = load_dataset(args.test_dataset_name, args.test_dataset_config, split=args.test_dataset_split)
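# Normalize labels by the observed maximum so scores fall in [0, 1],
# whatever the dataset's label scale (0-5 for STS-B, 0-1 for SemRel2024)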
max_label_value = float(max(test_ds[args.test_label_column]))

test_data = [
InputExample(
texts=[data[args.test_text_column_1], data[args.test_text_column_2]],
label=float(data[args.test_label_column]) / 5.0,
label=float(data[args.test_label_column]) / max_label_value,
)
for data in test_ds
]