@@ -4,21 +4,25 @@ Inspired by [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-
## Training Data
- | Dataset | Task | Data Instance | Number of Training Tuples |
- | ------- | :--: | :-----------: | :-----------------------: |
- | [indonli](https://huggingface.co/datasets/indonli) | Natural Language Inference | `(premise, entailment, contradiction)` | 3,914 |
- | [indolem/indo_story_cloze](https://huggingface.co/datasets/indolem/indo_story_cloze) | Commonsense Reasoning | `(context, correct ending, incorrect ending)` | 1,000 |
- | [unicamp-dl/mmarco](https://huggingface.co/datasets/unicamp-dl/mmarco) | Passage Retrieval | `(query, positive passage, negative passage)` | 100,000 |
- | [miracl/miracl](https://huggingface.co/datasets/miracl/miracl) | Passage Retrieval | `(query, positive passage, negative passage)` | 8,086 |
- | [SEACrowd/wrete](https://huggingface.co/datasets/SEACrowd/wrete) | Textual Entailment | `(sentenceA, sentenceB)` | 183 |
- | [SEACrowd/indolem_ntp](https://huggingface.co/datasets/SEACrowd/indolem_ntp) | Textual Entailment | `(tweet, next tweet)` | 5,681 |
- | [khalidalt/tydiqa-goldp](https://huggingface.co/datasets/khalidalt/tydiqa-goldp) | Extractive Question-Answering | `(question, passage)`, `(question, answer)` | 11,404 |
- | [SEACrowd/facqa](https://huggingface.co/datasets/SEACrowd/facqa) | Extractive Question-Answering | `(question, passage)`, `(question, answer)` | 4,990 |
- | *included in v2* |
- | [indonesian-nlp/lfqa_id](https://huggingface.co/datasets/indonesian-nlp/lfqa_id) | Open-domain Question-Answering | `(question, answer)` | 226,147 |
- | [jakartaresearch/indoqa](https://huggingface.co/datasets/jakartaresearch/indoqa) | Extractive Question-Answering | `(question, passage)`, `(question, answer)` | 6,498 |
- | [jakartaresearch/id-paraphrase-detection](https://huggingface.co/datasets/jakartaresearch/id-paraphrase-detection) | Paraphrase | `(sentence, rephrased sentence)` | 4,076 |
- | **Total** | | | **371,979** |
+ | Dataset | Task | Data Instance | Number of Training Tuples |
+ | ------- | :--: | :-----------: | :-----------------------: |
+ | [indonli](https://huggingface.co/datasets/indonli) | Natural Language Inference | `(premise, entailment, contradiction)` | 3,914 |
+ | [indolem/indo_story_cloze](https://huggingface.co/datasets/indolem/indo_story_cloze) | Commonsense Reasoning | `(context, correct ending, incorrect ending)` | 1,000 |
+ | [unicamp-dl/mmarco](https://huggingface.co/datasets/unicamp-dl/mmarco) | Passage Retrieval | `(query, positive passage, negative passage)` | 100,000 |
+ | [miracl/miracl](https://huggingface.co/datasets/miracl/miracl) | Passage Retrieval | `(query, positive passage, negative passage)` | 8,086 |
+ | [SEACrowd/wrete](https://huggingface.co/datasets/SEACrowd/wrete) | Textual Entailment | `(sentenceA, sentenceB)` | 183 |
+ | [SEACrowd/indolem_ntp](https://huggingface.co/datasets/SEACrowd/indolem_ntp) | Textual Entailment | `(tweet, next tweet)` | 5,681 |
+ | [khalidalt/tydiqa-goldp](https://huggingface.co/datasets/khalidalt/tydiqa-goldp) | Extractive Question-Answering | `(question, passage)`, `(question, answer)` | 11,404 |
+ | [SEACrowd/facqa](https://huggingface.co/datasets/SEACrowd/facqa) | Extractive Question-Answering | `(question, passage)`, `(question, answer)` | 4,990 |
+ | *included in v2* |
+ | [indonesian-nlp/lfqa_id](https://huggingface.co/datasets/indonesian-nlp/lfqa_id) | Open-domain Question-Answering | `(question, answer)` | 226,147 |
+ | [jakartaresearch/indoqa](https://huggingface.co/datasets/jakartaresearch/indoqa) | Extractive Question-Answering | `(question, passage)`, `(question, answer)` | 6,498 |
+ | [jakartaresearch/id-paraphrase-detection](https://huggingface.co/datasets/jakartaresearch/id-paraphrase-detection) | Paraphrase | `(sentence, rephrased sentence)` | 4,076 |
+ | *included in v3* |
+ | [LazarusNLP/multilingual-NLI-26lang-2mil7-id](https://huggingface.co/datasets/LazarusNLP/multilingual-NLI-26lang-2mil7-id) | Natural Language Inference | `(premise, entailment, hypothesis)` | 41,924 |
+ | *included in v4* |
+ | [nthakur/swim-ir-monolingual](https://huggingface.co/datasets/nthakur/swim-ir-monolingual) | Passage Retrieval | `(query, positive passage, negative passage)` | 227,145 |
+ | **Total** | | | **641,048** |
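
The totals on both sides of this hunk can be cross-checked against the per-dataset counts. The snippet below is a minimal sketch in plain Python. The `COUNTS` dictionary, the `"initial"` label for the unlabelled first group of rows, and the `total_through` helper are illustrative names, not identifiers from the repository.

```python
# Training tuple counts copied from the table in this diff.
# "initial" is an illustrative label: the table does not name the first group.
COUNTS = {
    "initial": {
        "indonli": 3_914,
        "indolem/indo_story_cloze": 1_000,
        "unicamp-dl/mmarco": 100_000,
        "miracl/miracl": 8_086,
        "SEACrowd/wrete": 183,
        "SEACrowd/indolem_ntp": 5_681,
        "khalidalt/tydiqa-goldp": 11_404,
        "SEACrowd/facqa": 4_990,
    },
    "v2": {
        "indonesian-nlp/lfqa_id": 226_147,
        "jakartaresearch/indoqa": 6_498,
        "jakartaresearch/id-paraphrase-detection": 4_076,
    },
    "v3": {"LazarusNLP/multilingual-NLI-26lang-2mil7-id": 41_924},
    "v4": {"nthakur/swim-ir-monolingual": 227_145},
}


def total_through(version: str) -> int:
    """Sum the counts of every group up to and including `version`."""
    order = list(COUNTS)  # dicts preserve insertion order
    included = order[: order.index(version) + 1]
    return sum(sum(COUNTS[name].values()) for name in included)


if __name__ == "__main__":
    for version in COUNTS:
        print(f"{version}: {total_through(version):,}")
```

Summing through `v2` gives 371,979 (the removed total) and summing through `v4` gives 641,048 (the added total), so both table footers agree with their rows.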
## All Supervised Datasets with MultipleNegativesRankingLoss
@@ -46,6 +50,21 @@ python train_all_mnrl.py \
--learning-rate 2e-5
```
+ ## All Supervised Datasets with CachedMultipleNegativesRankingLoss
+
+ ### IndoBERT Base
+
+ ```sh
+ python train_all_mnrl.py \
+ --model-name indobenchmark/indobert-base-p1 \
+ --max-seq-length 128 \
+ --num-epochs 5 \
+ --train-batch-size-pairs 384 \
+ --train-batch-size-triplets 256 \
+ --mini-batch-size 320 \
+ --learning-rate 2e-5
+ ```
+
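
The `--mini-batch-size` flag presumably corresponds to the `mini_batch_size` argument of `CachedMultipleNegativesRankingLoss` in sentence-transformers, which embeds a large batch in smaller chunks so that memory use is bounded by the chunk size. That mapping is an assumption, since the body of `train_all_mnrl.py` is not part of this diff. The sketch below uses plain Python to show how the batch sizes in the command would be split under that assumption. `chunk_bounds` is a hypothetical helper, not a function from the script or the library.

```python
from typing import List, Tuple


def chunk_bounds(batch_size: int, mini_batch_size: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs covering a batch in mini-batch-sized chunks."""
    if batch_size <= 0 or mini_batch_size <= 0:
        raise ValueError("both sizes must be positive")
    return [
        (start, min(start + mini_batch_size, batch_size))
        for start in range(0, batch_size, mini_batch_size)
    ]


if __name__ == "__main__":
    # Batch sizes and mini-batch size taken from the command in this diff.
    print("pairs:   ", chunk_bounds(384, 320))
    print("triplets:", chunk_bounds(256, 320))
```

With these values, a pair batch of 384 would be processed as one chunk of 320 followed by one of 64, while a triplet batch of 256 fits in a single chunk because it is smaller than the mini-batch size.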
## References
``` bibtex