- The training corpus is Dr. Oliver Hellwig's Sanskrit corpus.
- The documents used for evaluation (viz., MS and KS) are excluded from training.
- workers (worker threads to train the model) is set to 4.
- sg (Training algorithm) is set to 1 (skip-gram).
- All other parameters are left at their defaults; see the documentation for details.
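
The training setup described above can be sketched roughly as follows. This is only an illustration under assumptions: the text names the `workers` and `sg` parameters but not the library, so gensim's `Word2Vec` is assumed here, and the document-id scheme (`MS.1.1`, `KS.8`, ...) and the helper names `drop_evaluation_docs` and `train` are hypothetical.

```python
# Assumption: evaluation texts are identified by the part of the id before the
# first dot (e.g. "MS.1.1" -> "MS"). This id scheme is hypothetical.
EVAL_TEXTS = {"MS", "KS"}

# Only `workers` and `sg` are stated in the notes; everything else is left at
# the library defaults.
W2V_PARAMS = {"workers": 4, "sg": 1}  # sg=1 selects skip-gram


def drop_evaluation_docs(docs):
    """Return the documents that do not belong to an evaluation text.

    `docs` maps a document id to a list of tokenized sentences.
    """
    return {
        doc_id: sentences
        for doc_id, sentences in docs.items()
        if doc_id.split(".")[0] not in EVAL_TEXTS
    }


def train(docs):
    """Train a model on the filtered corpus (requires gensim)."""
    # Imported lazily so the filtering helper works without gensim installed.
    from gensim.models import Word2Vec  # assumed library, see lead-in

    training_docs = drop_evaluation_docs(docs)
    sentences = [s for doc in training_docs.values() for s in doc]
    return Word2Vec(sentences=sentences, **W2V_PARAMS)
```

Filtering before training keeps the evaluation passages out of the vocabulary statistics, so similarities measured on MS and KS are not influenced by having seen those texts.
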
- SentenceTransformer is used.
- GPT-2 is used as the model (default).
- All other parameters are left at their defaults; see the documentation for details.
- chronbmm/xlm-roberta-vedic is used as the model for Sanskrit.
- See output folder.
- average.tsv is a table whose values are averages of cosine similarity. Note, however, that the value in the chapter column is not an average, because there is only one value per chapter (or two values, if the chapter contains both brāhmaṇa and mantra).
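
As a rough illustration of what an averaged cosine similarity means here, the following sketch computes the mean over a list of vector pairs. How the units of two passages are actually paired is not stated in these notes, so the pairing is left to the caller; the function names are hypothetical.

```python
import math
from statistics import mean


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_similarity(pairs):
    """Mean cosine similarity over a list of (vector, vector) pairs."""
    return mean(cosine_similarity(u, v) for u, v in pairs)
```

With a single pair, as in the chapter column, the mean is simply that one similarity value, which is why that column should not be read as an average.
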
The evaluation dataset is divided into the following units:
- chapter (average? and distribution?)
- 200 tokens (average? and distribution?)
- 100 tokens (average? and distribution?)
- 20 tokens (average? and distribution?)
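
The fixed-size units above could be produced with a simple chunking helper such as the one below. This is a minimal sketch, assuming the text is already tokenized and that a shorter final chunk is kept as it is; the notes do not say how the remainder is handled, and the name `split_into_units` is hypothetical.

```python
def split_into_units(tokens, size):
    """Split a token list into consecutive units of at most `size` tokens.

    Assumption: the last unit may be shorter than `size` and is kept.
    """
    if size <= 0:
        raise ValueError("size must be a positive integer")
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# Example: the three fixed-size units used in the evaluation.
UNIT_SIZES = (200, 100, 20)
```

For instance, `{n: split_into_units(tokens, n) for n in UNIT_SIZES}` gives one list of units per size for a tokenized chapter.
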
- MS.1.1 vs. MS.1.6
- MS.1.6 vs. MS.1.7
- MS.1.6 vs. KS.8
- MS.1.7 vs. KS.9.1
- MS.1.9 vs. KS.9.11
- Add statistics such as average tokens per sentence, word distribution, etc.