Amano2

Training Corpus

The training corpus is Dr. Oliver Hellwig's Sanskrit corpus.
The documents used for evaluation (viz., MS and KS) are excluded from training.
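
The exclusion step could be sketched as below. This is a minimal illustration, not the repository's code: the `(text_id, tokens)` record layout, the `MS.`/`KS.` ID prefixes, the `RV.1.1` entry, and the placeholder tokens are all assumptions made for the example.

```python
# Hypothetical layout: each document is (text_id, list_of_tokens).
# The ID scheme ("MS.1.6", "KS.8", ...) is assumed, not taken from the corpus files.
EVAL_TEXTS = {"MS", "KS"}


def is_evaluation_doc(text_id):
    """Return True if text_id belongs to a held-out text, e.g. 'MS.1.6'."""
    return text_id.split(".")[0] in EVAL_TEXTS


def training_documents(documents):
    """Keep only the documents that are not part of the evaluation set."""
    return [(text_id, tokens) for text_id, tokens in documents
            if not is_evaluation_doc(text_id)]


# Made-up records with placeholder tokens, for illustration only.
corpus = [
    ("MS.1.6", ["w1", "w2", "w3"]),
    ("KS.8", ["w4", "w5"]),
    ("RV.1.1", ["w6", "w7", "w8"]),
]
train = training_documents(corpus)
```

Matching on the first dot-separated component rather than on a plain string prefix avoids dropping unrelated texts whose sigla merely begin with the same letters.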

Models

Word2Vec

workers (the number of worker threads used to train the model) is set to 4.
sg (the training algorithm) is set to 1 (skip-gram).
All other parameters are left at their defaults; see the documentation for details.
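
A sketch of how these two settings might be passed, assuming the gensim implementation of Word2Vec (the README does not name the library, so this is an assumption). The function name `train_word2vec` is hypothetical.

```python
# Settings named in this README; everything else stays at the library defaults.
W2V_PARAMS = {
    "workers": 4,  # worker threads used for training
    "sg": 1,       # 1 = skip-gram, 0 = CBOW
}


def train_word2vec(sentences):
    """Train a model on an iterable of token lists.

    Assumes gensim is installed. The import is local so that the
    settings above can be inspected without it.
    """
    from gensim.models import Word2Vec

    return Word2Vec(sentences=sentences, **W2V_PARAMS)
```

With gensim's default `min_count` of 5, words occurring fewer than five times are dropped, so a very small test corpus may end up with an empty vocabulary.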

Transformers

SentenceTransformer is used.
GPT-2 is used as the model (default).
chronbmm/xlm-roberta-vedic is used as the model for Sanskrit.
All other parameters are left at their defaults; see the documentation for details.

Evaluation

See the output folder.
average.tsv is a table whose values are averages of cosine similarity. Note, however, that the value in the chapter column is not an average, because there is only one value per chapter (or two values, if the chapter contains both brāhmaṇa and mantra).
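
The averaging can be illustrated with a small sketch. How chunks are actually paired is not stated here, so comparing every vector of one chapter with every vector of the other is an assumption, and the function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def average_similarity(vectors_a, vectors_b):
    """Mean cosine similarity over all pairs (assumed pairing scheme)."""
    scores = [cosine_similarity(a, b) for a in vectors_a for b in vectors_b]
    return sum(scores) / len(scores)


# One vector per chapter: the "average" is just the single similarity value.
chapter_score = average_similarity([[1.0, 0.0]], [[1.0, 0.0]])

# Several chunk vectors: the result is a genuine average.
chunk_score = average_similarity([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
```

The first call shows why the chapter column holds a single similarity rather than an average: with one vector on each side there is only one pair to score.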

Corpus Division

The evaluation dataset is divided into the following units:

chapter (average? and distribution?)
200 tokens (average? and distribution?)
100 tokens (average? and distribution?)
20 tokens (average? and distribution?)
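
The fixed-size units above could be produced roughly as follows. This is a sketch under assumptions: the windows are taken to be consecutive and non-overlapping, and a shorter final chunk is kept rather than discarded; neither detail is specified here. The function name and the placeholder tokens are hypothetical.

```python
def split_into_chunks(tokens, size):
    """Split a token list into consecutive windows of at most `size` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# Placeholder chapter of 450 tokens, for illustration only.
tokens = [f"t{i}" for i in range(450)]
chunks_by_size = {size: split_into_chunks(tokens, size)
                  for size in (200, 100, 20)}
```

If short trailing chunks turn out to skew the averages, they could be dropped or merged into the preceding chunk instead.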

Chapters to be compared

MS.1.1 vs. MS.1.6
MS.1.6 vs. MS.1.7
MS.1.6 vs. KS.8
MS.1.7 vs. KS.9.1
MS.1.9 vs. KS.9.11

TODO

Add statistics such as average tokens per sentence, word distribution, etc.
