ykyogoku/Amano2

Amano2

Training Corpus

Models

Word2Vec

  • workers (the number of worker threads used to train the model) is set to 4.
  • sg (the training algorithm) is set to 1 (skip-gram).
  • All other parameters are left at their defaults. For details on them, see the documentation.
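
The settings above can be sketched as follows. This is a minimal illustration, not the repository's actual training script: it assumes the gensim implementation of Word2Vec (the parameter names workers and sg match gensim's), and the names W2V_PARAMS and train_word2vec are hypothetical.

```python
# Sketch only: assumes gensim's Word2Vec; the helper names are hypothetical.
W2V_PARAMS = {
    "workers": 4,  # number of worker threads used for training
    "sg": 1,       # 1 = skip-gram (0 would select CBOW)
}


def train_word2vec(sentences):
    """Train a Word2Vec model on an iterable of token lists.

    All parameters not listed in W2V_PARAMS keep gensim's defaults.
    Note that the default min_count discards rare words, so a very
    small toy corpus may end up with an empty vocabulary.
    """
    # Imported lazily so the configuration can be inspected without gensim.
    from gensim.models import Word2Vec

    return Word2Vec(sentences=sentences, **W2V_PARAMS)
```

Here sentences is expected to be an iterable of token lists, for example one list per sentence of the training corpus.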

Transformers

  • SentenceTransformer is used to compute the embeddings.
  • GPT-2 is used as the model by default.
  • chronbmm/xlm-roberta-vedic is used as the model for Sanskrit.
  • All other parameters are left at their defaults. For details on them, see the documentation.
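
As a rough sketch of how text segments might be embedded with these settings, assuming the sentence-transformers package and that the model identifier is passed directly to SentenceTransformer. The function name embed_segments and the constant SANSKRIT_MODEL are hypothetical and do not come from the repository.

```python
# Sketch only: assumes the sentence-transformers package; names are hypothetical.
SANSKRIT_MODEL = "chronbmm/xlm-roberta-vedic"  # model named above for Sanskrit


def embed_segments(segments, model_name=SANSKRIT_MODEL):
    """Return one embedding vector per text segment.

    `segments` is a list of strings. The model is downloaded on first
    use, so this call needs network access and may take some time.
    """
    # Imported lazily so the module can be loaded without the dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # encode() returns one vector per input string (a NumPy array by default).
    return model.encode(segments)
```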

Evaluation

  • See output folder.
  • average.tsv is a table whose values are averages of cosine similarity. Note, however, that the value in the chapter column is not an average, because there is only one value per chapter (or two values if the chapter contains both brāhmaṇa and mantra).
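
To make the averaging concrete, here is a minimal sketch in plain Python. It assumes that every segment vector of one chapter is compared with every segment vector of the other and that the resulting cosine similarities are averaged. That pairing scheme is an assumption for illustration, and the function names are hypothetical.

```python
import math
from itertools import product


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def average_similarity(segments_a, segments_b):
    """Average cosine similarity over all cross-chapter segment pairs.

    Assumption: all pairs (a, b) are compared; the repository may pair
    segments differently.
    """
    scores = [cosine_similarity(u, v) for u, v in product(segments_a, segments_b)]
    if not scores:
        raise ValueError("both chapters need at least one segment vector")
    return sum(scores) / len(scores)
```

When each chapter is represented by a single vector, there is only one pair, so the "average" is just that one similarity value, which is why the chapter column is not a true average.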

Corpus Division

The evaluation dataset is divided into the following units:

  • chapter (average? and distribution?)
  • 200 tokens (average? and distribution?)
  • 100 tokens (average? and distribution?)
  • 20 tokens (average? and distribution?)
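
The fixed-size units can be sketched as a simple chunking step. This is an illustration under stated assumptions: the chapter is already tokenized into a list, and a final chunk shorter than the requested size is kept rather than dropped. The function name split_into_chunks is hypothetical.

```python
def split_into_chunks(tokens, size):
    """Split a token list into consecutive chunks of `size` tokens.

    Assumption: the last chunk may be shorter than `size` and is kept.
    """
    if size <= 0:
        raise ValueError("size must be a positive integer")
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# Example with a placeholder chapter of 45 tokens and the unit sizes listed above.
chapter_tokens = [f"tok{i}" for i in range(45)]
chunks_by_size = {size: split_into_chunks(chapter_tokens, size) for size in (200, 100, 20)}
```

With 45 tokens, the 200-token and 100-token settings each yield a single short chunk, while the 20-token setting yields chunks of 20, 20, and 5 tokens.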

Chapters to be compared

  1. MS.1.1 vs. MS.1.6
  2. MS.1.6 vs. MS.1.7
  3. MS.1.6 vs. KS.8
  4. MS.1.7 vs. KS.9.1
  5. MS.1.9 vs. KS.9.11

TODO

  • Add statistics such as average tokens per sentence, word distribution, etc.
