Alternate Frequency Corpuses - Suggestions welcome #3

sschmidTU · 2021-09-10T14:34:17Z

InnocentCorpus is great, but being sourced from ~5000 novels, it's a little small (though mostly sufficient) and specific to books.
Alternatives would be sources using News, Movies, Wikipedia, Twitter, etc.
Ideally, we would be able to use any of these frequencies, by user choice (or insert all of them into different fields).

There's a popular Anime/J-Drama corpus that still needs research (see next comment).
(It ranks words differently though: 1 = most common word, 2 = 2nd most common word, etc. That is the inverse of the InnocentCorpus frequency, which is the number of occurrences. This has actually led to confusion in forums.)

It's easy enough to swap out the corpus: right now it's just a JavaScript object (hashmap) called innocent_terms_complete, inserted as a global variable through a <script> tag in index.html.
It was generated via tools/parseCorpus.js.
So currently frequency = innocent_terms_complete[word]. (more or less)
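The "more or less" part can be sketched like this. It's only an illustration: the two entries and their counts are made-up placeholders, and lookupFrequency is a hypothetical helper name, not something from the repo.

```javascript
// Hypothetical stand-in for the global that the corpus <script> tag defines.
// The counts here are invented placeholders, not real corpus values.
const innocent_terms_complete = { "食べる": 4521, "猫": 1310 };

// Returns the occurrence count for a word, or null if the corpus lacks it.
function lookupFrequency(corpus, word) {
  // Own-property check, so a lookup like "constructor" doesn't return
  // something inherited from Object.prototype.
  return Object.prototype.hasOwnProperty.call(corpus, word) ? corpus[word] : null;
}

console.log(lookupFrequency(innocent_terms_complete, "猫")); // 1310
console.log(lookupFrequency(innocent_terms_complete, "constructor")); // null
```

A plain corpus[word] would return undefined for missing words, so the explicit null mainly makes the "not in corpus" case easier to handle when filling the Anki field.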

Note for Latin nerds: the Latin plural is corpora, but corpuses is also an allowed plural in English. I love Latin, but English is not Latin.
Also, the plural of octopus is octopuses, not octopi, because it's a Greek word, not Latin, the Greek plural being octopodes. But the dictionary is generous and accepts all 3, reflecting common usage.


sschmidTU · 2021-09-10T15:01:13Z

Frequency Corpuses research (WIP):

The Anime/JDrama frequency corpus may be this one, using ~12000 Anime+Drama subtitle files:
https://github.com/chriskempson/japanese-subtitles-word-frequency-list

This one actually has both number of occurrences and frequency ranking ("nth most common word"), which is nice.

There's also one using ~200 anime shows, which is too small a sample for my tastes, but has some interesting findings ("the top ~900 kanji make 90% of kanji occurrences, top ~1900 make 99%"):
https://www.reddit.com/r/LearnJapanese/comments/crlsqj/googlesheet_anime_frequency_list/
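A coverage figure like "top N make 90%" can be computed from any count table by sorting counts in descending order and accumulating. A minimal sketch, assuming a plain object of item -> count; the function name and the sample numbers are hypothetical and not taken from that sheet:

```javascript
// Returns how many of the most frequent items are needed to reach
// `target` (a fraction between 0 and 1) of all occurrences.
function itemsNeededForCoverage(counts, target) {
  const sorted = Object.values(counts).sort((a, b) => b - a);
  const total = sorted.reduce((sum, n) => sum + n, 0);
  let running = 0;
  for (let i = 0; i < sorted.length; i++) {
    running += sorted[i];
    if (running / total >= target) return i + 1;
  }
  return sorted.length; // empty input, or target above 1
}

// Invented toy data: four items with 100 occurrences in total.
const toyCounts = { a: 50, b: 30, c: 15, d: 5 };
console.log(itemsNeededForCoverage(toyCounts, 0.9)); // 3
```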

patarapolw · 2022-03-16T19:50:48Z

I've just found another resource on PyPI (toiro).

from toiro import datadownloader, tokenizers

# A list of available corpora in toiro
corpora = datadownloader.available_corpus()
print(corpora)
# => ['livedoor_news_corpus', 'yahoo_movie_reviews', 'amazon_reviews', 'chABSA_dataset']

available_tokenizers = tokenizers.available_tokenizers()
print(available_tokenizers.keys())
# => dict_keys(['nagisa', 'janome', 'mecab-python3', 'sudachipy', 'spacy', 'ginza', 'kytea', 'jumanpp', 'sentencepiece', 'fugashi-ipadic', 'tinysegmenter', 'fugashi-unidic'])

I am not really sure if segmenters are required, but I added them just in case.

sschmidTU · 2022-06-05T17:26:36Z

BCCWJ corpus is now added (separate URL for now, 5.8MB download):
https://sschmidtu.github.io/anki-frequency-inserter/index_BCCWJ.html?expressionFieldName=Expression&frequencyFieldName=FrequencyBCCWJ

main commit: b64a447
unification: 36aba62
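The URL above passes the Anki field names as query parameters. Reading them could look roughly like this; a sketch using the standard URL API, where readFieldNames and the fallback values are assumptions and not necessarily what frequencyInserter.js does:

```javascript
// Extracts the two field names from a page URL.
// The fallback defaults are assumptions made for this sketch.
function readFieldNames(pageUrl) {
  const params = new URL(pageUrl).searchParams;
  return {
    expressionFieldName: params.get("expressionFieldName") ?? "Expression",
    frequencyFieldName: params.get("frequencyFieldName") ?? "Frequency",
  };
}

const fields = readFieldNames(
  "https://sschmidtu.github.io/anki-frequency-inserter/index_BCCWJ.html" +
    "?expressionFieldName=Expression&frequencyFieldName=FrequencyBCCWJ"
);
console.log(fields.frequencyFieldName); // FrequencyBCCWJ
```

In the browser, window.location.href would be the natural argument.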

This is the Balanced Corpus of Contemporary Written Japanese, which uses relative frequency, i.e. a rank (100 = 100th most common word), instead of absolute frequency like InnocentCorpus (100 = occurs 100 times in these 5000 books).
(We could also convert InnocentCorpus to relative frequency via code if the user wants that.)
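Converting occurrence counts into ranks would be a sort plus an index. A minimal sketch, assuming the corpus is a plain word -> count object; countsToRanks is a hypothetical name, and words with equal counts simply get consecutive ranks here, which may not be the tie handling we'd want:

```javascript
// Turns { word: occurrenceCount } into { word: rank }, where rank 1 is
// the most frequent word. Ties get consecutive ranks in input order.
function countsToRanks(counts) {
  const words = Object.keys(counts).sort((a, b) => counts[b] - counts[a]);
  const ranks = {};
  words.forEach((word, index) => {
    ranks[word] = index + 1;
  });
  return ranks;
}

// Invented toy counts, just to show the shape of the result.
console.log(countsToRanks({ a: 5, b: 50, c: 20 })); // { b: 1, c: 2, a: 3 }
```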

In my experience, the two corpuses differ in interesting ways, and each is missing some common words, so they supplement each other very well in my Anki cards.
I wrote more about it in a post on WaniKani.

@patarapolw tagging you in case you're interested.

Still open for more corpus suggestions!
(The ones from PyPI sound interesting, I just haven't had a chance to look at them yet.)

Currently, the BCCWJ corpus just needs its own index_BCCWJ.html (and corpus terms_BCCWJ.js). The main code is now unified in frequencyInserter.js and can be easily expanded for more corpuses.

We could unify the HTML into one page as well, but then we'd need radio buttons to choose the corpus,
and more importantly to load the corpus after page load on user click, so that the user doesn't have to download all corpuses at once. (Innocent is ~1.7MB zipped, BCCWJ ~5.8MB)
This might take longer for the user, and we'd need to dynamically require the corpus .js or something.
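One possible shape for the on-demand loading, as a rough sketch: cache one promise per corpus so each file is downloaded at most once, and inject a <script> tag on first use. All names here are hypothetical, and the terms_<name>.js file pattern and the global variable it defines are assumptions (the global would have to be declared with var or assigned to window, not with const).

```javascript
// Creates a loader that fetches each corpus at most once.
// `fetchCorpus` is injected (name -> Promise of the term map), so the
// caching logic itself does not depend on the browser.
function createCorpusLoader(fetchCorpus) {
  const cache = new Map();
  return function load(name) {
    if (!cache.has(name)) {
      const pending = Promise.resolve(fetchCorpus(name)).catch((err) => {
        cache.delete(name); // allow a retry after a failed download
        throw err;
      });
      cache.set(name, pending);
    }
    return cache.get(name);
  };
}

// Browser-only fetchCorpus: inject a <script> tag, resolve once it has loaded.
// Assumes terms_<name>.js defines a global variable named terms_<name>.
function fetchCorpusViaScriptTag(name) {
  return new Promise((resolve, reject) => {
    const script = document.createElement("script");
    script.src = `terms_${name}.js`;
    script.onload = () => resolve(window[`terms_${name}`]);
    script.onerror = () => reject(new Error(`Could not load corpus ${name}`));
    document.head.appendChild(script);
  });
}
```

A radio button handler could then call something like createCorpusLoader(fetchCorpusViaScriptTag)("BCCWJ").then(...), so only the chosen corpus is downloaded, and switching back to an already loaded corpus costs nothing.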

sschmidTU changed the title ~~Alternate Corpus (Corpi?) - Suggestions welcome~~ Alternate Frequency Corpus (Corpi?) - Suggestions welcome Sep 10, 2021

sschmidTU changed the title ~~Alternate Frequency Corpus (Corpi?) - Suggestions welcome~~ Alternate Frequency Corpuses - Suggestions welcome Sep 10, 2021

sschmidTU added the good first issue label Sep 10, 2021
