Alternate Frequency Corpuses - Suggestions welcome #3

Open
sschmidTU opened this issue Sep 10, 2021 · 3 comments
Labels
good first issue Good for newcomers

Comments

@sschmidTU
Owner

sschmidTU commented Sep 10, 2021

InnocentCorpus is great, but since it draws on only ~5000 novels, it's a little small (though mostly sufficient) and specific to books.
Alternatives would be sources using News, Movies, Wikipedia, Twitter, etc.
Ideally, we would be able to use any of these frequencies, by user choice (or insert all of them into different fields).

There's a popular Anime/J-Drama corpus -> research. (see next comment)
(It ranks words differently though: 1 = most common word, 2 = 2nd most common word, etc. That is the inverse of InnocentCorpus frequency, which is the number of occurrences. This has actually led to confusion in forums.)

It's easy enough to exchange the corpus: right now it's just a JavaScript object (hash map) called innocent_terms_complete, inserted as a global variable through a <script> tag in the index.html.
It was generated via tools/parseCorpus.js.
So currently frequency = innocent_terms_complete[word]. (more or less)
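
The lookup above could be wrapped roughly like this. This is only a minimal sketch: the small innocent_terms_complete object below is stand-in data with invented counts (the real one is the global generated by tools/parseCorpus.js), and lookupFrequency is a hypothetical helper name, not a function from the repo.

```javascript
// Stand-in for the global corpus object; these counts are invented for illustration.
const innocent_terms_complete = {
  "私": 120000,
  "猫": 4300,
};

// Hypothetical helper: returns the occurrence count for a word,
// or undefined if the word is not in the corpus.
function lookupFrequency(corpus, word) {
  // hasOwnProperty guards against inherited keys such as "toString",
  // which a plain corpus[word] lookup would otherwise return.
  return Object.prototype.hasOwnProperty.call(corpus, word)
    ? corpus[word]
    : undefined;
}

console.log(lookupFrequency(innocent_terms_complete, "猫"));       // 4300
console.log(lookupFrequency(innocent_terms_complete, "toString")); // undefined
```

The guard matters because a plain object used as a hash map inherits properties from Object.prototype, so a card whose expression happens to be "constructor" or "toString" would otherwise get a function instead of a number.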

Note for Latin nerds: the Latin plural is corpora, but corpuses is also an allowed plural in English. I love Latin, but English is not Latin.
Also, the plural of octopus is octopuses, not octopi, because it's a Greek word, not Latin, the Greek plural being octopodes. But the dictionary is generous and accepts all 3, reflecting common usage.

@sschmidTU sschmidTU changed the title Alternate Corpus (Corpi?) - Suggestions welcome Alternate Frequency Corpus (Corpi?) - Suggestions welcome Sep 10, 2021
@sschmidTU sschmidTU changed the title Alternate Frequency Corpus (Corpi?) - Suggestions welcome Alternate Frequency Corpuses - Suggestions welcome Sep 10, 2021
@sschmidTU sschmidTU added the good first issue Good for newcomers label Sep 10, 2021
@sschmidTU
Owner Author

sschmidTU commented Sep 10, 2021

Frequency Corpuses research (WIP):

This one actually has both number of occurrences and frequency ranking ("nth most common word"), which is nice.

@patarapolw

I've just found another resource on PyPI (toiro).

from toiro import datadownloader, tokenizers

# A list of available corpora in toiro
corpora = datadownloader.available_corpus()
print(corpora)
# => ['livedoor_news_corpus', 'yahoo_movie_reviews', 'amazon_reviews', 'chABSA_dataset']

available_tokenizers = tokenizers.available_tokenizers()
print(available_tokenizers.keys())
# => dict_keys(['nagisa', 'janome', 'mecab-python3', 'sudachipy', 'spacy', 'ginza', 'kytea', 'jumanpp', 'sentencepiece', 'fugashi-ipadic', 'tinysegmenter', 'fugashi-unidic'])

I am not really sure if segmenters are required, but I added them just in case.

@sschmidTU
Owner Author

sschmidTU commented Jun 5, 2022

BCCWJ corpus is now added (separate URL for now, 5.8MB download):
https://sschmidtu.github.io/anki-frequency-inserter/index_BCCWJ.html?expressionFieldName=Expression&frequencyFieldName=FrequencyBCCWJ

main commit: b64a447
unification: 36aba62

This is the Balanced Corpus of Contemporary Written Japanese, which uses a frequency rank (100 = 100th most common word, called relative frequency here) instead of an absolute occurrence count like InnocentCorpus (100 = occurs 100 times in these 5000 books).
(we could also convert InnocentCorpus to relative frequency via code if desired by the user)
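
Such a conversion could look roughly like the sketch below. It assumes the corpus is a plain { word: occurrenceCount } object; countsToRanks is a hypothetical helper name and the sample counts are invented, not taken from either corpus.

```javascript
// Hypothetical helper: turns { word: occurrenceCount } into { word: rank },
// where rank 1 is the most common word.
function countsToRanks(counts) {
  const words = Object.keys(counts);
  // Sort by count, highest first; ties are broken by code-unit order
  // so the result does not depend on the input key order.
  words.sort((a, b) => counts[b] - counts[a] || (a < b ? -1 : a > b ? 1 : 0));
  const ranks = {};
  words.forEach((word, index) => {
    ranks[word] = index + 1;
  });
  return ranks;
}

// Invented sample counts, not real corpus data.
const sampleCounts = { "の": 900, "猫": 40, "犬": 75 };
console.log(countsToRanks(sampleCounts)); // { 'の': 1, '犬': 2, '猫': 3 }
```

Note that in this sketch words with equal counts still get distinct consecutive ranks; whether tied words should share a rank is a design choice the thread doesn't settle.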

In my experience, both corpuses have some interesting differences and common words missing, so they supplement each other very well in my Anki cards.
I wrote more about it in a post on Wanikani.

@patarapolw tagging you in case you're interested.

Still open for more corpus suggestions!
(The ones from PyPI sound interesting, just didn't get to take a look yet)

Currently, the BCCWJ corpus just needs its own index_BCCWJ.html (and corpus terms_BCCWJ.js); the main code is now unified in frequencyInserter.js and can easily be expanded for more corpuses.

We could unify the HTML into one page as well, but then we'd need radio buttons to choose the corpus,
and more importantly to load the corpus after page load on user click, so that the user doesn't have to download all corpuses at once. (Innocent is ~1.7MB zipped, BCCWJ ~5.8MB)
This might take longer for the user, and we'd need to dynamically require the corpus .js or something.
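
One way the on-demand loading could work is sketched below, under several assumptions: terms_BCCWJ.js is the file name mentioned above, but terms_innocent.js, CORPUS_FILES, loadScriptInBrowser and createCorpusLoader are all hypothetical names, not code from the repo. The idea is to inject the corpus <script> tag only when the user picks that corpus, and to remember the pending load so each corpus is downloaded at most once.

```javascript
// Assumed file names: terms_BCCWJ.js appears in this thread,
// terms_innocent.js is a made-up placeholder.
const CORPUS_FILES = {
  innocent: "terms_innocent.js",
  bccwj: "terms_BCCWJ.js",
};

// Browser-only: appends a <script> tag and resolves once it has loaded.
function loadScriptInBrowser(url) {
  return new Promise((resolve, reject) => {
    const script = document.createElement("script");
    script.src = url;
    script.onload = () => resolve(url);
    script.onerror = () => reject(new Error("Failed to load " + url));
    document.head.appendChild(script);
  });
}

// Returns a function that loads each corpus at most once.
// loadScript is injected so the caching logic can be exercised without a browser.
function createCorpusLoader(loadScript) {
  const pending = {};
  return function loadCorpus(name) {
    if (!Object.prototype.hasOwnProperty.call(CORPUS_FILES, name)) {
      return Promise.reject(new Error("Unknown corpus: " + name));
    }
    if (!pending[name]) {
      // Forget failed loads so a later click can retry the download.
      pending[name] = loadScript(CORPUS_FILES[name]).catch((err) => {
        delete pending[name];
        throw err;
      });
    }
    return pending[name];
  };
}
```

In the page this would be wired up as something like const loadCorpus = createCorpusLoader(loadScriptInBrowser), with loadCorpus("bccwj") called from the radio button's change handler before inserting frequencies. The first click on a corpus still pays the full download, so the delay mentioned above doesn't go away; it just moves to the moment of selection.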
