Alternate Frequency Corpuses - Suggestions welcome #3
Comments
Frequency Corpuses research (WIP):
This one actually has both the number of occurrences and the frequency ranking ("nth most common word"), which is nice.
I've just found another resource on PyPI (toiro).
from toiro import datadownloader, tokenizers
# A list of available corpora in toiro
corpora = datadownloader.available_corpus()
print(corpora)
# => ['livedoor_news_corpus', 'yahoo_movie_reviews', 'amazon_reviews', 'chABSA_dataset']
available_tokenizers = tokenizers.available_tokenizers()
print(available_tokenizers.keys())
# => dict_keys(['nagisa', 'janome', 'mecab-python3', 'sudachipy', 'spacy', 'ginza', 'kytea', 'jumanpp', 'sentencepiece', 'fugashi-ipadic', 'tinysegmenter', 'fugashi-unidic'])
I am not really sure if segmenters are required, but I added them just in case.
BCCWJ corpus is now added (separate URL for now, 5.8MB download): main commit: b64a447
This is the Balanced Corpus of Contemporary Written Japanese, which uses relative frequency, i.e. a rank (100 = 100th most common word), instead of absolute frequency like InnocentCorpus (100 = occurs 100 times in these 5000 books). In my experience, the two corpuses differ in interesting ways and each is missing some common words, so they supplement each other very well in my Anki cards.
@patarapolw tagging you in case you're interested. Still open for more corpus suggestions!
Currently, the BCCWJ corpus just needs its own We could unify the HTML into one page as well, but then we'd need radio buttons to choose the corpus,
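To make the two conventions comparable, one option is to convert an occurrence-count corpus into ranks. The following is only a rough sketch: `countsToRanks` is a hypothetical helper (it is not part of this repo), the sample numbers are made up, and tied counts are simply ordered alphabetically rather than sharing a rank.

```javascript
// Hypothetical helper: turn occurrence counts into ranks (1 = most common word).
function countsToRanks(counts) {
  const words = Object.keys(counts);
  // Sort by count, highest first; break ties alphabetically so the
  // result is deterministic.
  words.sort((a, b) => counts[b] - counts[a] || (a < b ? -1 : a > b ? 1 : 0));
  const ranks = {};
  words.forEach((word, index) => {
    ranks[word] = index + 1;
  });
  return ranks;
}

// Made-up counts, for illustration only (not real corpus values).
const sampleCounts = { "猫": 120, "の": 5000, "哲学": 7 };
const sampleRanks = countsToRanks(sampleCounts);
// の -> 1, 猫 -> 2, 哲学 -> 3
console.log(sampleRanks);
```

Going the other way (rank to count) is not possible without the original counts, so ranks are probably the safer common denominator if both corpuses end up on the same card.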
InnocentCorpus is great, but being sourced from ~5000 novels, it's a little small (though mostly sufficient) and specific to books.
Alternatives would be sources using News, Movies, Wikipedia, Twitter, etc.
Ideally, we would be able to use any of these frequencies, by user choice (or insert all of them into different fields).
There's a popular Anime/J-Drama corpus -> research. (see next comment)
(It ranks words differently though: 1 = most common word, 2 = 2nd most common word, etc. That is the inverse of the InnocentCorpus frequency, which is the number of occurrences. This has actually led to confusion in forums.)
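For the "user choice, or all of them into different fields" idea, a minimal sketch could keep every corpus in one registry and record whether its numbers are counts or ranks, so the two are never mixed up. Everything here is an assumption for illustration: the registry shape, the names `exampleCorpora` and `lookupAllCorpora`, and the sample values are all made up.

```javascript
// Hypothetical registry: each corpus records what its numbers mean.
// The entries are made-up sample data, not real corpus values.
const exampleCorpora = {
  innocent: { kind: "count", terms: { "猫": 120, "哲学": 7 } },
  bccwj: { kind: "rank", terms: { "猫": 850 } },
};

// Look a word up in every corpus and return one entry per corpus.
// A missing word yields null, so a card field can show "no data"
// explicitly instead of silently receiving undefined.
function lookupAllCorpora(word, corpora) {
  const result = {};
  for (const [name, corpus] of Object.entries(corpora)) {
    const hasWord = Object.prototype.hasOwnProperty.call(corpus.terms, word);
    result[name] = {
      kind: corpus.kind,
      value: hasWord ? corpus.terms[word] : null,
    };
  }
  return result;
}

console.log(lookupAllCorpora("哲学", exampleCorpora));
```

Each key of the returned object could then map to its own Anki field, and a radio button would only need to pick one key.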
It's easy enough to exchange the corpus. Right now it's just a JavaScript object (hashmap) called innocent_terms_complete, inserted as a global variable through a <script> tag in index.html. It was generated via tools/parseCorpus.js.
So currently frequency = innocent_terms_complete[word] (more or less).
Note for Latin nerds: the Latin plural is corpora, but corpuses is also an allowed plural in English. I love Latin, but English is not Latin.
Also, the plural of octopus is octopuses, not octopi, because it's a Greek word, not Latin, the Greek plural being octopodes. But the dictionary is generous and accepts all 3, reflecting common usage.
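Back to the lookup itself: since common words can be missing from a corpus, a slightly more defensive version of frequency = innocent_terms_complete[word] might look like the sketch below. `lookupFrequency` and `demo_terms` are hypothetical names (the latter stands in for the real global), and the fallback value of 0 is an assumption.

```javascript
// Stand-in for the global that the <script> tag defines; values are made up.
const demo_terms = { "猫": 120 };

// Return the stored frequency, or a fallback when the corpus has no entry.
// The own-property check also avoids picking up inherited keys such as
// "constructor", which a plain terms[word] lookup would return.
function lookupFrequency(terms, word, fallback = 0) {
  return Object.prototype.hasOwnProperty.call(terms, word)
    ? terms[word]
    : fallback;
}

console.log(lookupFrequency(demo_terms, "猫")); // 120
console.log(lookupFrequency(demo_terms, "犬")); // 0
```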