Improve phrasing in the main article
scriptin committed Nov 10, 2023
1 parent e23d251 commit f6894f5
Showing 1 changed file with 7 additions and 8 deletions.
15 changes: 7 additions & 8 deletions src/pages/index.mdx
@@ -107,15 +107,14 @@ appears in ~26% of documents.

## Time distribution

-The style of text, grammar, vocabulary, and usage of certain kanji
-may depend on when a particular text was written. Also, texts which
+Texts from different epochs may have different kanji usage patterns
+due to differences in vocabulary and grammar rules. Also, texts which
discuss events of a certain time period may have statistical biases,
e.g. newspapers from 2020-2022 use COVID- and medicine-related
words and kanji more often compared to previous years.

-That's why it's important to collect texts which are distributed
-across wider time periods to avoid biases and have representative
-datasets.
+It's important to collect texts which are distributed
+across wider time periods to avoid these biases.

- **Aozora**: most texts are in public domain due to
expiration of copyright terms, which is currently
@@ -153,8 +152,8 @@ The data in the old version was collected from the following sources:
- Twitter (now knows as X)

However, this first attempt lacked sufficient research and technical effort,
-and the resulting dataset had multiple issues, described in the
-[attached readme](https://github.com/scriptin/kanji-frequency/tree/master/data2015/README.md).
+and the resulting dataset had multiple issues, described in the attached
+[readme](https://github.com/scriptin/kanji-frequency/tree/master/data2015/README.md).

### Current version

@@ -165,7 +164,7 @@ but unfortunately has some new problems:
- Twitter API no longer has a free tier
- Changes in the organization management and staff layoffs at Twitter
resulted in insufficient content moderation.
-I wanted to avoid including any hate speech in the data
+I preferred to avoid including any hate speech in the data
- **News dataset is much smaller**:
- Most news on popular websites are now behind paywalls,
making it impractical and illegal to create crawlers/scrapers