Version 2.0: example sentence furigana #21
stephenmk
announced in
Announcements
Replies: 1 comment
-
This project is so wicked cool. Thank you!
-
The new release of Jitendex includes furigana information for about 22,000 of the 26,000 unique example sentences that appear in the dictionary.
This information was originally produced automatically by a computer program used by the maintainers of Tatoeba. Since these sorts of programs aren't perfect, there were many errors in this original data. Tatoeba allows users to review and amend the furigana information on sentences. However, only about 5,500 of the 26,000 sentences that are used in Jitendex have been marked as reviewed by Tatoeba users.
Fortunately, there is an easy way to validate this furigana data. Each example sentence in the Tanaka Corpus has a corresponding "index" string which is used for associating the sentence with particular JMdict entries.
Consider the following furigana data from Tatoeba. The reading for 出 is incorrect.
彼は物静かな人で、良家の出だった。
The "index" string for this sentence looks like this:
This string contains elements that correspond to particular JMdict entries. For each element, we can go fetch the entry and find the correct reading information. (For an element like だ{だった} that is already completely written in kana, it is not necessary to go fetch any information from the entry for 'だ'). We can then assemble these readings into a pattern.
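As a rough illustration of the step described above, the sketch below splits an index string into elements and decides which ones need a JMdict lookup. The element layout it assumes (headword, then an optional `(reading)`, `[sense number]`, `{form used in the sentence}`, and a trailing `~`) is my reading of the Tanaka Corpus format and is not spelled out in this post. The names `parse_index`, `IndexElement`, and `needs_lookup` are hypothetical, and the sample string is made up for the demonstration; it is not the index string of the sentence above.

```python
import re
from typing import NamedTuple, Optional

# Assumed element layout: headword(reading)[sense]{form}~ with every part
# after the headword being optional.
ELEMENT = re.compile(
    r"(?P<headword>[^\s()\[\]{}~]+)"
    r"(?:\((?P<reading>[^)]*)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<form>[^}]*)\})?"
    r"~?"
)


class IndexElement(NamedTuple):
    headword: str
    reading: Optional[str]
    sense: Optional[int]
    form: Optional[str]


def parse_index(index: str) -> list[IndexElement]:
    """Split a space-separated index string into its elements."""
    elements = []
    for token in index.split():
        m = ELEMENT.fullmatch(token)
        if m is None:
            raise ValueError(f"unrecognized index element: {token!r}")
        sense = int(m["sense"]) if m["sense"] else None
        elements.append(IndexElement(m["headword"], m["reading"], sense, m["form"]))
    return elements


def is_kana(text: str) -> bool:
    """True if every character lies in the hiragana or katakana blocks."""
    return all("\u3041" <= ch <= "\u30ff" for ch in text)


def needs_lookup(element: IndexElement) -> bool:
    """An element already written in kana needs no reading from JMdict."""
    surface = element.form or element.headword
    return not is_kana(surface)


# Made-up sample string, for illustration only.
for element in parse_index("出(で)[02] だ{だった}"):
    print(element.headword, needs_lookup(element))
```

Only the elements for which `needs_lookup` returns true would require fetching the JMdict entry; an element like だ{だった} is skipped.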
If these kana appear in this same order in the furigana provided by Tatoeba, then we have some assurance that the furigana information is correct. The 22,000 sentences with furigana in the new version of Jitendex have all passed this validation.
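The ordering check can be sketched in a few lines. This is a minimal illustration and not the actual Jitendex code: it assumes the Tatoeba furigana has already been parsed into `(text, reading)` pairs, with `None` as the reading for plain-kana text, and it uses a made-up sentence (今日は雨だ) so that no real Tatoeba data is implied. The function names are hypothetical.

```python
import re


def readings_pattern(readings):
    """Join the expected readings into a regex that allows any text between them."""
    return re.compile(".*".join(re.escape(r) for r in readings))


def furigana_to_kana(segments):
    """Flatten (text, reading) segments into one kana string.

    Segments whose reading is None are assumed to be kana already.
    """
    return "".join(
        reading if reading is not None else text for text, reading in segments
    )


def passes_validation(readings, segments):
    """True if the expected readings occur, in order, in the furigana kana."""
    kana = furigana_to_kana(segments)
    return readings_pattern(readings).search(kana) is not None


# Readings that would come from the JMdict entries named in the index string.
expected = ["きょう", "あめ"]

good = [("今日", "きょう"), ("は", None), ("雨", "あめ"), ("だ", None)]
bad = [("今日", "こんにち"), ("は", None), ("雨", "あめ"), ("だ", None)]

print(passes_validation(expected, good))
print(passes_validation(expected, bad))
```

A sentence whose index string yields no readings at all would pass this check trivially, so a real implementation would probably want to treat that case separately.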
This validation method isn't foolproof, and I'm sure some errors slipped through. The "index" strings themselves are not guaranteed to be correct; occasionally, index elements are tied to the wrong JMdict entry. The index strings also do not always contain complete information about the sentence; they only contain words that are linked to existing JMdict entries. The readings of many proper nouns, large numbers, and counter expressions in particular cannot be recovered using this method.
I may have to fix the remaining 4,000 sentences that could not be validated by hand, unless I can think of a clever way to automate the process. I have already corrected 200 of these sentences manually. About a tenth of these sentences required fixes to the index strings as well. This is a very slow and laborious process.
Keyword highlighting in example sentences
The example sentence data distributed by the EDRDG includes a "keyword" value with each sentence. For example, the keyword for the sentence "算数は数を取り扱う" is "数". However, "数" appears twice in this sentence, and the data does not indicate which instance is the "数" that we are interested in.
There are over 200 such sentences in which the keyword value appears more than once. I manually reviewed each sentence and noted the correct instances of the keyword values. The new version of Jitendex only underlines these correct values.
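The sketch below shows one way to find sentences with repeated keywords and to underline only a chosen instance. It is an assumption-laden illustration: the zero-based `instance` number stands in for whatever form the manual review notes actually take, the `<u>` tag is just one way to render an underline, and both function names are hypothetical.

```python
def find_occurrences(sentence: str, keyword: str) -> list[int]:
    """Return the start index of each non-overlapping occurrence of keyword."""
    if not keyword:
        raise ValueError("keyword must not be empty")
    positions = []
    start = sentence.find(keyword)
    while start != -1:
        positions.append(start)
        start = sentence.find(keyword, start + len(keyword))
    return positions


def underline_instance(sentence: str, keyword: str, instance: int = 0) -> str:
    """Wrap only the chosen occurrence of keyword in <u> tags."""
    positions = find_occurrences(sentence, keyword)
    if not positions:
        return sentence  # keyword absent: leave the sentence untouched
    start = positions[instance]
    end = start + len(keyword)
    return f"{sentence[:start]}<u>{sentence[start:end]}</u>{sentence[end:]}"


sentence = "算数は数を取り扱う"

# More than one position means the sentence needs a manual decision.
print(find_occurrences(sentence, "数"))

# Assuming the reviewer picked the second occurrence (instance 1).
print(underline_instance(sentence, "数", instance=1))
```

Sentences for which `find_occurrences` returns more than one position are the ones that need review; everything else can be highlighted without an instance number.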
Before and after screenshots of the entry for 数
This can make quite a big difference in the helpfulness of these sentences. For example, "の" can appear in a sentence several times and serve several different purposes, but the point of an example sentence is to illustrate one particular purpose.
Example sentences in the entry for の displayed in GoldenDict-ng