Version 2.0: example sentence furigana #21

stephenmk (Maintainer) · 2023-11-09T21:45:19Z

The new release of Jitendex includes furigana information for about 22,000 of the 26,000 unique example sentences that appear in the dictionary.

This information was originally produced automatically by a computer program used by the maintainers of Tatoeba. Since these sorts of programs aren't perfect, there were many errors in this original data. Tatoeba allows users to review and amend the furigana information on sentences. However, only about 5,500 of the 26,000 sentences that are used in Jitendex have been marked as reviewed by Tatoeba users.

Fortunately, there is an easy way to validate this furigana data. Each example sentence in the Tanaka Corpus has a corresponding "index" string which is used for associating the sentence with particular JMdict entries.

Consider the following furigana data from Tatoeba. The reading given for 出 (だし) is incorrect; it should be で.

彼(かれ)は物(もの)静(しず)かな人(ひと)で、良(りょう)家(け)の出(だし)だった。

The "index" string for this sentence looks like this:

彼(かれ)[01] は 物静か{物静かな}~ 人(ひと) で(#2028980) 良家 の 出(で) だ{だった}

This string contains elements that correspond to particular JMdict entries. For each element, we can fetch the entry and find the correct reading. (For an element like だ{だった} that is already written entirely in kana, there is no need to fetch anything from the entry for 'だ'.) We can then assemble these readings into a pattern.

かれ　は　しずか　ひと　で　りょうけ　の　で　だった

If these kana appear in this same order in the furigana provided by Tatoeba, then we have some assurance that the furigana information is correct. The 22,000 sentences with furigana in the new version of Jitendex have all passed this validation.
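The ordering check described above can be sketched in a few lines of Python. This is only an illustration, not the code Jitendex actually uses: the function names and the (base, ruby) pair representation of the furigana are assumptions made for the example. The pattern is copied from the post, and the sentence is the Tatoeba example with its incorrect reading for 出.

```python
def reading_of(segments):
    """Join (base, ruby) pairs into one string, using the ruby text where present."""
    return "".join(ruby if ruby else base for base, ruby in segments)


def kana_in_order(pattern, kana):
    """True if each reading in `pattern` occurs in `kana`, left to right, without overlap."""
    pos = 0
    for reading in pattern:
        found = kana.find(reading, pos)
        if found == -1:
            return False
        pos = found + len(reading)  # continue searching after this match
    return True


# Pattern copied from the post above.
PATTERN = ["かれ", "は", "しずか", "ひと", "で", "りょうけ", "の", "で", "だった"]

# Tatoeba furigana for the example sentence, as hypothetical (base, ruby) pairs.
tatoeba = [
    ("彼", "かれ"), ("は", None), ("物", "もの"), ("静", "しず"), ("かな", None),
    ("人", "ひと"), ("で、", None), ("良", "りょう"), ("家", "け"), ("の", None),
    ("出", "だし"), ("だった。", None),
]

# The same sentence with the reading of 出 corrected to で.
corrected = tatoeba[:10] + [("出", "で")] + tatoeba[11:]

print(kana_in_order(PATTERN, reading_of(tatoeba)))    # False
print(kana_in_order(PATTERN, reading_of(corrected)))  # True
```

The search is greedy: each reading is matched at its earliest position after the previous match. That is enough for an in-order check, and it shows why the test gives "some assurance" rather than proof, since any kana between the matched readings go unchecked.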

This validation method isn't foolproof, and I'm sure some errors slipped through. The "index" strings themselves are not guaranteed to be correct; occasionally, index elements are tied to the wrong JMdict entry. The index strings also do not always contain complete information about the sentence; they only contain words that are linked to existing JMdict entries. The readings of many proper nouns, large numbers, and counter expressions in particular cannot be recovered using this method.
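For reference, the element syntax that this validation depends on can be taken apart with a small parser. The sketch below is a rough reading of the format inferred from the example index string in this post: whitespace-separated elements of the form headword(reading)[sense]{surface form}~, where a parenthesized value starting with # is treated as a JMdict entry ID rather than a reading. The field names, the regular expression, and the `needs_lookup` rule are assumptions for illustration, and the meaning of the trailing ~ is not stated in the post.

```python
import re

# One index element: headword, then optional (reading or #id), [sense], {surface}, ~
ELEMENT = re.compile(
    r"(?P<headword>[^(\[{~]+)"
    r"(?:\((?P<paren>[^)]*)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<surface>[^}]*)\})?"
    r"(?P<marked>~)?"
)


def parse_index(index):
    """Split an index string into a list of element dicts."""
    elements = []
    for token in index.split():
        m = ELEMENT.fullmatch(token)
        if m is None:
            raise ValueError(f"unrecognized index element: {token!r}")
        paren = m["paren"]
        is_id = paren is not None and paren.startswith("#")
        elements.append({
            "headword": m["headword"],
            "reading": None if is_id else paren,
            "entry_id": paren[1:] if is_id else None,
            "sense": m["sense"],
            "surface": m["surface"],
            "marked": m["marked"] is not None,
        })
    return elements


def is_kana(text):
    """True if every character is in the hiragana or katakana Unicode blocks."""
    return all("\u3040" <= ch <= "\u30ff" for ch in text)


def needs_lookup(element):
    """An element needs a JMdict lookup if it has no reading and is not pure kana."""
    written = element["surface"] or element["headword"]
    return element["reading"] is None and not is_kana(written)


index = "彼(かれ)[01] は 物静か{物静かな}~ 人(ひと) で(#2028980) 良家 の 出(で) だ{だった}"
elements = parse_index(index)
print([e["headword"] for e in elements if needs_lookup(e)])  # ['物静か', '良家']
```

Under these assumptions, only 物静か and 良家 require fetching a reading from JMdict for this sentence. Every other element either carries its reading in the index string or is already written in kana.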

For the remaining 4,000 sentences that could not be validated, I may have to fix them manually if I cannot think of a very clever way to automate the process. I have already corrected 200 of these sentences manually. About a tenth of these sentences required fixes to the index strings as well. This is a very slow and laborious process.

Keyword highlighting in example sentences

The example sentence data distributed by the EDRDG includes a "keyword" value with each sentence. For example, the keyword for the sentence "算数は数を取り扱う" is "数". However, "数" appears twice in this sentence, and the data does not indicate which instance is the one we are interested in.

There are over 200 such sentences in which the keyword value appears more than once. I manually reviewed each sentence and noted the correct instance of the keyword. The new version of Jitendex underlines only these correct instances.
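To make the ambiguity concrete, here is a minimal Python sketch that finds every occurrence of a keyword and underlines one chosen instance. The function names and the `<u>` markup are assumptions for illustration, not Jitendex's actual code. The choice of the second occurrence for this sentence is also an assumption, on the grounds that the first 数 is part of 算数.

```python
def occurrences(sentence, keyword):
    """Return the start offset of every occurrence of `keyword` in `sentence`."""
    starts = []
    pos = sentence.find(keyword)
    while pos != -1:
        starts.append(pos)
        pos = sentence.find(keyword, pos + 1)
    return starts


def underline(sentence, keyword, which):
    """Wrap the `which`-th occurrence (0-based) of `keyword` in <u> tags."""
    start = occurrences(sentence, keyword)[which]
    end = start + len(keyword)
    return sentence[:start] + "<u>" + sentence[start:end] + "</u>" + sentence[end:]


sentence = "算数は数を取り扱う"
print(occurrences(sentence, "数"))   # [1, 3]
print(underline(sentence, "数", 1))  # 算数は<u>数</u>を取り扱う
```

A manually recorded instance number per sentence, like the `which` argument here, is all the extra data needed to resolve the ambiguity.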

Before and after screenshots of the entry for 数

This can make quite a big difference in the helpfulness of these sentences. For example, "の" can appear in a sentence several times and serve several different purposes, but the point of an example sentence is to illustrate one particular purpose.

Example sentences in the entry for の displayed in GoldenDict-ng

Kimeiga · 2024-02-21T20:39:26Z

This project is so wicked cool. Thank you!
