Example Sentences (from Tanaka Corpus) sometimes linked to incorrect sense (ex: in 掛ける entry) #79

Closed
Kimeiga opened this issue Feb 21, 2024 · 4 comments
Assignees
Labels
question Further information is requested

Comments

@Kimeiga

Kimeiga commented Feb 21, 2024

Hi Stephen! Amazing work, thank you for contributing to the world's knowledge!

I have noticed some issues with the Tanaka Corpus, and am not sure where to discuss this, but since I intend to use Yomitan as my popup dictionary of choice for some time, figured I would mention it here. This problem comes up in other projects that use the Tanaka Corpus of course (e.g. Shirabe Jisho for iOS).

If you look up the dictionary entry for 掛ける in Jitendex, there are many examples of sentences from Tanaka Corpus being assigned to the wrong sense.

image

Sense 9 means "to multiply".

image

But the multiplication sentence is listed under sense 5.

Sense 11 means "to take a seat" and includes the correct reference,

image

but the example sentence is listed under sense 22, "to apply (insurance)".

image

Is there anything that can be done about this?

I read on the EDRDG wiki that the Tanaka Corpus is now within Tatoeba and it is its new "home". Does this mean each time we see something like this, we should correct it there?

Here's one of those sentences:

https://tatoeba.org/en/sentences/show/236991

I have an account with Tatoeba, but I'm afraid I don't know how to edit the sentences, and even if I did, would I be able to change the attribution information that links it to one of the senses in the jmdict?

Just bringing this to your attention in case it is not possible to change things at the source (the Tanaka Corpus itself) and we might need to make a file in Jitendex for all the manually assigned corrections or something.

@Kimeiga
Author

Kimeiga commented Feb 21, 2024

image

image

One interesting observation: I believe the example sentences are misassigned in entire groups. In Shirabe Jisho it is clear that all the "multiplication" kakerus are mixed in with the "spend time" kakerus, and all the "sit down" kakerus are mixed in with the "insurance" kakerus.

@Kimeiga
Author

Kimeiga commented Feb 21, 2024

After some research, I'm not sure, but I suspect this happened because entries in JMdict have been removed and others added over time, which may have caused a series of off-by-one errors that shifted these groups of example sentences around.

https://www.edrdg.org/jmdict_edict_list/2021/msg00083.html

@Kimeiga
Author

Kimeiga commented Feb 21, 2024

Another thought: to your point on #37, jreibun might come out soon and be a better source of sentences than the Tanaka Corpus anyway, although I'm not sure when it will be released.

@stephenmk
Owner

stephenmk commented Feb 21, 2024

> Hi Stephen! Amazing work, thank you for contributing to the world's knowledge!

Thanks, I'm always glad to hear that people like the project.

> If you look up the dictionary entry for 掛ける in Jitendex, there are many examples of sentences from Tanaka Corpus being assigned to the wrong sense.

Yes, these errors are very common. I have probably fixed a couple hundred of them over the past year.

> I have an account with Tatoeba, but I'm afraid I don't know how to edit the sentences

Tatoeba has a very primitive GUI for editing the links to JMdict entries. It is technically open to the public to use, but it is extremely user-unfriendly and difficult to use correctly.

Feel free to let me know when you spot these errors and I'll go fix them. A couple of other users have also been reporting these errors to me in the discussion forum.

> After some research, I'm not sure, but I suspect this happened because entries in JMdict have been removed and others added over time, which may have caused a series of off-by-one errors that shifted these groups of example sentences around.

That is indeed a common reason for the errors. Whenever entries in JMdict are edited, the editors need to remember to update the sentence links as well. We try to keep this in mind, but sometimes we forget. I recently suggested that some of this sentence information should be displayed in the JMdict database editor to make it easier to remember, but this is a volunteer project and things don't always move quickly.
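
For readers unfamiliar with how this kind of drift happens, here is a minimal sketch of the failure mode. It assumes sentence links are stored as a position in the entry's sense list, and everything in it (the sense glosses, the sentence ids, and the helper names `resolve` and `stale_links`) is hypothetical and simplified; it is not the real JMdict or Tatoeba data model.

```python
# Hypothetical, simplified sense list for one entry (not the real 掛ける entry).
senses_before = [
    "to hang",
    "to sit down",
    "to spend (time, money)",
    "to multiply",
]

# Hypothetical sentence links, each storing a 1-based sense position.
links = {"sentence_a": 4, "sentence_b": 2}


def resolve(senses, links):
    """Map each sentence id to the gloss its stored sense number points at."""
    return {
        sid: senses[n - 1]
        for sid, n in links.items()
        if 1 <= n <= len(senses)
    }


def stale_links(before, after, links):
    """Return the sentence ids whose resolved gloss changed after an edit."""
    old = resolve(before, links)
    new = resolve(after, links)
    return sorted(sid for sid in links if old.get(sid) != new.get(sid))


# An editor inserts a new sense at position 3 but forgets to update the links.
senses_after = senses_before[:2] + ["to wear (glasses)"] + senses_before[2:]

print(resolve(senses_before, links))
print(resolve(senses_after, links))
print(stale_links(senses_before, senses_after, links))
```

In this toy case the "multiply" sentence silently ends up under "to spend (time, money)", and every sentence linked to a sense at or after the insertion point shifts together, which would be consistent with whole groups of sentences moving at once.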

> Another thought: to your point on #37, jreibun might come out soon and be a better source of sentences than the Tanaka Corpus anyway, although I'm not sure when it will be released.

It's been almost a year since the last public update from that project, so I'm not sure how soon that will be. Fingers crossed.

@stephenmk stephenmk self-assigned this Feb 21, 2024
@stephenmk stephenmk added the question Further information is requested label Feb 21, 2024