Short Form / Long Form inconsistency for BCCWJ #8

Open
JSchoreels opened this issue Nov 28, 2024 · 0 comments

Hello!

Thanks for the work you've done; it's a very useful tool for me.

I just noticed some inconsistencies in how your terms_BCCWJ.js is built:

BCCWJ gives two frequencies: one based on the "short form" and one based on the "long form". The short form is the frequency of the word when it is used standalone, and the long form is its frequency when it appears in compounds.

Example: さん is extremely common in compounds but very rare as a standalone word.

BCCWJ gives these values:
image

But your dictionary is set up to give it 4024:
image

Sometimes it takes the long form rank and sometimes the short form rank, yet it does not seem to follow any specific rule:

image

上げる 160
  • BCCWJ: 160
  • BCCWJ: 229
本来 1605
  • BCCWJ: 1394
  • BCCWJ: 1605

As you can see, for 上げる it took the first number, which is the lower of the two, and for 本来 it took the second, which is the higher.
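To make the comparison concrete, here is a minimal Python sketch that checks which of the two listed BCCWJ ranks the dictionary value matches. The `Entry` type and `describe_choice` function are hypothetical names I made up for illustration, not part of the project, and the numbers are just the two examples above, in the order they are listed.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    term: str
    dictionary_rank: int  # value shown by the dictionary
    first_rank: int       # first BCCWJ value listed for the term
    second_rank: int      # second BCCWJ value listed for the term


def describe_choice(entry: Entry) -> str:
    """Report which of the two listed ranks the dictionary value matches."""
    if entry.first_rank == entry.second_rank == entry.dictionary_rank:
        return "both"
    if entry.dictionary_rank == entry.first_rank:
        position = "first"
    elif entry.dictionary_rank == entry.second_rank:
        position = "second"
    else:
        return "neither"
    lowest = min(entry.first_rank, entry.second_rank)
    size = "lower" if entry.dictionary_rank == lowest else "higher"
    return f"{position}, {size}"


# The two examples from this issue, values copied as listed.
entries = [
    Entry("上げる", 160, 160, 229),
    Entry("本来", 1605, 1394, 1605),
]

results = {e.term: describe_choice(e) for e in entries}
for term, choice in results.items():
    print(term, "->", choice)
```

With these two entries the results differ in both position and size, so neither "always the first value" nor "always the lower value" explains what the dictionary returns.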

So while I can understand that only one value is returned, I think it's inconsistent that there is no way to know which one is taken.

What do you think?
