Incorrect Zipf exponent #19

Open
christophsk opened this issue Nov 17, 2021 · 0 comments

I added your Zipf cost to a classroom demo (https://github.com/christophsk/segment-string) and found that the string "iamnotanumberiamaperson" segments as "iam not a number iam a person" instead of "i am not a number i am a person". The latter is the segmentation found using word probabilities from English Wikipedia.

The cause is the Zipf exponent log(len(words)) in

self._wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))

This value is too large. The exponent is a constant, independent of the size of the language model. Measurements suggest a value of about 2.5, i.e., frequency is proportional to 1 / rank^2.5. Using this value produces the correct result for the string above.

I suggest changing line 33 of wordninja.py to

self._wordcost = dict((k, log((i+1) * 2.5)) for i,k in enumerate(words))
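
To make the effect of the constant concrete, here is a minimal sketch of a minimum-cost segmenter. It is not wordninja's actual implementation: the `RANKS` table, its rank values, the `segment` function, and the 100,000-word list size are all hypothetical, chosen only for illustration. Because log((i+1) * c) = log(i+1) + log(c), the factor c adds a fixed log(c) to every word's cost, so a larger c favors segmentations with fewer words.

```python
from math import inf, log

# Hypothetical word ranks (1 = most frequent). These are illustrative
# values, not the ranks in wordninja's word list.
RANKS = {"a": 1, "i": 2, "not": 5, "am": 10, "number": 30, "person": 40, "iam": 100}


def segment(text, factor):
    """Split text into known words, minimizing the sum of log(rank * factor)."""
    cost = {word: log(rank * factor) for word, rank in RANKS.items()}
    max_len = max(len(word) for word in cost)

    # best[i] = (minimum cost of text[:i], start index of its last word)
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        candidates = [
            (best[j][0] + cost.get(text[j:i], inf), j)
            for j in range(max(0, i - max_len), i)
        ]
        best.append(min(candidates))

    # Walk back from the end of the string to recover the words.
    words = []
    i = len(text)
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return " ".join(reversed(words))


# Constant factor, as suggested above.
print(segment("iamnotanumberiamaperson", 2.5))
# Factor that grows with the (assumed) size of the word list.
print(segment("iamnotanumberiamaperson", log(100_000)))
```

With these assumed ranks, the first call splits "iam" into "i am", while the second keeps "iam" as one word, which mirrors the behavior described above.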