I added your Zipf cost to a classroom demo I had (https://github.com/christophsk/segment-string) and found that the string "iamnotanumberiamaperson" segments as "iam not a number iam a person" instead of "i am not a number i am a person". The latter is found using word probabilities from English Wikipedia.
The cause is the Zipf exponent log(len(words)) in
self._wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
This value is too large. The exponent is a constant, independent of the size of the language model. Measurements suggest a value of about 2.5, i.e., frequency is proportional to 1 / rank^2.5. Using this value produces the correct segmentation.
I suggest changing line 33 of wordninja.py to
self._wordcost = dict((k, log((i+1) * 2.5)) for i,k in enumerate(words))
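To illustrate why the constant matters, here is a minimal sketch rather than wordninja's actual code. Because log((i+1) * c) = log(i+1) + log(c), the constant adds log(c) to the cost of every word, so a larger constant favors splits with fewer words. The vocabulary, its ranks, the helper names `build_costs` and `segment`, and the dictionary size of 100,000 are all assumptions made up for this example.

```python
from math import inf, log

# Hypothetical 1-based ranks for a toy vocabulary (not taken from wordninja's word list).
HYPOTHETICAL_RANKS = {
    "a": 1, "i": 2, "am": 3, "not": 4, "number": 5, "person": 6, "iam": 40,
}
TEXT = "iamnotanumberiamaperson"


def build_costs(ranks, scale):
    """Word cost log(rank * scale), mirroring log((i+1) * scale) from the issue."""
    return {word: log(rank * scale) for word, rank in ranks.items()}


def segment(text, costs):
    """Return the minimum-total-cost split of text into known words."""
    max_len = max(len(word) for word in costs)
    # best[i] = (cost of the cheapest split of text[:i], start index of its last word)
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        candidates = [
            (best[j][0] + costs[text[j:i]], j)
            for j in range(max(0, i - max_len), i)
            if text[j:i] in costs and best[j][0] != inf
        ]
        best.append(min(candidates) if candidates else (inf, -1))
    if best[-1][0] == inf:
        raise ValueError("text cannot be split into known words")
    # Walk back through the stored start indices to recover the words.
    words = []
    i = len(text)
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return " ".join(reversed(words))


# Constant that grows with an assumed dictionary size of 100,000 words.
print(segment(TEXT, build_costs(HYPOTHETICAL_RANKS, log(100_000))))
# prints: iam not a number iam a person

# Fixed constant of 2.5, as suggested above.
print(segment(TEXT, build_costs(HYPOTHETICAL_RANKS, 2.5)))
# prints: i am not a number i am a person
```

With these made-up ranks, "i am" beats "iam" only when rank("i") * rank("am") * c < rank("iam"), i.e. 6c < 40, which holds for c = 2.5 but not for c = log(100000) ≈ 11.5. Real word lists will have different ranks, so the exact crossover point will differ.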