Incorrect Zipf exponent #19

christophsk · 2021-11-17T18:00:13Z

I added your Zipf cost to a classroom demo I had (https://github.com/christophsk/segment-string) and found that the string "iamnotanumberiamaperson" segments as "iam not a number iam a person" instead of "i am not a number i am a person". The latter is found using word probabilities from English Wikipedia.

The cause is the Zipf exponent log(len(words)) in

self._wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))

This value is too large. The exponent is a constant, independent of the size of the language model. Measurements suggest a value of about 2.5, i.e., frequency is proportional to 1 / rank^2.5. Using this value produces the correct result.

Suggest Line 33 in wordninja.py be changed to

self._wordcost = dict((k, log((i+1) * 2.5)) for i,k in enumerate(words))
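
To make the effect of the constant concrete, here is a minimal sketch that compares the total cost of "i am" against "iam" under both settings. Since log((i+1) * c) = log(i+1) + log(c), the constant adds log(c) to every word in a split, so a larger value penalizes splits with more words. The ranks and the vocabulary size below are made-up illustrative numbers, not values taken from the wordninja word list, and `word_cost` and `split_cost` are hypothetical helper names.

```python
from math import log


def word_cost(rank, c):
    """Cost of the word at 1-based rank, mirroring log((i+1) * c)."""
    return log(rank * c)


def split_cost(ranks, c):
    """Total cost of a candidate segmentation, given the rank of each word."""
    return sum(word_cost(r, c) for r in ranks)


# ASSUMPTION: illustrative ranks only, not read from the real word list.
RANK_I, RANK_AM, RANK_IAM = 10, 50, 3000

# ASSUMPTION: a vocabulary of 100,000 words, to stand in for len(words).
VOCAB_SIZE = 100_000

for label, c in (("log(len(words))", log(VOCAB_SIZE)), ("2.5", 2.5)):
    two_words = split_cost([RANK_I, RANK_AM], c)
    one_word = split_cost([RANK_IAM], c)
    winner = "i am" if two_words < one_word else "iam"
    print(f"c = {label}: 'i am' costs {two_words:.2f}, "
          f"'iam' costs {one_word:.2f}, cheaper: {winner}")
```

With these assumed ranks, "iam" is cheaper when c = log(len(words)) and "i am" is cheaper when c = 2.5, which matches the behaviour described above. Whether the same holds for a given word list depends on the actual ranks of the words involved.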
