Added a module to split Japanese words #3158
base: master
Conversation
The failing tests may not be related to your code. I have the same in an unrelated change. I'm investigating.
Can you please rebase your code on master? This should make the CI errors go away.
Looks mostly good. Just two minor comments from my side.
for p in phrases)))
return normalized

def split_key_japanese_phrases(
You probably forgot to delete the old code here.
@@ -29,6 +29,7 @@ class BreakType(enum.Enum):
    """ Break created as a result of tokenization.
        This may happen in languages without spaces between words.
    """
    SOFT_PHRASE = ':'
Can you please add documentation for this new type, just like it is done in the lines above.
Thank you so much for your help and comments.
I added the documentation.
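As an illustration only, documentation in the style of the neighbouring BreakType entries could look roughly like this (the exact wording committed in dfbacf4 may differ):

    SOFT_PHRASE = ':'
    """ Break at a "," marker inserted by the Japanese splitter after a
        large administrative division (prefecture, city). Words assembled
        across this break receive an extra penalty.
    """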
Force-pushed from 8f43956 to dfbacf4
In this code, Japanese addresses are divided into three categories based on administrative divisions: cities, municipalities, and everything below them.
Nominatim uses ICU (International Components for Unicode) transliteration on user-entered addresses to split them into meaningful words. Here is a debugging example; there are many candidate splits.
Fig. 1 An example of the debugging output.
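As a rough, standalone illustration of the transliteration step described above (plain PyICU, not Nominatim's actual call path, which configures ICU through its tokenizer rules):

    # Standalone PyICU sketch; the transform id and output are illustrative.
    from icu import Transliterator

    trans = Transliterator.createInstance('Any-Latin; Latin-ASCII')

    # The Han characters get Latin readings, which is where edge labels
    # such as 'da', 'ban' and 'shi' in the node graph below come from.
    print(trans.transliterate('大阪市大阪'))   # roughly 'da ban shi da ban'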
To make this split more accurate, when large administrative divisions (prefecture and city) appear in the string, the algorithm pre-separates them and puts "," markers between the split words; a minimal sketch of this step follows below.
This "," is mapped to BreakType.SOFT_PHRASE in the program, and words spanning such a node are penalized with a lower search priority.
The node relationships are as follows:
(1)--da->(2)--ban->(3)--shi->(4)--da->(5)--ban->(6)
||                           ^^                  ||
|+----------大阪市-----------+ +------大阪-------+|
+-------------------大阪市大阪--------------------+
As a result of this change, "大阪市大阪", which contains a SOFT_PHRASE break, is penalized more and given a lower search priority than "大阪市", the name of the city (the fifth value from the left is the penalty value).
Fig. 2 Before the change.
Fig. 3 After the change.
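To make the penalty ordering concrete, here is a toy sketch of the scoring idea; the penalty constants and the data structure are assumptions for illustration, and Nominatim's real query analysis is more involved:

    import enum
    from dataclasses import dataclass
    from typing import List

    class BreakType(enum.Enum):
        WORD = ' '          # assumed value of the ordinary word break
        SOFT_PHRASE = ':'   # break at the "," markers from the splitter

    # Assumed per-break penalties, chosen only to show the ordering.
    BREAK_PENALTY = {BreakType.WORD: 0.1, BreakType.SOFT_PHRASE: 0.2}

    @dataclass
    class Candidate:
        text: str
        breaks: List[BreakType]   # breaks this candidate spans in the graph

        def penalty(self) -> float:
            return sum(BREAK_PENALTY[b] for b in self.breaks)

    city = Candidate('大阪市', [])                               # nodes 1..4
    combined = Candidate('大阪市大阪', [BreakType.SOFT_PHRASE])  # nodes 1..6

    # The combined span crosses the soft phrase break and ranks lower.
    assert combined.penalty() > city.penalty()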