I think going parserless will get us most of the way to reasonable international performance, but we'll need to ditch permute_road and friends. I'm thinking it makes the most sense to use the following algorithm for that, for each POI:
1. Use OpenCage address formatting to create a text address for the POI. This happens after admin area population, which is important for libpostal language detection ("c/ de villarroel, barcelona" and "c/ de villarroel" have different expansions; the former uses Catalan in addition to Spanish).
2. Call into libpostal's expand endpoint and collect the results.
3. For each expansion, perform the following steps:
   1. Call into libpostal's parse endpoint and collect the parsed tokens.
   2. For each parsed token, substitute all possible abbreviations across all languages. e.g. "Saint Louis" will become ["St Louis", "Saint Louis"] because of the English personal_titles substitution dictionary. "carrer de villarroel" will become ["c/ de villarroel", "carrer de villarroel"].
   3. Index each of the substituted tokens.
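Rough sketch of the substitution step, just to make the idea concrete. It assumes a tiny hand-written table standing in for libpostal's real per-language dictionaries; `SUBSTITUTION_GROUPS`, `build_lookup`, and `token_variants` are hypothetical names, not libpostal APIs. It also lowercases everything and only handles single-word phrases, which the real dictionaries do not limit themselves to.

```python
from itertools import product

# Hypothetical stand-in for libpostal's substitution dictionaries.
# Each tuple lists interchangeable forms of the same word.
SUBSTITUTION_GROUPS = [
    ("saint", "st"),   # English personal_titles-style entry
    ("carrer", "c/"),  # Catalan street-type-style entry
]


def build_lookup(groups):
    """Map every form to the full list of forms in its group."""
    lookup = {}
    for group in groups:
        for form in group:
            lookup[form] = sorted(group)
    return lookup


def token_variants(text, lookup):
    """Return every combination of substitutions for a parsed token."""
    words = text.lower().split()
    # Words with no dictionary entry only have themselves as a choice.
    choices = [lookup.get(word, [word]) for word in words]
    return sorted(" ".join(combo) for combo in product(*choices))


LOOKUP = build_lookup(SUBSTITUTION_GROUPS)
print(token_variants("Saint Louis", LOOKUP))
print(token_variants("carrer de villarroel", LOOKUP))
```

Each returned string would then be indexed. The number of variants is the product of the per-word choices, so a token with several substitutable words grows quickly; that may be worth capping in a real implementation.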
I think this is a reasonable way to do international permute_street but it means we'll need to perform numex at query time, because e.g. "third ave" will become "3rd ave" during indexing, and we need to match that behavior if we want to match the correct documents.
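To illustrate why the query side has to do the same thing, here is a minimal sketch that runs one normalizer over both the indexed text and the query. `ORDINAL_WORDS` and `normalize_numex` are hypothetical and cover a handful of English ordinals only; libpostal's actual numex handling is far more extensive and multilingual.

```python
# Hypothetical, English-only stand-in for numex normalization.
ORDINAL_WORDS = {
    "first": "1st",
    "second": "2nd",
    "third": "3rd",
    "fourth": "4th",
    "fifth": "5th",
}


def normalize_numex(text):
    """Rewrite spelled-out ordinals to their numeric form."""
    return " ".join(ORDINAL_WORDS.get(word, word) for word in text.lower().split())


# The same function is applied when indexing and when querying,
# so both sides end up with the same form.
indexed = normalize_numex("Third Ave")
query = normalize_numex("third ave")
print(indexed, query, indexed == query)
```

If only the index side were normalized, a query for "third ave" would be looking for a term that was never indexed.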
We'll need to modify the dictionary substitution behavior to add the empty string for each street_type substitution for languages where it's appropriate, otherwise "fremont ave" won't match "fremont ave n"
TODO: strasse suffixes and whatnot will need to be handled too. I might look into the libpostal codebase to see if there's anything we can reuse there.
Okay, I ended up doing this way differently than expected. Namely, I rely on OpenStreetMap to use the unabbreviated street names, then I go through and detect the language with lingua-rs (this should use WOF official/spoken languages instead) and do the substitutions based on that language. There are some issues. Anecdotally it seems to work ok for Spanish and Catalan street names, but I haven't had much luck with French or German street names.