Make better use of spoken language data in WhosOnFirst #12

Open
ellenhp opened this issue Feb 25, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

ellenhp commented Feb 25, 2024

At a bare minimum, spoken language data should inform the dictionary choice used for generating all the abbreviation permutations in airmail_indexer.
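As a rough illustration of what "inform the dictionary choice" could mean, here is a minimal sketch that only applies abbreviation dictionaries for the languages spoken in a place. Everything in it is an assumption for illustration: `dictionary_for`, `abbreviation_permutations`, the three-letter language codes, and the tiny sample dictionaries are hypothetical and are not taken from airmail_indexer.

```rust
use std::collections::BTreeSet;

/// Hypothetical per-language abbreviation tables (tiny illustrative samples).
fn dictionary_for(lang: &str) -> &'static [(&'static str, &'static str)] {
    match lang {
        "eng" => &[("street", "st"), ("avenue", "ave")],
        "spa" => &[("calle", "c"), ("avenida", "av")],
        _ => &[],
    }
}

/// Expand `name` into every combination of abbreviated and unabbreviated
/// tokens, consulting only the dictionaries for `spoken_langs`.
fn abbreviation_permutations(name: &str, spoken_langs: &[&str]) -> BTreeSet<String> {
    let mut variants: Vec<String> = vec![String::new()];
    for token in name.split_whitespace() {
        let lower = token.to_lowercase();
        // Each token can stay as-is or be swapped for a known abbreviation.
        let mut options = vec![lower.clone()];
        for &lang in spoken_langs {
            for (long, short) in dictionary_for(lang) {
                if *long == lower {
                    options.push((*short).to_string());
                }
            }
        }
        // Cross the variants built so far with this token's options.
        let mut next = Vec::new();
        for prefix in &variants {
            for option in &options {
                let mut variant = prefix.clone();
                if !variant.is_empty() {
                    variant.push(' ');
                }
                variant.push_str(option);
                next.push(variant);
            }
        }
        variants = next;
    }
    variants.into_iter().collect()
}
```

With only `"eng"` spoken, `"Main Street"` expands to `main street` and `main st`, while a Spanish token like `calle` is left alone, which avoids generating permutations from dictionaries that don't apply to the area.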

I also want to find a way to use it to stem text correctly for each language. Once focus point queries are supported (currently we only have bounding box queries), we can look up in WOF the languages spoken at the focus point and in surrounding areas, and use stemmers for those languages. Doing this will involve splitting the fields we use out by language. Currently there's only one field, "content", but eventually we'll need more for handling matches that need to get boosted. It's outside the scope of this issue, but those boosted fields may also need a version for each language. I'm thinking we can use lingua-rs to pick the top 5 possible languages for every query, and then search against the fields for those languages in a disjunction, using stemmers as appropriate?
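The shape of that idea, sketched with the standard library only: take scored language candidates (the kind of output a detector such as lingua-rs could supply), keep the top few, and build one clause per language-specific field. The `content_<lang>` naming scheme, the function names, and the rendered `OR` string are all hypothetical; this is not the syntax of any particular search engine, and in practice each clause would be analyzed with that language's stemmer.

```rust
/// Hypothetical naming scheme for per-language variants of the "content" field.
fn field_for(lang: &str) -> String {
    format!("content_{}", lang)
}

/// Keep the `k` most confident languages from (language, confidence) pairs.
/// The sort is stable, so ties keep their input order.
fn top_languages<'a>(mut scored: Vec<(&'a str, f64)>, k: usize) -> Vec<&'a str> {
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().take(k).map(|(lang, _)| lang).collect()
}

/// Render one clause per candidate language, joined as a disjunction.
fn disjunction(langs: &[&str], query: &str) -> String {
    langs
        .iter()
        .map(|lang| format!("{}:({})", field_for(lang), query))
        .collect::<Vec<_>>()
        .join(" OR ")
}
```

For example, `disjunction(&["eng", "deu"], "tow truck")` yields `content_eng:(tow truck) OR content_deu:(tow truck)`. Capping the candidate list (5 in the proposal above) bounds how many fields each query touches, which is where the performance cost comes from.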

There will be a performance cost to this, of course, but the lack of stemmers is really disappointing: with lenient mode off (no prefix queries allowed), I can't search for "mighty-o donut" if the POI is called "mighty-o donuts". When I briefly had stemming working on a feature branch, it was so cool to watch things like "tow truck" match "XYZ towing company". That's the kind of thing that I think airmail needs to really stand out, even if it has to be disabled for remote indexes.
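To make the "donut" vs. "donuts" problem concrete, here is a toy sketch of exact term matching with and without stemming. `toy_stem` is a deliberately naive suffix stripper written for this example only; it is not a real stemming algorithm and a real setup would use a proper per-language stemmer. All names are hypothetical.

```rust
/// Toy suffix stripper, for illustration only.
fn toy_stem(token: &str) -> String {
    let lower = token.to_lowercase();
    for suffix in ["ing", "s"] {
        if let Some(stripped) = lower.strip_suffix(suffix) {
            // Avoid reducing very short words to almost nothing.
            if stripped.len() >= 3 {
                return stripped.to_string();
            }
        }
    }
    lower
}

/// Lowercased tokens with no stemming applied.
fn exact_terms(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// Tokens reduced with the toy stemmer.
fn stemmed_terms(text: &str) -> Vec<String> {
    text.split_whitespace().map(toy_stem).collect()
}

/// True if every query term appears verbatim among the name's terms
/// (exact term matching, no prefix queries).
fn all_terms_match(query_terms: &[String], name_terms: &[String]) -> bool {
    query_terms.iter().all(|q| name_terms.contains(q))
}
```

Without stemming, `donut` and `donuts` are different terms, so `"mighty-o donut"` fails to match `"mighty-o donuts"` under `all_terms_match`. With `stemmed_terms` on both sides they match, and `towing` and `tow` likewise reduce to the same term.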

ellenhp added the enhancement (New feature or request) label Feb 25, 2024
ellenhp added this to the Initial release milestone Feb 25, 2024