Make better use of spoken language data in WhosOnFirst #12

ellenhp · 2024-02-25T18:20:53Z

At a bare minimum, spoken language data should inform the dictionary choice used for generating all the abbreviation permutations in airmail_indexer.
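A minimal sketch of what dictionary selection by spoken language could look like. Everything here is hypothetical: the `dictionary_for` and `permutations` names, the three-letter language codes, and the dictionary entries are stand-ins for illustration, not what airmail_indexer actually does.

```rust
/// Hypothetical abbreviation dictionary lookup, keyed by a language code.
/// The entries are illustrative only, not airmail_indexer's real data.
fn dictionary_for(lang: &str) -> &'static [(&'static str, &'static str)] {
    match lang {
        "eng" => &[("street", "st"), ("avenue", "ave"), ("north", "n")],
        "fra" => &[("boulevard", "bd"), ("avenue", "av")],
        _ => &[],
    }
}

/// Generate every permutation of `name` in which each token is either kept
/// (lowercased) or replaced by an abbreviation taken from the dictionaries
/// of the languages spoken in the area.
fn permutations(name: &str, spoken_languages: &[&str]) -> Vec<String> {
    let mut results: Vec<String> = vec![String::new()];
    for token in name.split_whitespace() {
        let lower = token.to_lowercase();
        // The token itself is always an option; abbreviations are added on top.
        let mut options: Vec<String> = vec![lower.clone()];
        for lang in spoken_languages {
            for (full, abbr) in dictionary_for(lang) {
                let already_present = options.iter().any(|o| o.as_str() == *abbr);
                if *full == lower.as_str() && !already_present {
                    options.push((*abbr).to_string());
                }
            }
        }
        // Extend every partial result with every option for this token.
        let mut next = Vec::with_capacity(results.len() * options.len());
        for prefix in &results {
            for option in &options {
                if prefix.is_empty() {
                    next.push(option.clone());
                } else {
                    next.push(format!("{} {}", prefix, option));
                }
            }
        }
        results = next;
    }
    results
}
```

With these toy dictionaries, `permutations("Avenue Foch", &["fra"])` only produces the French abbreviation, while passing `&["eng", "fra"]` produces both. Restricting the dictionaries to the languages actually spoken in an area keeps the permutation count down.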

I also want to find a way to use it to choose the right stemmers. Once focus point queries are supported (currently we only have bounding box queries), we can look up in WOF the languages spoken at the focus point and in the surrounding areas, and use stemmers for those languages. Doing this will involve splitting out the fields we use by language. Currently there's only one field, "content", but eventually we'll need more for handling matches that need to get boosted. It's outside the scope of this issue, but those boosted fields may also need a version for each language. I'm thinking we can use lingua-rs to pick the top 5 possible languages for every query, and then search against those fields in a disjunction, using stemmers as appropriate?
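A rough sketch of the "top N languages, then a disjunction over per-language fields" idea. The `LanguageGuess` struct stands in for whatever lingua-rs would return, and the `content_<code>` field names and the query string syntax are assumptions made up for illustration; the query text is not escaped.

```rust
/// A language guess with a confidence score. This is a stand-in for
/// language detector output, not a lingua-rs type.
struct LanguageGuess {
    code: &'static str,
    confidence: f64,
}

/// Keep the `n` most confident guesses, highest confidence first.
fn top_languages(mut guesses: Vec<LanguageGuess>, n: usize) -> Vec<LanguageGuess> {
    guesses.sort_by(|a, b| b.confidence.total_cmp(&a.confidence));
    guesses.truncate(n);
    guesses
}

/// Build one clause per language-specific field and join them with OR.
/// The `content_<code>` naming is hypothetical; each such field would be
/// indexed with the stemmer for its language.
fn disjunction_query(query: &str, languages: &[LanguageGuess]) -> String {
    languages
        .iter()
        .map(|guess| format!("content_{}:\"{}\"", guess.code, query))
        .collect::<Vec<String>>()
        .join(" OR ")
}
```

Each extra language adds a clause to evaluate, so the cutoff (5 in the proposal above) is the knob that trades recall against query cost.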

There will be a performance cost to this of course, but the lack of stemmers is really disappointing because with lenient mode off (no prefix queries allowed) I can't search for "mighty-o donut" if the POI is called "mighty-o donuts". When I briefly had stemming working on a feature branch it was so cool to watch things like "tow truck" match "XYZ towing company". That's the kind of thing that I think airmail needs to really stand out, even if it has to be disabled for remote indexes.
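To make the two examples above concrete, here is a toy illustration of why stemming changes which terms match. The `toy_stem` function is a deliberately naive suffix stripper written for this sketch, not a real stemming algorithm and not what the feature branch used; `shared_terms` is likewise a hypothetical helper.

```rust
/// Naive suffix stripping for illustration only: lowercases the token and
/// removes a trailing "ing" or "s" if at least three bytes remain.
fn toy_stem(token: &str) -> String {
    let lower = token.to_lowercase();
    for suffix in ["ing", "s"] {
        if let Some(stem) = lower.strip_suffix(suffix) {
            if stem.len() >= 3 {
                return stem.to_string();
            }
        }
    }
    lower
}

/// Return the query terms that also appear in the POI name, comparing
/// either lowercased tokens (`stem == false`) or toy-stemmed tokens.
fn shared_terms(query: &str, name: &str, stem: bool) -> Vec<String> {
    let normalize = |t: &str| if stem { toy_stem(t) } else { t.to_lowercase() };
    let name_terms: Vec<String> = name.split_whitespace().map(|t| normalize(t)).collect();
    query
        .split_whitespace()
        .map(|t| normalize(t))
        .filter(|t| name_terms.contains(t))
        .collect()
}
```

Without stemming, "mighty-o donut" shares only "mighty-o" with "mighty-o donuts", and "tow truck" shares nothing with "XYZ towing company". With the toy stemmer, "donut" and "tow" both match.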


ellenhp added the enhancement New feature or request label Feb 25, 2024

ellenhp added this to the Initial release milestone Feb 25, 2024
