Adds support of search_languages so that we can limit the list of languages for ES search via request param #615
base: master
Conversation
Here is a usage example of the "search_languages" parameter, which limits the list of languages ES will search through: /api/?q=gdansk&limit=7&lang=en&search_languages=en,it,de In this case, even if photon is running with the extended -languages list (ru,en,de,uk,es,pt,fr,zh,pl,it,tr,ja,vi,id,ms,th,hi), the ES query will only search through English, Italian, German and the local languages.
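A minimal client-side sketch of the call above, assuming a photon instance with this patch listening on localhost:2322 (the default port); the endpoint and parameters are taken from the example, everything else is illustrative:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SearchLanguagesRequest {
    public static void main(String[] args) throws Exception {
        // Search for "gdansk", ask for English labels, but restrict the ES search
        // to English, Italian and German (plus the local/default names).
        URI uri = URI.create(
                "http://localhost:2322/api/?q=gdansk&limit=7&lang=en&search_languages=en,it,de");

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // The response is the usual GeoJSON FeatureCollection returned by /api.
        System.out.println(response.body());
    }
}
```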
I very much agree that this is a much-needed feature. However, I am wondering: do we really need an extra parameter for it? Or would it not be sufficient to extend the lang parameter to accept a list of languages?
The disadvantage is that such an approach would change the meaning of lang. In particular: when a single language is used, we really only search in that language and the local language. Would that be an issue?
On the technical side: the CI error has nothing to do with your PR. I've fixed it on master. Can you please rebase at some point? Also, the new feature will need a test eventually, but let's settle on the API first.
Yes, I think that would be an issue.
Consider an instance running with -languages es,pt,en where clients use the lang parameter to specify the output language. Their search would then be restricted to only one (output) language. There will also be an issue for countries that have multiple official or very widespread languages (pt/es, ru/uk and so on). The output language is set in our application across the whole company in which a user works. For example, a company might decide that its official language is Ukrainian. At the same time, the company might have Russian-speaking employees (whose second language is Ukrainian) who will prefer to search for addresses in their first language (Russian), even though the results will still be in Ukrainian. As far as I understand, the same issue might apply to es/pt-speaking countries and fr/de/it countries. I am not sure about the Asian region.
Force-pushed from 493d01a to 0e0a922
Done with the rebase.
I have added some tests for the new search_languages parameter.
I see two fundamental problems with the PR in its current state:
Sorting out the language parameters is tricky. It is not always obvious which language people are going to use for toponyms, especially in areas foreign to them. Similarly, when dealing with OSM input data it is not always clear what the language-specific tags contain. The translation is sometimes less well known than the local-language term. Nominatim has always dealt with this by searching through all data, and that seems to work reasonably well. Photon used to search only in the local and chosen language, and that frequently caused problems. It has only very recently switched to an algorithm where it searches through all languages. That is what is causing the performance problems now.

So what does that mean? Taking user language preferences into account when ranking results is a good thing. That's what I referred to in my first comment. I'm just not convinced that not considering foreign-language terms in the first place works out well. Unfortunately, that doesn't get us very far in terms of the performance issue, because 2 of the 3 places where the supported languages come into play are filter terms in the query. Only the part that prefers full-word terms relates to ranking exclusively.

In the end, the performance issue needs to be tackled separately, by changing our index structure in a way that makes filter queries over many languages a lot cheaper. That should be possible. The current multi-language support tried to build on the existing indexes to remain backwards-compatible with the database. But going forward we can alter the database structure.

NB: using the latest released version, you can run different instances of Photon with different language settings to restrict the set of languages used. The instances can be run against the same database. The set of supported languages is a run-time setting. So you can have a database that indexes all languages and then one instance for the es/pt/en users, one instance for the ru/uk/en users, etc.
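To illustrate the ranking-versus-filtering distinction, here is a hypothetical sketch (not photon's actual query builder) of a multi_match field list that keeps every indexed language searchable and only boosts the caller's preferred languages for ranking; the name.default / name.&lt;lang&gt; field names and the boost values are assumptions made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class LanguageBoostSketch {

    /**
     * Builds a field list for an Elasticsearch multi_match query.
     * Every indexed language stays searchable; the user's preferred
     * languages merely receive a higher boost so they rank first.
     */
    static List<String> buildFields(List<String> indexedLanguages, List<String> preferredLanguages) {
        List<String> fields = new ArrayList<>();
        fields.add("name.default^2.0");               // local-language name, always slightly preferred
        for (String lang : indexedLanguages) {
            double boost = preferredLanguages.contains(lang) ? 4.0 : 1.0;
            fields.add("name." + lang + "^" + boost); // boost instead of filtering the language out
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String> indexed = List.of("ru", "en", "de", "uk", "es", "pt", "fr", "it");
        List<String> preferred = List.of("en", "it", "de");
        System.out.println(buildFields(indexed, preferred));
        // [name.default^2.0, name.ru^1.0, name.en^4.0, name.de^4.0, ...]
    }
}
```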
I completely agree with the above. I will be waiting for the performance issue fix.
No, it's not. In our case, we know exactly which languages are expected for search, because our software already has built-in language settings that can be used as search languages. That's why this solution is going to work well for us until the performance fix is released.
I doubt that it is a good idea to have 17 instances of photon running, one for each of the supported languages: en,de,uk,es,pt,ru,fr,zh,pl,it,tr,ja,vi,id,ms,th,hi. At the very least, it seems to be a very RAM-consuming solution.
We have a performance issue when running photon with multiple languages like so:
-languages ru,en,de,uk,es,pt,fr,zh,pl,it,tr,ja,vi,id,ms,th,hi
We don't know exactly what the client language will be; it might be any language.
In this case, when a new request comes in, photon builds a query that tries to search across all supported languages, but this is too slow and unnecessary. We know, for example, that for a client from Brazil we only need to search across the pt, es and en languages, but there is currently no way to express that.
This commit adds a new "search_languages" parameter that can be used in the API to specify a comma-separated list of languages that ES should search through. If the parameter is not specified, photon uses the list of supported languages from the -languages parameter (the old logic applies).
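A minimal sketch of the fallback described above, with hypothetical helper and variable names (the actual request-handling code in photon differs): the search_languages request parameter wins when present, otherwise the -languages list is used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SearchLanguagesFallback {

    /**
     * Resolves the languages ES should search through: the comma-separated
     * search_languages request parameter if present, otherwise the full
     * list configured via -languages (the old behaviour).
     */
    static List<String> resolveSearchLanguages(String searchLanguagesParam,
                                               List<String> supportedLanguages) {
        if (searchLanguagesParam == null || searchLanguagesParam.isBlank()) {
            return supportedLanguages;
        }
        return Arrays.stream(searchLanguagesParam.split(","))
                .map(String::trim)
                .filter(supportedLanguages::contains)   // ignore languages the index does not cover
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> supported = Arrays.asList("ru", "en", "de", "uk", "es", "pt", "fr", "it");

        // ?search_languages=en,it,de  ->  [en, it, de]
        System.out.println(resolveSearchLanguages("en,it,de", supported));

        // parameter absent -> all supported languages (old logic)
        System.out.println(resolveSearchLanguages(null, supported));
    }
}
```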