Only populate OCR results in selected language #11215
Comments
The OCR extracts all the text. The text is then processed: using so-called stopwords, it is cut before and after the ingredients list. This does not work all the time, because it depends on the known stopwords. In this particular example, I guess "ingredients" is not a German word, so we could add it to the stopwords for the German language. They are in this file: Ingredients.pm (Maybe, just sharing some thoughts: we could add all the stopwords used before the ingredients as stopwords after the ingredients. I don't know if it would work; it would need some investigation. For example, cases where the same word for ingredients is used in different languages would be problematic. Ping @stephane, @aleene.) At least for the particular example in your issue, @github-throwaway, you can add "Ingr(e|é)dients" to the German stopwords.
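To make the stopword idea concrete, here is a minimal Python sketch of cutting OCR text before and after the ingredients list. It is not the code from Ingredients.pm (which is Perl): the names `START_PATTERNS`, `STOP_PATTERNS`, and `cut_ingredients`, and the pattern lists themselves, are assumptions made up for illustration. Only the "Ingr(e|é)dients" pattern comes from the suggestion above.

```python
import re

# Hypothetical per-language patterns; the real lists live in Ingredients.pm.
START_PATTERNS = {
    "de": [r"zutaten\s*:?"],
    "fr": [r"ingr[eé]dients\s*:?"],
}
STOP_PATTERNS = {
    # "ingr(e|é)dients" is the stopword proposed for German in this thread.
    "de": [r"ingr(e|é)dients", r"mindestens haltbar bis"],
    "fr": [r"à consommer de préférence"],
}


def cut_ingredients(text, lang):
    """Keep the text after the first matching start pattern and before
    the earliest match of any stop pattern for `lang`."""
    start = 0
    for pattern in START_PATTERNS.get(lang, []):
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            start = match.end()
            break
    rest = text[start:]

    end = len(rest)
    for pattern in STOP_PATTERNS.get(lang, []):
        match = re.search(pattern, rest, flags=re.IGNORECASE)
        if match:
            end = min(end, match.start())

    # Trim whitespace and leftover punctuation around the list.
    return rest[:end].strip(" \n.,;:")


ocr_text = (
    "Zutaten: Zucker, Kakaobutter, Milchpulver. "
    "Ingrédients: sucre, beurre de cacao, lait en poudre."
)
german_only = cut_ingredients(ocr_text, "de")
```

With the French header listed as a German stop pattern, `german_only` keeps only the German list. Without that pattern, the French list would be appended to it.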
That's a great idea. I was hoping that Google Cloud Vision would give us enough data to see which text is in which language, but I tried in one example and it didn't work: https://images.openfoodfacts.org/images/products/304/514/010/5502/ingredients_fr.553.json
What about chaining this API after the OCR? https://cloud.google.com/translate/docs/advanced/detecting-language-v3 We could also add some spell checking.
Problem
I'm always frustrated when the OCR extracts ingredients in a language other than the one selected.
Proposed solution
Thanks to the language setting, the OCR already knows where to start the extraction. When it then hits the ingredients keyword in a different language, it should stop the extraction.
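A rough Python sketch of this proposal, under stated assumptions: the `HEADERS` table and the function `extract_for_language` are hypothetical and only illustrate the idea of starting at the selected language's header keyword and stopping at a header keyword of any other language. Keywords shared with the selected language are not used as stop words.

```python
import re

# Hypothetical header keywords per language (illustrative only).
HEADERS = {
    "de": ["zutaten"],
    "fr": ["ingrédients", "ingredients"],
    "en": ["ingredients"],
    "it": ["ingredienti"],
}


def _keyword_regex(keywords):
    # Longest keywords first, matched as whole words, case-insensitively.
    ordered = sorted(keywords, key=len, reverse=True)
    body = "|".join(re.escape(k) for k in ordered)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


def extract_for_language(text, lang):
    """Return the text between the selected language's header keyword and
    the next header keyword of another language, or None if no header."""
    own = set(HEADERS[lang])
    foreign = {
        kw for other, kws in HEADERS.items() if other != lang for kw in kws
    } - own  # a keyword shared with `lang` cannot act as a stop word

    start = _keyword_regex(own).search(text)
    if start is None:
        return None
    rest = text[start.end():]

    if foreign:
        stop = _keyword_regex(foreign).search(rest)
        if stop is not None:
            rest = rest[:stop.start()]
    return rest.strip(" \n.,;:")


label = (
    "INGREDIENTS: sugar, cocoa butter. "
    "ZUTATEN: Zucker, Kakaobutter. "
    "INGRÉDIENTS: sucre, beurre de cacao."
)
german = extract_for_language(label, "de")
english = extract_for_language(label, "en")
french = extract_for_language(label, "fr")
```

On this sample, `german` and `english` contain only their own lists. `french` ends up with the English list, because "ingredients" is listed as a header for both languages and the English header comes first. That is the shared-keyword problem raised in the comments.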
Time per product
4 seconds saved.
Video example of problem
trim.3634E555-DDBF-4A2D-B2A6-B78E57AB58E0.MOV
Part of
#9096