Only populate OCR results in selected language #11215

Open
github-throwaway opened this issue Jan 8, 2025 · 3 comments
Comments

@github-throwaway
Contributor

github-throwaway commented Jan 8, 2025

Problem

I'm always frustrated when the OCR extracts ingredients in a language other than the selected one.

Proposed solution

Thanks to the language setting, the OCR already knows where to start the extraction. When it then hits the ingredients keyword in a different language, it should stop the extraction.

Time per product

4 seconds saved.

Video example of problem

trim.3634E555-DDBF-4A2D-B2A6-B78E57AB58E0.MOV

Part of

#9096

@github-project-automation github-project-automation bot moved this to To discuss and validate in 🍊 Open Food Facts Server issues Jan 8, 2025
@github-throwaway github-throwaway changed the title from "Only return OCR results in selected language" to "Only populate OCR results in selected language" Jan 8, 2025
@benbenben2
Collaborator

The OCR extracts all the text. The text is then processed: using so-called stopwords, it is cut before and after the ingredients list.
In this case, everything before "Zutaten" (it can be other words) is removed, as is everything after a stopword expected at the end of the list (such as "keep in a dry place").

It does not work all the time. It depends on the known stopwords.

In this particular example, I guess "ingredients" is not a German word, so we could add it to the stopwords for the German language.
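The cutting step described above can be sketched roughly as follows. This is a minimal Python illustration, not the actual code: the real logic lives in Ingredients.pm (Perl), and the pattern tables `START_PATTERNS` and `END_PATTERNS` as well as the function `cut_ingredients` are hypothetical names with made-up example entries.

```python
import re

# Hypothetical stopword patterns, for illustration only. The real lists
# live in Ingredients.pm and are more extensive.
START_PATTERNS = {"de": r"zutaten\s*:?"}
END_PATTERNS = {"de": r"k(?:ü|u)hl und trocken lagern|mindestens haltbar bis"}


def cut_ingredients(text: str, lang: str) -> str:
    """Keep only the text between the start keyword and the first end stopword."""
    start = re.search(START_PATTERNS[lang], text, flags=re.IGNORECASE)
    if start:
        # Drop everything up to and including the start keyword.
        text = text[start.end():]
    end = re.search(END_PATTERNS[lang], text, flags=re.IGNORECASE)
    if end:
        # Drop the end stopword and everything after it.
        text = text[:end.start()]
    return text.strip()
```

If the text that follows the German list is a French list and no German end stopword appears in between, nothing is cut at the end, which is the situation reported in this issue.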

It is in this file: Ingredients.pm

(Maybe, just sharing some thoughts: we could also use every stopword that marks the start of an ingredients list as a stopword that marks its end. I don't know if it would work. It would need some investigation; for example, it would be problematic when the same word for "ingredients" is used in several languages. Ping @stephane, @aleene.)
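A rough Python sketch of that idea, assuming a small hypothetical table `START_KEYWORDS` of start keywords per language (the real patterns are in Ingredients.pm and will differ). A foreign keyword that the selected language's own pattern also matches is skipped, since it cannot be told apart from the selected language's keyword.

```python
import re

# Hypothetical start keywords per language, for illustration only.
START_KEYWORDS = {
    "de": r"zutaten",
    "fr": r"ingr(?:e|é)dients",
    "nl": r"ingredi(?:e|ë)nten",
    "en": r"ingredients",
}


def cut_at_foreign_keyword(ingredients: str, lang: str) -> str:
    """Cut the text at the first start keyword that belongs to another language."""
    own = re.compile(START_KEYWORDS[lang], re.IGNORECASE)
    foreign = "|".join(p for code, p in START_KEYWORDS.items() if code != lang)
    pattern = re.compile(r"\b(?P<kw>" + foreign + r")\b", re.IGNORECASE)
    for match in pattern.finditer(ingredients):
        # Ambiguous keyword: it is also valid in the selected language.
        if own.fullmatch(match.group("kw")):
            continue
        return ingredients[: match.start()].strip()
    return ingredients.strip()
```

For example, `cut_at_foreign_keyword("Zucker, Kakaobutter.\nIngrédients : sucre", "de")` keeps only the German part. Any text that sits between the end of the list and the foreign keyword, such as a product name in the other language, would still be kept, so the regular end stopwords remain necessary.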

At least for the particular example in your issue, @github-throwaway, you can add "Ingr(e|é)dients" to the German stopwords.

@stephanegigandet
Contributor

That's a great idea. I was hoping that Google Cloud Vision would give us enough data to see which text is in which language, but I tried in one example and it didn't work: https://images.openfoodfacts.org/images/products/304/514/010/5502/ingredients_fr.553.json

textAnnotations: [
  {
    locale: "fr",
    boundingPoly: {
      vertices: [
        {
          x: 64,
          y: 60
        },
        {
          y: 60,
          x: 1517
        },
        {
          x: 1517,
          y: 1817
        },
        {
          x: 64,
          y: 1817
        }
      ]
    },
    description: "BOChocolat au lait du pays alpin.
Ingrédients : Sucre, beurre de cacao, pâte
de cacao, LAIT écrémé en poudre,
lactosérum en poudre (de LAIT), BEURRE
concentré, émulsifiant (lecithines de
SOJA), pâte de NOISETTE, arome. Cacao:
33 % minimum. PEUT CONTENIR AUTRES
FRUITS À COQUE ET BLE.
CCD Melkchocolade (van Alpenmelk).
Ingrediënten: suiker, cacaoboter, cacaomassa,
magere MELKPOEDER, weipoeder
(van MELK), MELKVET, emulgator
(SOJALECITHINEN), HAZELNOOTPASTA,
aroma. Cacao: ten minste 33 %. KAN
ANDERE NOTEN EN TARWE BEVATTEN."
  },
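To make the limitation concrete, here is a small Python sketch that collects the `locale` values found in a list of annotations shaped like the excerpt above. The function name `reported_locales` and the shortened sample data are illustrative assumptions; only the `locale` and `description` fields shown in the excerpt are used.

```python
def reported_locales(text_annotations: list[dict]) -> set[str]:
    """Return the distinct locale values present in the annotations."""
    return {a["locale"] for a in text_annotations if a.get("locale")}


# Shortened stand-in for the excerpt above: one annotation tagged "fr"
# whose description contains both the French and the Dutch list.
annotations = [
    {
        "locale": "fr",
        "description": "Ingrédients : Sucre, beurre de cacao\n"
                       "Ingrediënten: suiker, cacaoboter",
    }
]

locales = reported_locales(annotations)  # only "fr" is reported
```

With a single locale for the whole text, this field alone cannot tell where the French list ends and the Dutch one begins.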

@github-throwaway
Contributor Author

What about chaining this API after the OCR?

https://cloud.google.com/translate/docs/advanced/detecting-language-v3

And add some spell checking?

https://hunspell.github.io/
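One possible shape for the chaining step, sketched in Python under loud assumptions: `keep_lines_in_language` is a hypothetical helper, and `detect` is a pluggable callable that a real setup would back with a language detection service. The `toy_detect` function below is only a stand-in so the example runs offline; it is not a real language identifier and does not call any API. Spell checking is not covered here.

```python
from typing import Callable


def keep_lines_in_language(text: str, lang: str, detect: Callable[[str], str]) -> str:
    """Keep only the non-empty lines whose detected language equals lang."""
    kept = [
        line
        for line in text.splitlines()
        if line.strip() and detect(line) == lang
    ]
    return "\n".join(kept)


def toy_detect(line: str) -> str:
    """Stand-in detector for the demo: NOT a real language identifier."""
    dutch_markers = ("van ", "ë", "melk", "ten minste")
    return "nl" if any(m in line.lower() for m in dutch_markers) else "fr"


ocr_text = "Ingrédients : Sucre, beurre de cacao\nIngrediënten: suiker, cacaoboter"
french_only = keep_lines_in_language(ocr_text, "fr", toy_detect)
```

Detection on short OCR lines (a few words, often in capitals) is likely to be unreliable, so grouping lines into larger blocks before detection may work better; that would need testing on real products.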
