Tutorial: Classifying Documents & Queries by Language - DocumentStore per language #7569
I just went through the tutorial "Tutorial: Classifying Documents & Queries by Language". In that example, separate English, Spanish and French DocumentStore objects are created. What if we did not know beforehand which languages we might run into, or we had, say, 12 languages to support? Is there a more efficient way to handle this language classification without creating so many objects, or without knowing or limiting the languages in advance?
Hi @greghobby, in that case the DocumentLanguageClassifier is still the component to use (https://docs.haystack.deepset.ai/docs/documentlanguageclassifier). It uses langdetect under the hood, which supports 55 languages. You can initialize the DocumentLanguageClassifier with as many of these languages as you want:
document_classifier = DocumentLanguageClassifier(languages = ["en", "de", ...])
The language of the classified documents will be stored in the metadata of the documents.
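Since the language ends up in each document's metadata, one way to avoid a DocumentStore object per language is to bucket the classified documents by that metadata value at runtime. The following is only a rough sketch of that idea, not Haystack code: the documents are plain dicts standing in for Haystack Document objects, the metadata key "language" is an assumption, and group_by_language is a hypothetical helper.

```python
from collections import defaultdict

# Stand-ins for already classified documents: plain dicts, not Haystack Document objects.
# Assumption: the classifier stored the detected code under meta["language"].
classified_docs = [
    {"content": "Hello, how are you?", "meta": {"language": "en"}},
    {"content": "Hallo, wie geht es dir?", "meta": {"language": "de"}},
    {"content": "Good morning", "meta": {"language": "en"}},
    {"content": "???", "meta": {}},  # no language recorded
]


def group_by_language(docs, key="language", fallback="unknown"):
    """Hypothetical helper: bucket documents by the language code in their metadata."""
    buckets = defaultdict(list)
    for doc in docs:
        # Documents without the metadata key land in the fallback bucket.
        buckets[doc["meta"].get(key, fallback)].append(doc)
    return dict(buckets)


docs_by_language = group_by_language(classified_docs)

# Buckets are created on demand, so no language list is hard-coded here.
for lang, docs in sorted(docs_by_language.items()):
    print(lang, len(docs))
```

With this approach a single store (or a single dict like the one above) can hold every language, and downstream code filters on the metadata value instead of picking a dedicated object.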