Tutorial: Classifying Documents & Queries by Language - DocumentStore per language #7569

greghobby · 2024-04-22T11:38:30Z

greghobby
Apr 22, 2024

I just went through the tutorial "Tutorial: Classifying Documents & Queries by Language". In that example, English, Spanish and French DocumentStore objects are created. What if we did not know beforehand which languages we might run into, or we had, say, 12 languages to support? Is there a more efficient way to handle this language classification without creating so many objects, and without having to know or limit in advance the languages that show up?

Answered by julian-risch

julian-risch · 2024-04-22T11:46:05Z

julian-risch
Apr 22, 2024
Maintainer

Hi @greghobby, in that case the DocumentLanguageClassifier is still the component to use: https://docs.haystack.deepset.ai/docs/documentlanguageclassifier
It uses langdetect under the hood, which supports 55 languages. You can initialize the DocumentLanguageClassifier with as many of these languages as you want:
document_classifier = DocumentLanguageClassifier(languages=["en", "de", ...])
The detected language of each classified document is stored in that document's metadata.
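
To illustrate why language stored in metadata removes the need for one object per language, here is a minimal, library-free sketch. It assumes each document already carries a language code in its metadata after classification. The `Doc` class, the `"language"` key, and the `"unmatched"` fallback label are assumptions made for this example, not confirmed details of the Haystack API.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a classified document; not a Haystack class."""
    content: str
    meta: dict = field(default_factory=dict)


def group_by_language(docs, key="language", fallback="unmatched"):
    """Bucket documents by the language code found in their metadata.

    The key name and fallback label are assumptions for illustration.
    """
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.meta.get(key, fallback)].append(doc)
    return dict(groups)


# Assumed state after classification: each document carries a language code.
docs = [
    Doc("Hello world", {"language": "en"}),
    Doc("Hallo Welt", {"language": "de"}),
    Doc("Good morning", {"language": "en"}),
    Doc("???"),  # no language recorded, so it lands in the fallback bucket
]

groups = group_by_language(docs)
for lang, items in groups.items():
    print(lang, [d.content for d in items])
```

With this shape, supporting a new language means accepting a new metadata value rather than creating another object, and a single store could be filtered on that field at query time.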

1 reply

greghobby Apr 22, 2024
Author

Great, thanks! The kind of approach I was looking for.
