No Language Left Behind Seed Data

NLLB-Seed is a set of professionally translated sentences in the Wikipedia domain. The data was sampled from Wikimedia’s List of articles every Wikipedia should have, a collection of topics in different fields of knowledge and human activity. NLLB-Seed consists of around six thousand sentences in 39 languages. It is meant to be used for training rather than model evaluation. For that reason, NLLB-Seed does not go through the human quality assurance process present in FLORES-200.


Download

⚠️ This repository is no longer being updated ⚠️

For newer versions of this dataset, see https://github.com/openlanguagedata/seed and https://www.oldi.org.

The original version of the dataset can still be downloaded here.

Languages in NLLB-Seed

| Language | FLORES-200 code |
| --- | --- |
| Acehnese (Arabic script) | ace_Arab |
| Acehnese (Latin script) | ace_Latn |
| Moroccan Arabic | ary_Arab |
| Egyptian Arabic | arz_Arab |
| Bambara | bam_Latn |
| Balinese | ban_Latn |
| Bhojpuri | bho_Deva |
| Banjar (Arabic script) | bjn_Arab |
| Banjar (Latin script) | bjn_Latn |
| Buginese | bug_Latn |
| Crimean Tatar | crh_Latn |
| Southwestern Dinka | dik_Latn |
| Dzongkha | dzo_Tibt |
| Friulian | fur_Latn |
| Nigerian Fulfulde | fuv_Latn |
| Guarani | grn_Latn |
| Chhattisgarhi | hne_Deva |
| Kashmiri (Arabic script) | kas_Arab |
| Kashmiri (Devanagari script) | kas_Deva |
| Central Kanuri (Arabic script) | knc_Arab |
| Central Kanuri (Latin script) | knc_Latn |
| Ligurian | lij_Latn |
| Limburgish | lim_Latn |
| Lombard | lmo_Latn |
| Latgalian | ltg_Latn |
| Magahi | mag_Deva |
| Meitei (Bengali script) | mni_Beng |
| Maori | mri_Latn |
| Nuer | nus_Latn |
| Dari | prs_Arab |
| Southern Pashto | pbt_Arab |
| Sicilian | scn_Latn |
| Shan | shn_Mymr |
| Sardinian | srd_Latn |
| Silesian | szl_Latn |
| Tamasheq (Latin script) | taq_Latn |
| Tamasheq (Tifinagh script) | taq_Tfng |
| Central Atlas Tamazight | tzm_Tfng |
| Venetian | vec_Latn |
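
Each code listed above pairs a three-letter language identifier with a four-letter script identifier, joined by an underscore. The following is a minimal sketch of how such codes might be split and grouped to find languages that appear in more than one script. The helper name `split_code` and the short `codes` list are illustrative assumptions, not part of the dataset or any official tooling.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_code(code: str) -> Tuple[str, str]:
    """Split a code such as 'ace_Arab' into (language, script)."""
    lang, script = code.split("_", 1)
    return lang, script


# A small sample of codes copied from the language list (not the full set).
codes = ["ace_Arab", "ace_Latn", "bho_Deva", "taq_Latn", "taq_Tfng"]

# Group the script identifiers by language identifier.
by_language: Dict[str, List[str]] = defaultdict(list)
for code in codes:
    lang, script = split_code(code)
    by_language[lang].append(script)

# Keep only the languages that occur with more than one script.
multi_script = {
    lang: scripts for lang, scripts in by_language.items() if len(scripts) > 1
}
print(multi_script)
```

Grouping by the language part can be useful when the same sentences are needed in every available script for one language.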