NLLB-Seed is a set of professionally translated sentences in the Wikipedia domain. The data was sampled from Wikimedia’s List of articles every Wikipedia should have, a collection of topics in different fields of knowledge and human activity. NLLB-Seed consists of around six thousand sentences in 39 languages. It is meant to be used for training rather than model evaluation. Because of this difference in purpose, NLLB-Seed has not gone through the human quality assurance process applied to FLORES-200.
For newer versions of this dataset, see https://github.com/openlanguagedata/seed and https://www.oldi.org.
The original version of the dataset can still be downloaded here.
Language | FLORES-200 code |
---|---|
Acehnese (Arabic script) | ace_Arab |
Acehnese (Latin script) | ace_Latn |
Moroccan Arabic | ary_Arab |
Egyptian Arabic | arz_Arab |
Bambara | bam_Latn |
Balinese | ban_Latn |
Bhojpuri | bho_Deva |
Banjar (Arabic script) | bjn_Arab |
Banjar (Latin script) | bjn_Latn |
Buginese | bug_Latn |
Crimean Tatar | crh_Latn |
Southwestern Dinka | dik_Latn |
Dzongkha | dzo_Tibt |
Friulian | fur_Latn |
Nigerian Fulfulde | fuv_Latn |
Guarani | grn_Latn |
Chhattisgarhi | hne_Deva |
Kashmiri (Arabic script) | kas_Arab |
Kashmiri (Devanagari script) | kas_Deva |
Central Kanuri (Arabic script) | knc_Arab |
Central Kanuri (Latin script) | knc_Latn |
Ligurian | lij_Latn |
Limburgish | lim_Latn |
Lombard | lmo_Latn |
Latgalian | ltg_Latn |
Magahi | mag_Deva |
Meitei (Bengali script) | mni_Beng |
Maori | mri_Latn |
Nuer | nus_Latn |
Dari | prs_Arab |
Southern Pashto | pbt_Arab |
Sicilian | scn_Latn |
Shan | shn_Mymr |
Sardinian | srd_Latn |
Silesian | szl_Latn |
Tamasheq (Latin script) | taq_Latn |
Tamasheq (Tifinagh script) | taq_Tfng |
Central Atlas Tamazight | tzm_Tfng |
Venetian | vec_Latn |
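
The codes in the table appear to follow a `language_Script` pattern: a three-letter language code, an underscore, and a four-letter script code. The following minimal sketch, written under that assumption, splits a code into its two parts and groups scripts by language. The helper names `split_code` and `scripts_by_language` are hypothetical and not part of any NLLB tooling, and the `CODES` list holds only a few entries copied from the table.

```python
from collections import defaultdict

# A few codes copied from the table above (not the full list).
CODES = ["ace_Arab", "ace_Latn", "kas_Arab", "kas_Deva", "taq_Latn", "taq_Tfng", "dzo_Tibt"]


def split_code(code: str) -> tuple[str, str]:
    """Split a code such as 'ace_Arab' into ('ace', 'Arab').

    Assumes the 3-letter language + '_' + 4-letter script pattern
    seen in the table; raises ValueError for anything else.
    """
    lang, sep, script = code.partition("_")
    if not sep or len(lang) != 3 or len(script) != 4:
        raise ValueError(f"unexpected code format: {code!r}")
    return lang, script


def scripts_by_language(codes):
    """Map each language code to the list of scripts it appears in."""
    grouped = defaultdict(list)
    for code in codes:
        lang, script = split_code(code)
        grouped[lang].append(script)
    return dict(grouped)


if __name__ == "__main__":
    for lang, scripts in scripts_by_language(CODES).items():
        print(lang, scripts)
```

Grouping in this way makes it easy to see which languages are provided in more than one script, for example Acehnese, Kashmiri, and Tamasheq.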