Skip to content

Latest commit

 

History

History
99 lines (65 loc) · 10 KB

README.md

File metadata and controls

99 lines (65 loc) · 10 KB

SEACrowd Logo

SEACrowd Experiments

Experiment repository for the "SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages" paper.

SEACrowd is a collaborative initiative that consolidates a comprehensive resource hub that fills the resource gap by providing standardized corpora in nearly 1,000 Southeast Asian (SEA) languages across three modalities.

Note: All code in SEACrowd is publicly available under the Apache 2.0 license.

Quick Start

URL Description
Paper Our "SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages" paper on Arxiv
Landing Page Introduction to SEACrowd
SEACrowd Catalogue (web/csv) Centralized publicly available datasheets
SEACrowd Data Hub (github/pip) Standardized dataloaders & schema library
SEACrowd Experiments Experiment repository for SEACrowd NLP, VL, & speech benchmarks, translationese vs. naturalness assessment, language equity, language prioritization, etc.
HuggingFace Collection Our fine-tuned translationese classifier & train/test data

SEACrowd Benchmarks: State-of-the-Art Models on SEA Languages

Placed under evaluation/.

Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.

To run a model on our benchmark, simply take add the relevant model to our evaluation script(s):

The evaluation run will create a log in evaluation/outputs_*/ directory to record the model predictions/generations, and finally a summary of the performance measures per data subset in evaluation/metrics_*/.

Note: If you want to modify the prompt templates, check out evaluation/prompt_utils.py. If you want to modify the data subsets being used in the benchmark, check out evaluation/data_utils.py.

Evaluation Results

LLMs

NLP Performance

Speech Models

Speech Performance

VLMs

VL Performance

Generation Quality in SEA Languages: Translationese vs. Natural

Placed under translationese/.

To analyze the generation quality of LLMs in SEA languages, we build a text classifier to discriminate between translationese and natural texts. We construct a translationese classification training and testing dataset using 49 and 62 data subsets, respectively, covering approximately 39.9k and 51.5k sentences across 9 SEA languages: English (eng), Indonesian (ind), Khmer (khm), Lao (lao), Burmese (mya), Filipino (fil), Thai (tha), Vietnamese (vie), and Malay (zlm).

Our translationese vs. natural train/test data is available on HuggingFace!

To fine-tune the translationese classifier, execute translationese/run.sh. We use a binary label (translationese, i.e., machine-translated or human-translated, or natural, i.e., human-generated) instead of 3 labels (machine-translated, human-translated, human-generated).

Our fine-tuned mDEBERTA SEA translationese classifier is available on HuggingFace!

To obtain the model generations for the naturalness analysis, execute translationese/run_nlg_prompt.sh for open-source models and translationese/run_nlg_prompt_commercial.sh for commercial models.

To classify the model generations, check out translationese/analyze_translationese.ipynb.

Naturalness Result

Naturalness Result

Additional Analysis: Language Equity, Resource Gaps, and Language Prioritization

The notebooks/viz_metrics.ipynb consists of the visualization of all benchmark results in figures/ and notebooks/analysis/, as well as language equity results in terms of Gini coefficient in notebooks/analysis/.g

The notebooks/viz_resources.ipynb consists of the visualization of resource gaps analysis in figures/ and notebooks/analysis/.

The globalutily/run_make_task_boxes.sh consists of the visualization of SEA language prioritization analysis. This submodule is derived from neubig/globalutility.

Citation

If you are using any resources from SEACrowd, including datasheets, dataloaders, code, etc., please cite the following publication:

@article{lovenia2024seacrowd,
      title={SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages}, 
      author={Holy Lovenia and Rahmad Mahendra and Salsabil Maulana Akbar and Lester James V. Miranda and Jennifer Santoso and Elyanah Aco and Akhdan Fadhilah and Jonibek Mansurov and Joseph Marvin Imperial and Onno P. Kampman and Joel Ruben Antony Moniz and Muhammad Ravi Shulthan Habibi and Frederikus Hudi and Railey Montalan and Ryan Ignatius and Joanito Agili Lopo and William Nixon and Börje F. Karlsson and James Jaya and Ryandito Diandaru and Yuze Gao and Patrick Amadeus and Bin Wang and Jan Christian Blaise Cruz and Chenxi Whitehouse and Ivan Halim Parmonangan and Maria Khelli and Wenyu Zhang and Lucky Susanto and Reynard Adha Ryanda and Sonny Lazuardi Hermawan and Dan John Velasco and Muhammad Dehan Al Kautsar and Willy Fitra Hendria and Yasmin Moslem and Noah Flynn and Muhammad Farid Adilazuarda and Haochen Li and Johanes Lee and R. Damanhuri and Shuo Sun and Muhammad Reza Qorib and Amirbek Djanibekov and Wei Qi Leong and Quyet V. Do and Niklas Muennighoff and Tanrada Pansuwan and Ilham Firdausi Putra and Yan Xu and Ngee Chia Tai and Ayu Purwarianti and Sebastian Ruder and William Tjhi and Peerat Limkonchotiwat and Alham Fikri Aji and Sedrick Keh and Genta Indra Winata and Ruochen Zhang and Fajri Koto and Zheng-Xin Yong and Samuel Cahyawijaya},
      year={2024},
      eprint={2406.10118},
      journal={arXiv preprint arXiv: 2406.10118}
}