This repository contains the code and resources for our paper "Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai", published at ACL SRW 2024. Our work presents a novel approach to generating synthetic instruction-tuning data for low-resource languages, with a specific focus on Thai.
- Seed-free framework for generating synthetic instruction-tuning data
- Incorporation of three key properties: fluency, diversity, and cultural context
- Data-efficient approach achieving competitive results with only 5,000 instructions
- Comprehensive evaluation across multiple models, datasets, and tasks
- Clone this repository:

  ```bash
  git clone https://github.com/parinzee/seed-free-synthetic-instruct.git
  cd seed-free-synthetic-instruct
  ```
- Create a virtual environment and activate it:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  ```
- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
- Firstly, make a copy of `example-settings.toml` and configure the models (OpenAI, Claude, vLLM, Groq, etc.). A minimal sketch for sanity-checking your edited settings file appears after this list.
- Configure the generation language in the settings.
- Generate data!

  ```bash
  python3 -m clsit.runner --generate /path/to/yaml/config
  ```
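Before launching generation, it can help to confirm that your edited settings file actually parses. The snippet below is a minimal sketch, not part of the repo: it assumes Python 3.11+ for the standard-library `tomllib`, and the `settings.toml` filename is only an example, so point it at your own copy.

```python
# Illustrative sanity check (not part of this repo): confirm the settings file
# parses before running generation. Requires Python 3.11+ for tomllib.
import tomllib
from pathlib import Path

settings_path = Path("settings.toml")  # path to your copy of example-settings.toml

with settings_path.open("rb") as f:
    settings = tomllib.load(f)

# Print the top-level sections so you can confirm the model and language
# options you edited are actually present.
for section, value in settings.items():
    print(f"{section}: {value}")
```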
To get a clean jsonl file ready to be trained with axolotl:

```bash
python3 -m clsit.runner --clean /path/to/yaml/config
python3 -m clsit.runner --export /path/to/yaml/config
```

The jsonl files will be visible in your configured output directory under:

- `train_data.jsonl`
- `val_data.jsonl`
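Before handing the files to axolotl, it can be useful to peek at what was exported. The sketch below is illustrative only; the `output/` directory and the exact field names depend on your configuration, so adjust the path accordingly.

```python
# Illustrative check of an exported file (field names depend on your
# configured export format for axolotl).
import json
from pathlib import Path

train_path = Path("output/train_data.jsonl")  # adjust to your output directory

with train_path.open(encoding="utf-8") as f:
    first = json.loads(next(f))     # parse the first record
    remaining = sum(1 for _ in f)   # count the rest of the lines

print("fields:", sorted(first.keys()))
print("total records:", remaining + 1)
```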
Please see our axolotl configurations for examples of how to use these files for training.
- Use vLLM to host your finetuned model (a sketch for querying the served model appears after this list).
- Run prediction:

  ```bash
  cd eval/
  python3 eval_vllm.py --model-name SERVED_VLLM_MODEL_NAME --few-shot 0
  ```

- Calculate scores:

  ```bash
  python3 calculate_scores.py .
  ```
- Visualize scores:
  - First, please edit `eval/visualize_results.py` and put your model name in the model-name dictionary.
  - Then run:

    ```bash
    python3 visualize_results.py
    ```
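The evaluation assumes your finetuned model is already being served by vLLM's OpenAI-compatible server (for example via `vllm serve YOUR_MODEL` or `python -m vllm.entrypoints.openai.api_server --model YOUR_MODEL`, depending on your vLLM version). The sketch below is not part of this repo; it just shows one way to confirm the served model responds before running `eval_vllm.py`. The base URL, port, API key, and model name are assumptions; match them to your own server.

```python
# Illustrative sketch (not part of this repo): query a finetuned model hosted
# by vLLM's OpenAI-compatible server. The base URL, port, api_key, and model
# name are assumptions; match them to how you started the server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="SERVED_VLLM_MODEL_NAME",  # same name you pass to eval_vllm.py
    messages=[{"role": "user", "content": "จังหวัดเชียงใหม่อยู่ภาคไหนของประเทศไทย"}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```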
- Dataset: We release our best-performing dataset and make it publicly accessible as a Hugging Face dataset.
- Model: We release our best-performing model in a Hugging Face repository.
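If you prefer to load the released dataset programmatically, a minimal sketch with the Hugging Face `datasets` library is shown below; the repository ID is a placeholder, so substitute the actual ID of the released dataset.

```python
# Illustrative only: load the released dataset with the Hugging Face
# `datasets` library. Replace the placeholder repository ID with the real one.
from datasets import load_dataset

dataset = load_dataset("HF_USERNAME/DATASET_NAME")  # placeholder repo ID
print(dataset)
```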
Our best-performing synthetic dataset (F+ C+ D+) achieved competitive results compared to state-of-the-art Thai LLMs, using only 5,000 instructions. Key findings include:
- Comparable performance to WangchanX and OpenThaiGPT
- Second-highest BERTScore on both Thai Culture and General Test Sets
- Significant improvement over baseline models lacking key properties
For detailed results and analysis, please refer to the paper and the `results/` directory.
```bibtex
@inproceedings{pengpun-etal-2024-seed,
title = "Seed-Free Synthetic Data Generation Framework for Instruction-Tuning {LLM}s: A Case Study in {T}hai",
author = "Pengpun, Parinthapat and
Udomcharoenchaikit, Can and
Buaphet, Weerayut and
Limkonchotiwat, Peerat",
editor = "Fu, Xiyan and
Fleisig, Eve",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-srw.38",
doi = "10.18653/v1/2024.acl-srw.38",
pages = "438--457",
abstract = "We present a synthetic data approach for instruction-tuning large language models (LLMs) for low-resource languages in a data-efficient manner, specifically focusing on Thai. We identify three key properties that contribute to the effectiveness of instruction-tuning datasets: fluency, diversity, and cultural context. We propose a seed-data-free framework for generating synthetic instruction-tuning data that incorporates these essential properties. Our framework employs an LLM to generate diverse topics, retrieve relevant contexts from Wikipedia, and create instructions for various tasks, such as question answering, summarization, and conversation. The experimental results show that our best-performing synthetic dataset, which incorporates all three key properties, achieves competitive performance using only 5,000 instructions when compared to state-of-the-art Thai LLMs trained on hundreds of thousands of instructions. Our code and dataset are publicly available at https://github.com/parinzee/seed-free-synthetic-instruct.",
}
```
This project is licensed under the MIT License.
We extend our sincere gratitude to Potsawee Manakul for his invaluable assistance during the early stages of this project.
This research has received funding support from the NSRF via the Program Management Unit for Human Resources & Institutional Development, Research and Innovation Grant Number B46G670083.
For any questions or concerns, please open an issue in this repository.