This repository contains the code and resources for our paper "Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai", published at ACL SRW 2024. Our work presents a novel approach to generating synthetic instruction-tuning data for low-resource languages, with a specific focus on Thai.
- Seed-free framework for generating synthetic instruction-tuning data
- Incorporation of three key properties: fluency, diversity, and cultural context
- Data-efficient approach achieving competitive results with only 5,000 instructions
- Comprehensive evaluation across multiple models, datasets, and tasks
- Clone this repository:

  ```bash
  git clone https://github.com/parinzee/seed-free-synthetic-instruct.git
  cd seed-free-synthetic-instruct
  ```
- Create a virtual environment and activate it:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  ```
- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
- Firstly, make a copy of `example-settings.toml` and configure the models (OpenAI, Claude, vLLM, Groq, etc.). A minimal sketch for sanity-checking your edited settings file appears after this list.
- Configure the generation language in the settings.
- Generate data!

  ```bash
  python3 -m clsit.runner --generate /path/to/yaml/config
  ```
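Before launching generation, it can help to confirm that your edited settings file actually parses. The snippet below is a minimal sketch, not part of the repo: it assumes Python 3.11+ for the standard-library `tomllib`, and the `settings.toml` filename is only an example, so point it at your own copy.

```python
# Illustrative sanity check (not part of this repo): confirm the settings file
# parses before running generation. Requires Python 3.11+ for tomllib.
import tomllib
from pathlib import Path

settings_path = Path("settings.toml")  # path to your copy of example-settings.toml

with settings_path.open("rb") as f:
    settings = tomllib.load(f)

# Print the top-level sections so you can confirm the model and language
# options you edited are actually present.
for section, value in settings.items():
    print(f"{section}: {value}")
```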
To get a clean jsonl file ready to be trained with axolotl:

```bash
python3 -m clsit.runner --clean /path/to/yaml/config
python3 -m clsit.runner --export /path/to/yaml/config
```

The jsonl files will be visible in your configured output directory under:

- `train_data.jsonl`
- `val_data.jsonl`
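Before handing the files to axolotl, it can be useful to peek at what was exported. The sketch below is illustrative only; the `output/` directory and the exact field names depend on your configuration, so adjust the path accordingly.

```python
# Illustrative check of an exported file (field names depend on your
# configured export format for axolotl).
import json
from pathlib import Path

train_path = Path("output/train_data.jsonl")  # adjust to your output directory

with train_path.open(encoding="utf-8") as f:
    first = json.loads(next(f))     # parse the first record
    remaining = sum(1 for _ in f)   # count the rest of the lines

print("fields:", sorted(first.keys()))
print("total records:", remaining + 1)
```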
Please see our axolotl configurations for examples of how to use these files for training.
- Use vLLM to host your finetuned model (a sketch for querying the served model appears after this list).
- Run prediction:

  ```bash
  cd eval/
  python3 eval_vllm.py --model-name SERVED_VLLM_MODEL_NAME --few-shot 0
  ```

- Calculate scores:

  ```bash
  python3 calculate_scores.py .
  ```
- Visualize scores:
  - First, please edit `eval/visualize_results.py` and put your model name in the model-name dictionary.
  - Then run:

    ```bash
    python3 visualize_results.py
    ```
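The evaluation assumes your finetuned model is already being served by vLLM's OpenAI-compatible server (for example via `vllm serve YOUR_MODEL` or `python -m vllm.entrypoints.openai.api_server --model YOUR_MODEL`, depending on your vLLM version). The sketch below is not part of this repo; it just shows one way to confirm the served model responds before running `eval_vllm.py`. The base URL, port, API key, and model name are assumptions; match them to your own server.

```python
# Illustrative sketch (not part of this repo): query a finetuned model hosted
# by vLLM's OpenAI-compatible server. The base URL, port, api_key, and model
# name are assumptions; match them to how you started the server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="SERVED_VLLM_MODEL_NAME",  # same name you pass to eval_vllm.py
    messages=[{"role": "user", "content": "จังหวัดเชียงใหม่อยู่ภาคไหนของประเทศไทย"}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```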
- Dataset: We release our best-performing dataset and make it publicly accessible as a Hugging Face dataset.
- Model: We release our best-performing model in a Hugging Face repository.
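If you prefer to load the released dataset programmatically, a minimal sketch with the Hugging Face `datasets` library is shown below; the repository ID is a placeholder, so substitute the actual ID of the released dataset.

```python
# Illustrative only: load the released dataset with the Hugging Face
# `datasets` library. Replace the placeholder repository ID with the real one.
from datasets import load_dataset

dataset = load_dataset("HF_USERNAME/DATASET_NAME")  # placeholder repo ID
print(dataset)
```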
Our best-performing synthetic dataset (F+ C+ D+) achieved competitive results compared to state-of-the-art Thai LLMs, using only 5,000 instructions. Key findings include:
- Comparable performance to WangchanX and OpenThaiGPT
- Second-highest BERTScore on both Thai Culture and General Test Sets
- Significant improvement over baseline models lacking key properties
For detailed results and analysis, please refer to the paper and the `results/` directory.
```bibtex
@inproceedings{pengpun-etal-2024-seed,
title = "Seed-Free Synthetic Data Generation Framework for Instruction-Tuning {LLM}s: A Case Study in {T}hai",
author = "Pengpun, Parinthapat and
Udomcharoenchaikit, Can and
Buaphet, Weerayut and
Limkonchotiwat, Peerat",
editor = "Fu, Xiyan and
Fleisig, Eve",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-srw.38",
doi = "10.18653/v1/2024.acl-srw.38",
pages = "438--457",
abstract = "We present a synthetic data approach for instruction-tuning large language models (LLMs) for low-resource languages in a data-efficient manner, specifically focusing on Thai. We identify three key properties that contribute to the effectiveness of instruction-tuning datasets: fluency, diversity, and cultural context. We propose a seed-data-free framework for generating synthetic instruction-tuning data that incorporates these essential properties. Our framework employs an LLM to generate diverse topics, retrieve relevant contexts from Wikipedia, and create instructions for various tasks, such as question answering, summarization, and conversation. The experimental results show that our best-performing synthetic dataset, which incorporates all three key properties, achieves competitive performance using only 5,000 instructions when compared to state-of-the-art Thai LLMs trained on hundreds of thousands of instructions. Our code and dataset are publicly available at https://github.com/parinzee/seed-free-synthetic-instruct.",
}
```
This project is licensed under the MIT License.
We extend our sincere gratitude to Potsawee Manakul for his invaluable assistance during the early stages of this project.
This research has received funding support from the NSRF via the Program Management Unit for Human Resources & Institutional Development, Research and Innovation Grant Number B46G670083.
For any questions or concerns, please open an issue in this repository.