Pyper is a framework for generating high-quality synthetic data for LLM instruction tuning. It combines several research approaches to produce diverse training datasets, and supports three generation modes:
- General Seed Generation: Generate seed data across various academic disciplines
- Knowledge-Driven Generation: Create datasets based on specific domain knowledge by supplying ground truth information
- Data Fission: Expand small seed datasets through iterative generation with an LLM (a minimal sketch of the idea follows this list)
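To make the fission idea concrete, here is a minimal sketch of such a loop. It is an illustration, not Pyper's actual implementation: the prompt wording, model name, and task schema are all assumptions.

import json
import random

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm(prompt: str) -> str:
    # Plain chat-completion call; the model choice is an arbitrary example.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def fission(seed_tasks: list[dict], num_tasks: int, k: int = 3) -> list[dict]:
    pool = list(seed_tasks)
    while len(pool) < num_tasks:
        # Sample a few existing tasks as in-context examples...
        examples = random.sample(pool, min(k, len(pool)))
        prompt = (
            "Here are some instruction-tuning tasks:\n"
            + "\n".join(json.dumps(t) for t in examples)
            + "\nWrite one new, distinct task as a JSON object with the same fields."
        )
        # ...and ask the model to produce one more in the same format.
        try:
            pool.append(json.loads(call_llm(prompt)))
        except json.JSONDecodeError:
            continue  # skip malformed generations
    return pool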
Pyper uses Poetry for dependency management. To install:
# Install poetry
curl -sSL https://install.python-poetry.org | python3 -
# Clone the repository
git clone https://github.com/yourusername/pyper.git
cd pyper
# Install dependencies
poetry install
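Poetry installs the dependencies into a managed virtual environment, so the commands in the sections below can be run through it, for example:

# Verify the installation and list available commands
poetry run python run.py --help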
First, set your OpenAI API key:
export OPENAI_API_KEY=your-openai-api-key
Generate seed data for a specific academic discipline:
python run.py generate --mode general --discipline mathematics --num-tasks 50
Example disciplines include mathematics, physics, and chemistry; use --help to see all options.
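The exact output schema depends on the implementation, but pipelines in this family (Self-Instruct, Alpaca) typically store each generated task as an instruction/input/output record, along these lines:

{
  "instruction": "Prove that the sum of two even integers is even.",
  "input": "",
  "output": "Write a = 2m and b = 2n; then a + b = 2(m + n), which is even."
}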
Generate data using specific domain knowledge:
python run.py generate --mode knowledge \
--knowledge-path ./path/to/knowledge.txt \
--num-tasks 50
The knowledge file should contain the ground-truth information as plain text.
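For example, a knowledge file for a chemistry dataset might contain (contents illustrative):

Le Chatelier's principle: when a system at equilibrium is disturbed, the
equilibrium shifts to counteract the disturbance. Increasing the pressure
of a gaseous system shifts it toward the side with fewer moles of gas.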
Expand an existing seed dataset:
python run.py fission \
--num-tasks 50 \
--seed-path path/to/seed
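The seed file format isn't specified here; Self-Instruct-style pipelines usually expect one JSON task per line, with fields like those below (the field names are an assumption):

{"instruction": "Simplify the fraction 12/18.", "input": "", "output": "12/18 = 2/3."}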
To see all available options, use the --help flag.
Pyper's implementation is based on several key research papers:
@article{selfinstruct,
  title={Self-Instruct: Aligning Language Models with Self-Generated Instructions},
  author={Wang, Yizhong and Kordi, Yeganeh and Mishra, Swaroop and Liu, Alisa and Smith, Noah A. and Khashabi, Daniel and Hajishirzi, Hannaneh},
  journal={arXiv preprint arXiv:2212.10560},
  year={2022}
}
@misc{alpaca,
  title={Stanford Alpaca: An Instruction-following LLaMA model},
  author={Taori, Rohan and Gulrajani, Ishaan and Zhang, Tianyi and Dubois, Yann and Li, Xuechen and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B.},
  year={2023},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/tatsu-lab/stanford_alpaca}}
}
@article{luo2023wizardmath,
  title={WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct},
  author={Luo, Haipeng and Sun, Qingfeng and Xu, Can and Zhao, Pu and Lou, Jianguang and Tao, Chongyang and Geng, Xiubo and Lin, Qingwei and Chen, Shifeng and Zhang, Dongmei},
  journal={arXiv preprint arXiv:2308.09583},
  year={2023}
}