data-simulator
Data Simulator is a lightweight Python library for generating synthetic datasets from your own corpus, suited to testing, evaluating, or fine-tuning LLM applications.
Real documents contain a mix of useful and irrelevant content. When generating synthetic data, this leads to:
- Queries that real users would never ask
- Test sets that don't reflect actual usage
- Wasted effort optimizing for the wrong things
Data Simulator filters out low-quality content first, then generates realistic queries and answers that match how your system will actually be used.
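The filter-then-generate order described above can be sketched in a few lines. This is only an illustration of the control flow, not the library's implementation: `simulate`, `Example`, `mentions_revenue`, and `revenue_pair` are hypothetical names, and a keyword check stands in for whatever model-backed relevance and generation steps the library actually performs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Example:
    document: str
    query: str
    answer: str


def simulate(
    chunks: Iterable[str],
    is_relevant: Callable[[str], bool],
    make_pair: Callable[[str], tuple[str, str]],
) -> list[Example]:
    """Drop irrelevant chunks, then build one query/answer pair per kept chunk."""
    examples = []
    for chunk in chunks:
        if not is_relevant(chunk):
            continue  # filtered out before any generation happens
        query, answer = make_pair(chunk)
        examples.append(Example(document=chunk, query=query, answer=answer))
    return examples


# Toy stand-ins for the two steps (assumptions for illustration only).
def mentions_revenue(chunk: str) -> bool:
    return "revenue" in chunk.lower()


def revenue_pair(chunk: str) -> tuple[str, str]:
    return ("What does the filing say about revenue?", chunk)


chunks = [
    "Table of contents",
    "Nike reported annual revenue of $44.5 billion for fiscal year 2022.",
]
examples = simulate(chunks, mentions_revenue, revenue_pair)
print(len(examples))
```

Because the filter runs first, the "Table of contents" chunk never reaches the generation step, which is the point of the design: no queries are produced for content users would not ask about.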
Install from PyPI:
pip install llm-data-simulator
Or install it locally:
git clone https://github.com/langwatch/data-simulator.git
cd data-simulator
pip install -e .
Run the built-in test script:
python test.py
Example usage:

import os

from dotenv import load_dotenv

from data_simulator import DataSimulator
from data_simulator.utils import display_results

# Load environment variables (including OPENAI_API_KEY) from a .env file, if present
load_dotenv()

generator = DataSimulator(api_key=os.getenv("OPENAI_API_KEY"))

# Generate query/answer pairs from the document, guided by the context and example queries
results = generator.generate_from_docs(
    file_paths=["test_data/nike_10k.pdf"],
    context="You're a financial support assistant for Nike, helping a financial analyst decide whether to invest in the stock.",
    example_queries="how much revenue did nike make last year\nwhat risks does nike face\nwhat are nike's top 3 priorities",
)

display_results(results)
Example of a generated record:

{
  "id": "chunk_42",
  "document": "Nike reported annual revenue of $44.5 billion for fiscal year 2022, an increase of 5% compared to the previous year.",
  "query": "What was Nike's revenue growth in 2022?",
  "answer": "Nike's revenue grew by 5% in fiscal year 2022, reaching $44.5 billion."
}
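Records like the one above are easy to persist for later evaluation runs. The sketch below assumes the generated data is available as a list of dictionaries with the keys `id`, `document`, `query`, and `answer`; the README does not state the return type of `generate_from_docs`, so treat that shape as an assumption. `save_jsonl` and `load_jsonl` are hypothetical helpers, not part of the library.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"id", "document", "query", "answer"}


def save_jsonl(records, path):
    """Write records to `path`, one JSON object per line; return the count written."""
    written = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"record missing keys: {sorted(missing)}")
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written


def load_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Sample record copied from the example above.
sample = [
    {
        "id": "chunk_42",
        "document": "Nike reported annual revenue of $44.5 billion for fiscal year 2022, an increase of 5% compared to the previous year.",
        "query": "What was Nike's revenue growth in 2022?",
        "answer": "Nike's revenue grew by 5% in fiscal year 2022, reaching $44.5 billion.",
    }
]

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "synthetic.jsonl"
    count = save_jsonl(sample, out)
    reloaded = load_jsonl(out)
```

One record per line keeps the file appendable and easy to stream, which helps when a test set is regenerated in batches.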
The project follows a modular, object-oriented design:
- simulator.py: Contains the main DataSimulator class that orchestrates the data generation process
- llm.py: Houses the LLMProcessor class that handles all LLM-related operations
- document_processor.py: Provides the DocumentProcessor class for loading and chunking documents
- prompts.py: Stores all prompt templates used for LLM interactions
- utils.py: Contains utility functions like display_results for formatting output
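The module descriptions above mention chunking without saying how it is done. As a rough illustration of the idea only, the sketch below splits text into fixed-size word windows with overlap, one common approach. `chunk_text` and its parameters are hypothetical and are not taken from the library; its DocumentProcessor may chunk differently.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of up to `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_text("a b c d e f g h i j", chunk_size=4, overlap=1)
print(chunks)
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of some duplicated text.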
MIT License