Data Simulator

data-simulator is a lightweight Python library for generating synthetic datasets from your own corpus, for use in testing, evaluating, or fine-tuning LLM applications.

Motivation

Real documents contain a mix of useful and irrelevant content. When generating synthetic data, this leads to:

Queries that real users would never ask
Test sets that don't reflect actual usage
Wasted effort optimizing for the wrong things

Data Simulator filters out low-quality content first, then generates realistic queries and answers that match how your system will actually be used.

Getting Started

Install from PyPI:

pip install llm-data-simulator

Or clone the repository and install it in editable mode:

git clone https://github.com/langwatch/data-simulator.git
cd data-simulator
pip install -e .

Run the bundled test script from the repository root (it reads OPENAI_API_KEY from the environment or a .env file):

python test.py

Example test.py

import os

from dotenv import load_dotenv

from data_simulator import DataSimulator
from data_simulator.utils import display_results

# Load OPENAI_API_KEY from a local .env file, if one exists.
load_dotenv()

generator = DataSimulator(api_key=os.getenv("OPENAI_API_KEY"))

# context describes the assistant's role; example_queries is a single
# newline-separated string of sample user questions.
results = generator.generate_from_docs(
    file_paths=["test_data/nike_10k.pdf"],
    context="You're a financial support assistant for Nike, helping a financial analyst decide whether to invest in the stock.",
    example_queries="how much revenue did nike make last year\nwhat risks does nike face\nwhat are nike's top 3 priorities",
)

# Print the generated records.
display_results(results)

Output Format

{
  "id": "chunk_42",
  "document": "Nike reported annual revenue of $44.5 billion for fiscal year 2022, an increase of 5% compared to the previous year.",
  "query": "What was Nike's revenue growth in 2022?",
  "answer": "Nike's revenue grew by 5% in fiscal year 2022, reaching $44.5 billion."
}
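
As a small illustration of working with records of this shape, the sketch below checks that a record carries the four fields shown above and writes records to a JSON Lines file. It assumes the results are available as a list of dictionaries, which the example above does not confirm. The helpers `validate_record` and `write_jsonl` and the file name `synthetic.jsonl` are hypothetical and not part of the library.

```python
import json
import tempfile
from pathlib import Path

# Field names taken from the example record above.
REQUIRED_FIELDS = ("id", "document", "query", "answer")


def validate_record(record: dict) -> dict:
    """Return the record unchanged, or raise if a required field is missing or empty."""
    missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    return record


def write_jsonl(records: list[dict], path: str) -> int:
    """Write one JSON object per line and return the number of records written."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(validate_record(record), ensure_ascii=False) + "\n")
    return len(records)


# Hypothetical usage with the record from the example above.
sample = {
    "id": "chunk_42",
    "document": "Nike reported annual revenue of $44.5 billion for fiscal year 2022, "
                "an increase of 5% compared to the previous year.",
    "query": "What was Nike's revenue growth in 2022?",
    "answer": "Nike's revenue grew by 5% in fiscal year 2022, reaching $44.5 billion.",
}

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "synthetic.jsonl"
    written = write_jsonl([sample], str(out_path))
    # Read the file back to confirm the records survive a round trip.
    round_trip = [
        json.loads(line)
        for line in out_path.read_text(encoding="utf-8").splitlines()
    ]
```

Storing one record per line keeps the file easy to append to and to stream into an evaluation or fine-tuning job.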

Project Structure

The project follows a modular, object-oriented design:

simulator.py: Contains the main DataSimulator class that orchestrates the data generation process
llm.py: Houses the LLMProcessor class that handles all LLM-related operations
document_processor.py: Provides the DocumentProcessor class for loading and chunking documents
prompts.py: Stores all prompt templates used for LLM interactions
utils.py: Contains utility functions like display_results for formatting output
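
The division of labour described above (load and chunk, filter, then generate) can be pictured with a toy pipeline. This is only a sketch under stated assumptions: it does not use the library's classes or call an LLM, and every name in it (`Record`, `chunk_text`, `looks_useful`, `make_record`, `run_pipeline`) is a hypothetical stand-in. A simple word-count and digit heuristic takes the place of whatever quality filter the library actually applies.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Mirrors the fields shown under Output Format."""
    id: str
    document: str
    query: str
    answer: str


def chunk_text(text: str, max_words: int = 40) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def looks_useful(chunk: str, min_words: int = 8) -> bool:
    """Toy quality filter: keep chunks that are long enough and contain a digit."""
    return len(chunk.split()) >= min_words and any(ch.isdigit() for ch in chunk)


def make_record(index: int, chunk: str) -> Record:
    """Placeholder for the generation step; a real system would call an LLM here."""
    return Record(
        id=f"chunk_{index}",
        document=chunk,
        query="<query generated from chunk>",
        answer="<answer grounded in chunk>",
    )


def run_pipeline(text: str, max_words: int = 40) -> list[Record]:
    """Chunk the text, drop low-quality chunks, then build one record per kept chunk."""
    chunks = chunk_text(text, max_words)
    return [make_record(i, chunk) for i, chunk in enumerate(chunks) if looks_useful(chunk)]


# Hypothetical input: one informative sentence followed by filler text.
text = (
    "Nike reported annual revenue of $44.5 billion for fiscal year 2022, "
    "an increase of 5% compared to the previous year. "
    "Table of contents."
)
records = run_pipeline(text, max_words=20)
```

Filtering before generation means the generation step only ever sees chunks worth asking questions about, which is the ordering the Motivation section argues for.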

License

MIT License
