# Function Calling Dataset Generator

This repository contains the code used to generate the synthetic dataset for training the Pythonic function calling model **Dria-Agent-a-7B**.

## Overview

The data generation pipeline consists of three main stages, executed sequentially to produce high-quality synthetic data for function calling scenarios. The pipeline leverages the Dria framework to generate data using multiple models across edge devices.

## Setup

This project uses `uv` for dependency management.

Make sure you have Rust installed:

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"
```

### Linux

Install the following dependencies for `libsecp256k1` on Linux:

```bash
sudo apt-get update
sudo apt-get install -y \
    automake \
    autoconf \
    libtool \
    pkg-config \
    libffi-dev \
    libssl-dev \
    python3-dev
sudo apt-get install -y build-essential
```

### macOS

Install the Xcode command line tools (for `gcc`):

```bash
xcode-select --install
```

Install Homebrew and the required dependencies:

```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install automake libtool pkg-config
```

If you're having issues on macOS, see:

## Installation

1. Install `uv`:

   ```bash
   pip install uv
   ```

2. Create and activate a virtual environment:

   ```bash
   uv venv
   source .venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   uv pip install -e .
   ```

## Pipeline Stages

1. **Scenario Generation** (`run_s1.py`)
   - Generates base scenarios from the curriculum
   - Uses `curriculum.csv` as input
   - Produces `scenarios.json`
2. **Function Generation** (`run_s2.py`)
   - Generates function definitions and schemas
   - Takes `scenarios.json` as input
   - Produces `functions.json`
3. **Conversation Generation** (`run_s3.py`)
   - Generates conversation flows and function calls
   - Combines the scenarios and functions
   - Produces the final dataset entries (data flow sketched below)
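
To make the data flow concrete, here is a minimal orchestration sketch. The stage scripts and file names come from this repository; chaining them with `subprocess` is an illustrative assumption (`start.sh` is the supported entry point):

```python
# Illustrative sketch only: run the three stages in order.
# The supported entry point is start.sh; this just mirrors the data flow above.
import subprocess

stages = [
    ("run_s1.py", "curriculum.csv -> scenarios.json"),
    ("run_s2.py", "scenarios.json -> functions.json"),
    ("run_s3.py", "scenarios.json + functions.json -> dataset entries"),
]

for script, flow in stages:
    print(f"Running {script} ({flow})")
    # Each stage consumes the previous stage's output, so order matters.
    subprocess.run(["uv", "run", script], check=True)
```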

## Generated Data Types

| Type | Description | Percentage |
|------|-------------|------------|
| `simple` | Single function schema, single function call | 27.6% |
| `parallel` | Single function schema, multiple function calls | 27.5% |
| `multiple` | Multiple function schemas, single function call | 12.2% |
| `step_by_step`* | Multiple function schemas, multiple function calls, with step-by-step reasoning | 21.5% |
| `multi_turn` | Multiple function schemas, multiple function calls, multi-turn | 11.1% |

*Note: This repository does not include the code for generating the `step_by_step` category, which accounts for 21.5% of the final dataset.
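
As a sanity check, the distribution above can be recomputed from a generated dataset file. A minimal sketch, assuming the final dataset is a single JSON array of entries shaped like the schema in the Output section (the file name is illustrative):

```python
# Hypothetical sketch: recount the per-type distribution of generated entries.
import json
from collections import Counter

with open("pipeline/data/dataset.json") as f:  # illustrative path
    entries = json.load(f)

counts = Counter(entry["type"] for entry in entries)
total = sum(counts.values())
for data_type, count in counts.most_common():
    print(f"{data_type:>13}: {count / total:6.1%} ({count})")
```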

## Usage

Run the complete pipeline:

```bash
chmod +x start.sh
uv run ./start.sh
```

Or run the stages separately:

```bash
uv run run_s1.py
uv run run_s2.py
uv run run_s3.py
```

And you're set: the pipeline will write the generated dataset to the `pipeline/data` folder.

"Data generation takes time!" - Unknown

## Pipeline Folder Structure

```
├── pipeline/
│   ├── data/                  # Generated dataset output
│   ├── s1_scenario/           # Stage 1: Scenario generation
│   │   ├── __init__.py
│   │   ├── prompt.md
│   │   └── task.py
│   ├── s2_functions/          # Stage 2: Function generation
│   │   ├── __init__.py
│   │   ├── parser.py
│   │   ├── prompt.md
│   │   └── task.py
│   └── s3_queries/            # Stage 3: Query generation
│       ├── multiturn/         # Multi-turn conversation generation
│       │   ├── __init__.py
│       │   ├── prompt.md
│       │   └── task.py
│       ├── parallel/          # Parallel function calls generation
│       │   ├── __init__.py
│       │   ├── prompt.md
│       │   └── task.py
│       └── simple/            # Simple function calls generation
│           ├── __init__.py
│           ├── prompt.md
│           └── task.py
```

## Output

The pipeline generates data in the following format:

```
{
    "id": string,
    "domain": string,
    "subdomain": string,
    "tools": string,
    "conversations": [
        {
            "content": string,
            "role": string
        }
    ],
    "type": string
}
```
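
A short sketch for loading and inspecting an entry in this format (the file name is illustrative; note that `tools` is typed as a single string, not a parsed structure):

```python
# Hypothetical sketch: read one generated entry and print its fields.
import json

with open("pipeline/data/dataset.json") as f:  # illustrative path
    entries = json.load(f)

entry = entries[0]
print(entry["id"], entry["domain"], entry["subdomain"], entry["type"])
print(entry["tools"][:200])  # "tools" is stored as a raw string
for turn in entry["conversations"]:
    print(f'{turn["role"]}: {turn["content"][:80]}')
```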

## License

Apache 2.0

## Additional Information

Filtering and multi-turn data generation with RLEF are not included in this repository.

For more information about the generated dataset and its applications, see: