GitHub - DISLMcGill/pandas-query-generator: A synthetic pandas query generation tool

Pandas Query Generator 🐼

Pandas Query Generator (pqg) is a tool designed to help users generate synthetic pandas queries for training machine learning models that estimate query execution costs or predict cardinality.

The distributed Python package is called pqg, and has only been tested on a unix-based system.

Web interface showing query statistics after generation

Generated query output and execution results with the CLI

Installation

You can install the query generator using pip, the Python package manager:

pip install pqg

Alternatively, you can check out the hosted web playground if you're interested in generating queries fast with little setup.

Usage

The query generator exposes both a command-line tool and library interface.

CLI

Below is the standard output of pqg --help, which elaborates on the various command-line arguments the tool accepts:

usage: pqg [--disable-multi-processing] [--ensure-non-empty] [--filter] [--groupby-aggregation-probability] [--max-groupby-columns] [--max-merges] [--max-projection-columns] [--max-selection-conditions] [--multi-line] --num-queries [--output-file] [--projection-probability] --schema [--selection-probability] [--sort] [--verbose]

Pandas Query Generator CLI

options:
  -h --help Show this help message and exit
  --disable-multi-processing Generate and execute queries in a consecutive fashion (default: False)
  --ensure-non-empty Ensure generated queries return a non-empty result set when executed on sample data (default: False)
  --filter Filter generated queries by specific criteria
  --groupby-aggregation-probability Probability of including groupby aggregation operations (default: 0.5)
  --max-groupby-columns Maximum number of columns in group by operations (default: 5)
  --max-merges Maximum number of table merges allowed (default: 2)
  --max-projection-columns Maximum number of columns to project (default: 5)
  --max-selection-conditions Maximum number of conditions in selection operations (default: 5)
  --multi-line Format queries on multiple lines (default: False)
  --num-queries num_queries The number of queries to generate
  --output-file The name of the file to write the results to
  --projection-probability Probability of including projection operations (default: 0.5)
  --schema schema Path to the relational schema JSON file
  --selection-probability Probability of including selection operations (default: 0.5)
  --sort Whether or not to sort the queries by complexity (default: False)
  --verbose Print extra generation information and statistics (default: False)

The required options, as shown, are --num-queries and --schema. The --num-queries option simply instructs the program to generate a certain amount of queries.

The --schema option is a pointer to a JSON file path that describes meta-information about the data we're generating queries for.

A sample schema looks like this:

{
  "entities": {
    "customer": {
      "primary_key": "C_CUSTKEY",
      "properties": {
        "C_CUSTKEY": { "type": "int", "min": 1, "max": 1000 },
        "C_NAME": { "type": "string", "starting_character": ["A", "B", "C"] },
        "C_STATUS": { "type": "enum", "values": ["active", "inactive"] }
      },
      "foreign_keys": {}
    },
    "order": {
      "primary_key": "O_ORDERKEY",
      "properties": {
        "O_ORDERKEY": { "type": "int", "min": 1, "max": 5000 },
        "O_CUSTKEY": { "type": "int", "min": 1, "max": 1000 },
        "O_TOTALPRICE": { "type": "float", "min": 10.0, "max": 1000.0 },
        "O_ORDERSTATUS": {
          "type": "enum",
          "values": ["pending", "completed", "cancelled"]
        }
      },
      "foreign_keys": {
        "O_CUSTKEY": ["C_CUSTKEY", "customer"]
      }
    }
  }
}

This file can be found in /examples/customer/schema.json, generate a few queries from this schema with pqg --num-queries 100 --schema examples/customer/schema.json --verbose.

Other example schema files can be found under the /examples directory.

Library

We expose various structures that make it easy to generate queries fast:

from pqg import Generator, GenerateOptions, Schema, QueryStructure, QueryPool, QueryFilter

# Assumes `schema.json` exists and conforms to the schema format
schema = Schema.from_file('schema.json')

query_structure = QueryStructure(
  groupby_aggregation_probability=0.5,
  max_groupby_columns=4,
  max_merges=10,
  max_projection_columns=5,
  max_selection_conditions=10,
  projection_probability=0.5,
  selection_probability=0.5
)

generator = Generator(schema, query_structure)

# Generate 1000 queries
generate_options = GenerateOptions(num_queries=1000)
query_pool: QueryPool = generator.generate(generate_options)

# Filter out queries with non-empty result sets
query_pool.filter(QueryFilter.NON_EMPTY)

# Sort queries by complexity
query_pool.sort()

# Output each query
print(*query_pool, sep='\n\n')

Comprehensive internal documentation is generated using the sphinx Python package, and a live instance is hosted alongside the web playground.

How does it work?

Check out the paper in the docs folder or watch the demo for more information!

Prior Art

This version of the Pandas Query Generator is based off of the thorough research work of previous students of COMP 400 at McGill University, namely Ege Satir, Hongxin Huo and Dailun Li.

Name		Name	Last commit message	Last commit date
Latest commit History 69 Commits
.github/workflows		.github/workflows
assets		assets
bin		bin
docs		docs
examples		examples
src/pqg		src/pqg
tests		tests
www		www
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
.python-version		.python-version
.tokeirc		.tokeirc
README.md		README.md
bun.lockb		bun.lockb
justfile		justfile
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Pandas Query Generator 🐼

Installation

Usage

CLI

Library

How does it work?

Prior Art

About

Uh oh!

Releases

Packages

Uh oh!

Languages

DISLMcGill/pandas-query-generator

Folders and files

Latest commit

History

Repository files navigation

Pandas Query Generator 🐼

Installation

Usage

CLI

Library

How does it work?

Prior Art

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages