tpch-datagen - by GizmoData™

A utility to generate TPC-H data in parallel using DuckDB and multi-processing

Why?

Generating TPC-H data can be time-consuming and resource-intensive, especially at larger scale factors. This project splits the work into chunks and generates them in parallel using DuckDB and multi-processing.
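As a rough illustration of the idea, the sketch below plans one generation job per chunk and hands the jobs to a process pool. It assumes the chunks map onto the `children` and `step` parameters of the `dbgen` function in DuckDB's tpch extension; this README does not confirm that this is how tpch-datagen works internally. The function names (`build_dbgen_sql`, `plan_statements`, `run_chunk`, `generate_in_parallel`) are hypothetical, and the worker is a placeholder that does not call DuckDB.

```python
from multiprocessing import Pool


def build_dbgen_sql(scale_factor: int, num_chunks: int, step: int) -> str:
    """Build the statement for one chunk (assumes dbgen's children/step parameters)."""
    if not 0 <= step < num_chunks:
        raise ValueError("step must be in the range [0, num_chunks)")
    return f"CALL dbgen(sf = {scale_factor}, children = {num_chunks}, step = {step})"


def plan_statements(scale_factor: int, num_chunks: int) -> list:
    """Return one statement per chunk, covering steps 0 .. num_chunks - 1."""
    return [build_dbgen_sql(scale_factor, num_chunks, step) for step in range(num_chunks)]


def run_chunk(statement: str) -> str:
    """Placeholder worker: a real worker would execute the statement in its own
    DuckDB connection and export the resulting tables to Parquet files."""
    return statement


def generate_in_parallel(scale_factor: int, num_chunks: int, num_processes: int) -> list:
    """Distribute the per-chunk statements across a pool of worker processes."""
    statements = plan_statements(scale_factor, num_chunks)
    # Never start more processes than there are chunks to work on.
    with Pool(processes=min(num_processes, num_chunks)) as pool:
        return pool.map(run_chunk, statements)
```

Calling `generate_in_parallel(1, 4, 2)` from a script's `if __name__ == "__main__":` block would process four chunks with two worker processes. Because each chunk is independent, more chunks lower the memory needed per job at the cost of more output files.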

Setup (to run locally)

Install Python package

You can install tpch-datagen from PyPI or from source.

Option 1 - from PyPI

# Create the virtual environment
python3 -m venv .venv

# Activate the virtual environment
. .venv/bin/activate

pip install tpch-datagen

Option 2 - from source - for development

git clone https://github.com/gizmodata/tpch-datagen

cd tpch-datagen

# Create the virtual environment
python3 -m venv .venv

# Activate the virtual environment
. .venv/bin/activate

# Upgrade pip, setuptools, and wheel
pip install --upgrade pip setuptools wheel

# Install TPC-H Datagen - in editable mode with dev dependencies
# (the quotes keep shells such as zsh from expanding the square brackets)
pip install --editable ".[dev]"

Note

For the following commands: if you are running from source in --editable mode (for development purposes), you will need to set the PYTHONPATH environment variable as follows:

export PYTHONPATH=$(pwd)/src

Usage

Here are the options for the tpch-datagen command:

tpch-datagen --help
Usage: tpch-datagen [OPTIONS]

Options:
  --version / --no-version        Prints the TPC-H Datagen package version and
                                  exits.  [required]
  --scale-factor INTEGER          The TPC-H Scale Factor to use for data
                                  generation.
  --data-directory TEXT           The target output data directory to put the
                                  files into  [default: data; required]
  --work-directory TEXT           The work directory to use for data
                                  generation.  [default: /tmp; required]
  --overwrite / --no-overwrite    Can we overwrite the target directory if it
                                  already exists...  [default: no-overwrite;
                                  required]
  --num-chunks INTEGER            The number of chunks that will be generated
                                  - more chunks equals smaller memory
                                  requirements, but more files generated.
                                  [default: 10; required]
  --num-processes INTEGER         The maximum number of processes for the
                                  multi-processing pool to use for data
                                  generation.  [default: 10; required]
  --duckdb-threads INTEGER        The number of DuckDB threads to use for data
                                  generation (within each job process).
                                  [default: 1; required]
  --per-thread-output / --no-per-thread-output
                                  Controls whether to write the output to a
                                  single file or multiple files (for each
                                  process).  [default: per-thread-output;
                                  required]
  --compression-method [none|snappy|gzip|zstd]
                                  The compression method to use for the
                                  parquet files generated.  [default: zstd;
                                  required]
  --file-size-bytes TEXT          The target file size for the parquet files
                                  generated.  [default: 100m; required]
  --help                          Show this message and exit.

Note

The default values shown above may differ on your machine; some depend on your environment, such as the number of CPU cores available.
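For scripted runs, the options above can be assembled programmatically. The following is a minimal sketch that builds an argument list using only flags shown in the help text; the helper names `build_command` and `run_datagen` are hypothetical and not part of the package, and the chosen values are for illustration only.

```python
import shlex
import subprocess


def build_command(
    scale_factor: int,
    data_directory: str = "data",
    num_chunks: int = 10,
    num_processes: int = 10,
    overwrite: bool = False,
) -> list:
    """Assemble the argument list for one tpch-datagen run."""
    return [
        "tpch-datagen",
        "--scale-factor", str(scale_factor),
        "--data-directory", data_directory,
        "--num-chunks", str(num_chunks),
        "--num-processes", str(num_processes),
        "--overwrite" if overwrite else "--no-overwrite",
    ]


def run_datagen(command: list) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)


# Print the equivalent shell command without executing it.
print(shlex.join(build_command(scale_factor=1, num_chunks=4, num_processes=4)))
```

Passing the result of `build_command` to `run_datagen` starts the actual generation, provided tpch-datagen is installed in the active environment.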

Handy development commands

Version management

Bump the application version (requires an install from source with the [dev] extras):

bumpver update --patch
