Pasteur is a library for managing the end-to-end process of structured data synthesis. It includes the synthesis algorithms MARE, PrivBayes, AIM, and MST, along with a variety of evaluation metrics and data transformation tools. It also ships with a collection of premade datasets, focusing on MIMIC-IV.

Example Usage

Preparing the work area

In the same directory, clone the following repositories:

git clone https://github.com/pasteur-dev/pasteur
# For privacy metrics: a fork of syntheval with a small change that allows calling the library
git clone https://github.com/antheas/syntheval

Then, install the dependencies in the same virtual environment:

# In the pasteur folder
cd pasteur
python3 -m venv venv
source venv/bin/activate
pip install -e .
pip install -e ../syntheval

# todo: include in the requirements
pip install tabulate

You can now place your data in the raw/ folder of the pasteur directory.

Synthesizing tabular datasets

With the commands below you can synthesize the tabular Adult and MIMIC-IV datasets.

Synthesizing the Adult Dataset

Adult is a tabular dataset that can be used to test that synthesis works.

# Download and unzip
pasteur download --accept adult
pasteur bootstrap adult

# Ingest the datasets and views created from those
pasteur ingest_dataset adult
pasteur ingest_view tab_adult

# Run synthesis executions
## Normal hyperparameters
pasteur p tab_adult.privbayes

## Normal hyperparameters + run ingest_view, ingest_dataset
pasteur p tab_adult.privbayes --all

## Sweep (privacy budgets 1,2,5,10,100)
pasteur s tab_adult.privbayes -i i="range(5)" alg.etotal="[1,2,5,10,100][i]" alg.theta="[5,5,10,10,25][i]"
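
The sweep above pairs each privacy budget with a theta value by indexing both lists with the same iterator i. As a rough sketch, assuming each iteration behaves like a single `pasteur p` run with the selected values passed as overrides (an assumption about the CLI, not something this README states), the five combinations can be listed with plain shell. The function name sweep_commands is made up for this illustration, and the script only prints commands without running Pasteur.

```shell
# Illustrative only: prints one command per sweep iteration.
# Assumption: iteration i of the sweep corresponds to a single run that
# uses the i-th entry of each list as an override.
sweep_commands() {
    # etotal:theta pairs, taken index by index from the two sweep lists
    for pair in 1:5 2:5 5:10 10:10 100:25; do
        etotal=${pair%%:*}   # text before the colon
        theta=${pair##*:}    # text after the colon
        echo "pasteur p tab_adult.privbayes alg.etotal=$etotal alg.theta=$theta"
    done
}

sweep_commands
```

Reading the pairs this way makes it easy to check that the larger budgets (10 and 100) get the larger theta values before starting a long sweep.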

Synthesizing the MIMIC-IV Datasets

MIMIC-IV is a large dataset that can be partitioned in a variety of ways to create synthetic datasets. You need PhysioNet credentials to download the data.

Here, we run two datasets. MIMIC Admissions is a tabular dataset created by combining the admissions table of MIMIC-IV with the patients table. MIMIC Billion is a collection of columns from the ICU chart events, along with some columns from the patients and ICU stays tables, duplicated to reach 1 billion rows.

# Download (you will be prompted for your credentials)
# takes a while to download
pasteur download --accept mimic_iv
# Data is not zipped, no bootstrap needed

# Ingest the MIMIC-IV tables
# Requires a lot of memory per worker. For 64GB of RAM, 5 workers are ok
# Takes ~45min for 5 workers
pasteur ingest_dataset mimic -w 5

#
# MIMIC Admissions
#

pasteur ingest_view mimic_tab_admissions

pasteur s mimic_tab_admissions.privbayes -i i="range(3)" alg.etotal="[0.01, 0.1, 1][i]" alg.theta="[5,5,10][i]" -p

#
# MIMIC Billion
#
# Workers during ingest are very memory intensive, 3 is good for 64GB RAM
pasteur ingest_view mimic_billion -w 3
# Needs more than 64GB of RAM to run parallelized with, e.g., more than 10 cores
pasteur p mimic_billion.privbayes alg.etotal=0.001

# You can view the resulting experiments with:
mlflow ui --backend-store-uri data/reporting/flow
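
The comments above give worker counts only for a 64GB machine: 5 workers for ingest_dataset and 3 for the mimic_billion view. For other machines, one possible starting point is to scale those counts linearly with RAM. This is a sketch under that assumption only; memory use may not scale linearly in practice, and suggest_workers is a hypothetical helper that is not part of Pasteur.

```shell
# Hypothetical helper, not part of Pasteur: scales the worker counts quoted
# for 64GB of RAM linearly with the RAM that is actually available.
suggest_workers() {
    ram_gb=$1            # available RAM in GB
    workers_per_64gb=$2  # worker count reported as fine for 64GB
    workers=$(( ram_gb * workers_per_64gb / 64 ))
    # Always run at least one worker
    if [ "$workers" -lt 1 ]; then workers=1; fi
    echo "$workers"
}

# Print the commands for a machine with 128GB of RAM (nothing is executed)
echo "pasteur ingest_dataset mimic -w $(suggest_workers 128 5)"
echo "pasteur ingest_view mimic_billion -w $(suggest_workers 128 3)"
```

Treat the result as an upper bound to try first, and lower the count if workers run out of memory.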

Synthesizing relational data

MARE, in combination with PrivBayes, can also synthesize relational data. Two of the datasets from the paper are publicly available: MIMIC Core and MIMIC ICU Charts.

These datasets are essentially relational versions of the MIMIC-ICU and MIMIC Core datasets. Unlike in those, the tables are not flattened but kept in their relational form: the tables are linked with foreign keys, and the one-to-many relationships are preserved.

The download and ingest commands are the same as in the previous section and are repeated below. Skip them if you have already run them.

# Download (you will be prompted for your credentials)
# takes a while to download
pasteur download --accept mimic_iv
# Data is not zipped, no bootstrap needed

# Ingest the MIMIC-IV tables
# Requires a lot of memory per worker. For 64GB of RAM, 5 workers are ok
# Takes ~45min for 5 workers
pasteur ingest_dataset mimic -w 5

# Below are the commands that run the experiments shown in the MARE paper
# with MIMIC-IV.

#
# MIMIC Core
#

pasteur ingest_view mimic_core

# This is a privacy budget sweep. We also change the PrivBayes theta param
# to account for the larger privacy budget.
pasteur s mimic_core.mare -i i="range(5)" alg.etotal="[1,2,5,10,100][i]" alg.theta="[5,5,10,10,25][i]" -p
# This is the ablation study from the MARE paper. It turns on/off components
# of the algorithm to see how they affect the results.
pasteur s mimic_core.mare -i noh='range(4)' alg.etotal="2" alg.theta='5' alg.no_hist='noh == 0 or noh == 2' alg.no_seq='noh == 0 or noh == 1' -p

#
# MIMIC-ICU
#
pasteur iv mimic_icu 

# Same setup as above.
pasteur s mimic_icu.mare -i i="range(5)" alg.etotal="[1,2,5,10,100][i]" alg.theta="[5,5,10,10,25][i]" -p
pasteur s mimic_icu.mare -i noh='range(4)' alg.etotal="2" alg.theta='5' alg.no_hist='noh == 0 or noh == 2' alg.no_seq='noh == 0 or noh == 1' -p

# You can view the resulting experiments with:
mlflow ui --backend-store-uri data/reporting/flow
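
The ablation sweeps above derive two boolean flags from a single iterator noh, which takes the values 0 to 3. The sketch below spells out the resulting four combinations in plain shell. It only mirrors the two expressions from the command line. What each flag disables inside MARE is not described here, and ablation_row is a name made up for this illustration.

```shell
# Illustrative only: evaluates the two ablation expressions for one value
# of noh and prints the flag values that result.
ablation_row() {
    noh=$1
    no_hist=False
    no_seq=False
    # alg.no_hist='noh == 0 or noh == 2'
    if [ "$noh" -eq 0 ] || [ "$noh" -eq 2 ]; then no_hist=True; fi
    # alg.no_seq='noh == 0 or noh == 1'
    if [ "$noh" -eq 0 ] || [ "$noh" -eq 1 ]; then no_seq=True; fi
    echo "noh=$noh no_hist=$no_hist no_seq=$no_seq"
}

# One row per sweep iteration, matching range(4)
for noh in 0 1 2 3; do
    ablation_row "$noh"
done
```

The four rows cover every on/off combination of the two flags, so iteration 0 sets both and iteration 3 sets neither.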

Citations

This work has been part of two papers so far. If you use Pasteur in your work, please cite the first paper, and if you use the synthesis algorithm MARE, please cite the second paper as well:

Kapenekakis, A., Dell'Aglio, D., Bøgsted, M., Garofalakis, M., & Hose, K. (Accepted/In press). Pasteur: Scaling Privacy-aware Data Synthesis. In The 29th European Conference on Advances in Databases and Information Systems (ADBIS 2025). Springer.
A. Kapenekakis et al., "Synthesizing Accurate Relational Data under Differential Privacy," 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 2024, pp. 433-439, doi: 10.1109/BigData62323.2024.10825515.

The first paper covers the system itself, while the second paper focuses on the MARE algorithm for relational data synthesis. You should also cite the tabular algorithms you use (e.g., PrivBayes, AIM, MST), which are not part of this work.

Acknowledgements

This project received funding from the European Union's Horizon 2020 research and innovation programme under Marie Skłodowska-Curie (grant No 955895), the Poul Due Jensens Fond (Grundfos Foundation), and the Novo Nordisk Foundation (grant number NNF23OC0083510).
