Simulator to find best options of privacy and quality in a data pipeline

SESARLab/data-quality-simulator

Data Balance Simulator CLI

To run the Swift REPL in a Docker container:

docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined swift:5.10.1 swift repl

Run

To run the project in a container:

make run CONFIG_FILE_PATH=config-files/base_config.json SIMULATOR_ARGS=[...]

If the environment already has Swift installed (e.g. when you are developing with the VSCode devcontainer feature):

make run IS_DEVCONTAINER=true CONFIG_FILE_PATH=config-files/base_config.json SIMULATOR_ARGS=[...]

Simulation Complexity

The number of simulations is determined by the execution parameters:

$\sum_{s = MinServices}^{MaxServices} \sum_{n = MinNodes}^{MaxNodes} \sum_{w = 1}^{\min(n, MaxWindowSize)} \left( s^{w} \cdot (n - w + 1) + n \right)$

An execution with:

$ nodes = 5 \newline services = 6 \newline maxWindowSize = 4 \newline $

Includes the following number of samplings:

$ winSize = 1 \to samplings = (6^{1}) * 5 + 5\newline winSize = 2 \to samplings = (6^{2}) * 4 + 5\newline winSize = 3 \to samplings = (6^{3}) * 3 + 5\newline winSize = 4 \to samplings = (6^{4}) * 2 + 5\newline $

$6^{w}$ (that is, $s^{w}$ with $s = 6$) is the number of service combinations in a window of size $w$. It is multiplied by the number of windows in a simulation, $n - w + 1$. Once a service is chosen for a node, it is executed and the resulting dataset is stored and cached. This accounts for the $+ n$ term (one service for each node).
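As a quick check of the formula and the worked example above, the count can be reproduced with a small script. This is a minimal sketch, not part of the simulator: the function names `samplings_for_window` and `count_samplings` and their parameter names are hypothetical and simply mirror the symbols in the formula.

```python
def samplings_for_window(services: int, nodes: int, window_size: int) -> int:
    """Samplings for one (services, nodes, window size) triple: s^w * (n - w + 1) + n."""
    combinations_per_window = services ** window_size
    windows_per_simulation = nodes - window_size + 1
    return combinations_per_window * windows_per_simulation + nodes


def count_samplings(min_services: int, max_services: int,
                    min_nodes: int, max_nodes: int,
                    max_window_size: int) -> int:
    """Evaluate the triple sum over services, nodes, and window sizes."""
    total = 0
    for s in range(min_services, max_services + 1):
        for n in range(min_nodes, max_nodes + 1):
            # The window size cannot exceed the number of nodes.
            for w in range(1, min(n, max_window_size) + 1):
                total += samplings_for_window(s, n, w)
    return total


if __name__ == "__main__":
    # Example from the text: nodes = 5, services = 6, maxWindowSize = 4
    for w in range(1, 5):
        print(f"winSize = {w} -> samplings = {samplings_for_window(6, 5, w)}")
    print("total =", count_samplings(6, 6, 5, 5, 4))
```

For the example parameters, the per-window values are 35, 149, 653, and 2597, for a total of 3434 samplings.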


Datasets

Datasets are located in the datasets folder. The following table describes the characteristics of each dataset:

| Dataset | Mean of column entropies | Variance of column entropies | Std dev of column entropies |
|---|---|---|---|
| high_variability | 11.80 | 0.24 | 0.49 |
| low_variability | 1.7 | 0.2 | 0.45 |
| inmates_enriched_10k | 5.35 | 13.09 | 3.62 |
| IBM_HR_Analytics_employee_attrition | 3.13 | 8.56 | 2.93 |
| red_wine_quality | 5.61 | 2.01 | 1.42 |
| avocado | 9.36 | 22.13 | 4.7 |

To compute the entropy of each column:

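import numpy as np
import pandas as pd

# Example value: name of a dataset CSV file without the ".csv" extension.
# Adjust it (and the path) to the dataset you want to analyse.
dataset_name = "high_variability"
dataset = pd.read_csv(dataset_name + ".csv")

# Number of rows in the dataset
dataset_size = len(dataset)

def get_column_frequency(column: pd.Series) -> pd.Series:
    # Absolute count of each distinct value in the column
    return column.value_counts()

def get_column_probability(column: pd.Series) -> pd.Series:
    # Relative frequency of each distinct value (NaN values are excluded by default)
    return column.value_counts(normalize=True)

def get_column_entropy(column: pd.Series) -> float:
    # Shannon entropy in bits: -sum(p * log2(p))
    column_probability = get_column_probability(column)
    return float(-(column_probability * np.log2(column_probability)).sum())


# One entropy value per column, then mean, population variance, and population std dev
entropies = [get_column_entropy(dataset[column]) for column in dataset.columns]
print(f"{round(np.mean(entropies), 2)}, {round(np.var(entropies), 2)}, {round(np.std(entropies), 2)}")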
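To build intuition for the values in the table, the same entropy measure can be tried on a few toy columns. This is a minimal standard-library sketch: the helper `shannon_entropy` and the sample columns are hypothetical and not part of the repository.

```python
from collections import Counter
from math import log2


def shannon_entropy(values) -> float:
    """Shannon entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return sum(-(c / total) * log2(c / total) for c in counts.values())


# Toy columns (illustrative only)
uniform = ["a", "b", "c", "d"]   # four equally likely values
skewed = ["a", "a", "a", "b"]    # one dominant value
constant = ["a", "a", "a", "a"]  # a single value

for name, column in [("uniform", uniform), ("skewed", skewed), ("constant", constant)]:
    print(f"{name}: {shannon_entropy(column):.3f} bits")
```

A column with four equally likely values has an entropy of 2 bits, a constant column has 0 bits, and a skewed column falls in between. Columns with many distinct, evenly spread values therefore push a dataset's mean entropy up.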

Logging

To set the logger level, define an environment variable called LOGGER_LEVEL with one of the following values: trace, debug, info, notice, warning, error, critical (the default is info). Alternatively, pass this variable to make run.


DB Migrations and DB queries

To apply a DB migration, run make migrate-db SQL_CODE="your_migration_sql".

To run queries on the DB, run make run-query SQL_CODE="your_plain_sql".


K8s

The k8s/ folder contains all the resources needed to run a simulation on k8s. After cd k8s, the following Makefile recipes are available:

  • install: deploy the setup resources
  • uninstall: uninstall the setup resources
  • run-simulation: run a simulation. Example:
    make run-simulation NAME=test-sim VALUES_FILE=./run-simulation/files/base-params.yaml
  • delete-simulation: uninstall the resources created for the simulation
  • copy-dataset: copy a dataset into the volume used when running a simulation. Example:
    make copy-dataset FILE_PATH=path/to/dataset.csv
  • query-db: open a SQLite connection to the specified DB (defaults to simulations.db). Example:
    make query-db DB_PATH=simulations.db

Deepnote experiments