Data Balance Simulator CLI

To run the Swift REPL in a Docker container:

docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined swift:5.10.1 swift repl

Run

To run the project in a container:

make run CONFIG_FILE_PATH=config-files/base_config.json SIMULATOR_ARGS=[...]

If the environment already has Swift installed (e.g. when you are developing with the VSCode devcontainer feature):

make run IS_DEVCONTAINER=true CONFIG_FILE_PATH=config-files/base_config.json SIMULATOR_ARGS=[...]

Simulation Complexity

The number of simulations is determined by the execution parameters:

$\sum_{s = MinServices}^{MaxServices} \sum_{n = MinNodes}^{MaxNodes} \sum_{w = 1}^{\min(n, MaxWindowSize)} \left[ s^{w} \cdot (n - w + 1) + n \right]$

An execution with:

$ nodes = 5 \newline services = 6 \newline maxWindowSize = 4 \newline $

Includes the following number of samplings:

$ winSize = 1 \to samplings = (6^{1}) * 5 + 5\newline winSize = 2 \to samplings = (6^{2}) * 4 + 5\newline winSize = 3 \to samplings = (6^{3}) * 3 + 5\newline winSize = 4 \to samplings = (6^{4}) * 2 + 5\newline $

$6^{w}$ is the number of service combinations in a window of size $w$; it is multiplied by the number of windows in a simulation, $n - w + 1$. After a service is chosen, it is executed and the resulting dataset is stored and cached. This accounts for the $+ n$ term (one service for each node).
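The formula above can be checked with a short Python sketch. The function and parameter names (`count_samplings`, `count_simulations`, `max_window_size`, and so on) are illustrative and are not taken from the simulator's code; the sketch only evaluates the sum as written in this section.

```python
def count_samplings(nodes: int, services: int, max_window_size: int) -> int:
    """Samplings for one (services, nodes) pair, summed over all window sizes."""
    total = 0
    for w in range(1, min(nodes, max_window_size) + 1):
        combinations_per_window = services ** w  # s^w
        windows = nodes - w + 1                  # n - w + 1
        total += combinations_per_window * windows + nodes  # + n: one service per node
    return total


def count_simulations(min_services: int, max_services: int,
                      min_nodes: int, max_nodes: int,
                      max_window_size: int) -> int:
    """Full triple sum over services, nodes and window sizes."""
    return sum(
        count_samplings(n, s, max_window_size)
        for s in range(min_services, max_services + 1)
        for n in range(min_nodes, max_nodes + 1)
    )


# The worked example: 5 nodes, 6 services, maxWindowSize = 4
# 35 + 149 + 653 + 2597 = 3434
print(count_samplings(nodes=5, services=6, max_window_size=4))  # 3434
```

The four terms printed in the example (35, 149, 653 and 2597) are the per-window-size values listed above, so the total for that single configuration is 3434.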

Datasets

Datasets are located in the datasets folder. The following table describes the characteristics of each dataset:

| Dataset | Mean of column entropies | Variance of column entropies | Std dev of column entropies |
| --- | --- | --- | --- |
| high_variability | 11.80 | 0.24 | 0.49 |
| low_variability | 1.7 | 0.2 | 0.45 |
| inmates_enriched_10k | 5.35 | 13.09 | 3.62 |
| IBM_HR_Analytics_employee_attrition | 3.13 | 8.56 | 2.93 |
| red_wine_quality | 5.61 | 2.01 | 1.42 |
| avocado | 9.36 | 22.13 | 4.7 |

To compute the entropy of each column:

import numpy as np
import pandas as pd

# Name of a dataset file, without the .csv extension
dataset_name = "high_variability"
dataset = pd.read_csv(dataset_name + ".csv")

def get_column_probability(column: pd.Series) -> pd.Series:
    # Relative frequency of each distinct value in the column
    return column.value_counts(normalize=True)

def get_column_entropy(column: pd.Series) -> float:
    # Shannon entropy in bits: -sum(p * log2(p))
    column_probability = get_column_probability(column)
    return float(-(column_probability * np.log2(column_probability)).sum())


entropies = [get_column_entropy(dataset[column]) for column in dataset.columns]
# Mean, variance and standard deviation of the column entropies
# (np.var and np.std use the population formulas by default)
print(f"{round(np.mean(entropies), 2)}, {round(np.var(entropies), 2)}, {round(np.std(entropies), 2)}")
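As a sanity check on the entropy definition used above, here is a dependency-free sketch with the same formula applied to two toy columns. The helper name `column_entropy` and the toy data are illustrative assumptions, not part of the repository.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def column_entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    # sum(p * log2(1 / p)) is equivalent to -sum(p * log2(p))
    return sum((c / total) * math.log2(total / c) for c in counts.values())


constant_column = ["a"] * 8               # a single distinct value
uniform_column = ["a", "b", "c", "d"] * 2  # four equally frequent values

print(column_entropy(constant_column))  # 0.0
print(column_entropy(uniform_column))   # 2.0
```

A column with a single repeated value has zero entropy, while a column whose k distinct values are equally frequent reaches the maximum of log2(k) bits.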

Logging

To set the logger level, define an environment variable called LOGGER_LEVEL with one of the following values: trace, debug, info, notice, warning, error, critical (the default is info). Alternatively, pass this variable to make run.
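The lookup described above (read LOGGER_LEVEL, fall back to info) can be sketched in Python as follows. This is only an illustration of the rule, not the simulator's Swift implementation; the function name `resolve_logger_level` is hypothetical, and rejecting unknown values with an error is an assumption, since this README does not say how invalid values are handled.

```python
import os
from typing import Mapping

# Accepted values, as listed in this README
LEVELS = ("trace", "debug", "info", "notice", "warning", "error", "critical")
DEFAULT_LEVEL = "info"


def resolve_logger_level(env: Mapping[str, str] = os.environ) -> str:
    """Return the level from LOGGER_LEVEL, or the default when it is unset."""
    level = env.get("LOGGER_LEVEL", DEFAULT_LEVEL)
    if level not in LEVELS:
        # Assumption: unknown values are rejected rather than silently ignored
        raise ValueError(f"LOGGER_LEVEL must be one of {', '.join(LEVELS)}; got {level!r}")
    return level


print(resolve_logger_level({}))                         # info
print(resolve_logger_level({"LOGGER_LEVEL": "debug"}))  # debug
```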

DB Migrations and DB queries

To apply a DB migration, run make migrate-db SQL_CODE="your_migration_sql".

To run queries on the DB, run make run-query SQL_CODE="your_plain_sql".

K8s

The k8s/ folder contains all the resources needed to run a simulation on Kubernetes (k8s). After cd k8s, the following Makefile recipes are available:

install: deploy the setup resources
uninstall: uninstall the setup resources

run-simulation: run a simulation. Example:

make run-simulation NAME=test-sim VALUES_FILE=./run-simulation/files/base-params.yaml

delete-simulation: uninstall the resources created for the simulation
copy-dataset: copy a dataset into the volume used when running a simulation. Example:
```
make copy-dataset FILE_PATH=path/to/dataset.csv
```
query-db: open a SQLite connection to the specified DB (defaults to simulations.db). Example:
```
make query-db DB_PATH=simulations.db
```
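If a copy of the SQLite file is available locally, it can also be inspected from Python with the standard `sqlite3` module. The sketch below is a hedged illustration: the helper `list_tables` is hypothetical, and it assumes nothing about the schema beyond the file being a SQLite database, so it only lists the table names it finds.

```python
import sqlite3
from contextlib import closing
from typing import List


def list_tables(db_path: str = "simulations.db") -> List[str]:
    """Return the names of the tables stored in a SQLite database file."""
    # Note: sqlite3.connect creates an empty database if db_path does not exist
    with closing(sqlite3.connect(db_path)) as connection:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # Assumes a local copy of the database in the current directory
    print(list_tables("simulations.db"))
```

Listing the tables first is a safe starting point before writing queries against a schema that is not documented here.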

Deepnote experiments

SESARLab/data-quality-simulator
