From 6dce8d55cc227c520c4b1626318870b7ebb12393 Mon Sep 17 00:00:00 2001
From: Claudio Ardagna
Date: Wed, 20 Mar 2024 18:10:35 +0100
Subject: [PATCH] Claudio

---
 metrics.tex      | 7 +++++--
 system_model.tex | 4 ++--
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/metrics.tex b/metrics.tex
index 7ec8b74..c6dfeec 100644
--- a/metrics.tex
+++ b/metrics.tex
@@ -8,10 +8,10 @@ \section{Maximizing the Pipeline Instance Quality}\label{sec:heuristics}
 %Inspired by existing literature, these metrics, categorized as quantitative and statistical, play a pivotal role in quantifying the impact of policy-driven transformations on the original dataset.

 \subsection{Quality Metrics}\label{sec:metrics}
-Ensuring data quality is mandatory to implement data pipelines that provide high-quality results and decision-making along the whole pipeline execution. To this aim, we define a set of metrics evaluating the quality loss introduced by our policy-driven transformation in Section~\cite{ADD} on the input dataset \origdataset at each step of the data pipeline. Our metrics can be classified as \emph{quantitative} and \emph{statistical}~\cite{ADD}, and compare the input dataset \origdataset\ and dataset \transdataset\ generated by enforcing data protection requirements on \origdataset.
+Ensuring data quality is mandatory to implement data pipelines that provide high-quality results and support decision-making throughout the whole pipeline execution. To this end, we define two metrics evaluating the quality loss introduced by our policy-driven transformation in Section~\cite{ADD} on the input dataset \origdataset\ at each step of the data pipeline. Our metrics can be classified as \emph{quantitative} and \emph{qualitative}~\cite{ADD}, and compare the input dataset \origdataset\ with the dataset \transdataset\ generated by enforcing data protection requirements on \origdataset.

 Quantitative metrics monitor the amount of data lost during data transformations as the difference in quality between datasets \origdataset\ and \transdataset.
-Statistical metrics take into consideration the changes in the statistical properties of datasets \origdataset\ and \transdataset. We note that these metrics can be applied either to the entire dataset or to specific features only. The features can be assigned with equal or varying importance, enabling the prioritization of important features that might be possibly lost during the policy-driven transformation in Section~\cite{ADD}.
+Qualitative metrics consider the changes in the properties of datasets \origdataset\ and \transdataset; for instance, they can measure the changes in the statistical distribution of the two datasets.

 \subsubsection{Jaccard coefficient}
 The Jaccard coefficient can be used to measure the difference between the elements in two datasets.
@@ -39,6 +39,9 @@ \subsubsection{Jensen-Shannon Divergence}
 % It provides a more comprehensive understanding of the dissimilarity between X and Y, taking into account the characteristics of both datasets.

+\vspace{0.5em}
+
+We note that our metrics can be applied either to the entire dataset or to specific features only. Features can be assigned equal or varying importance, enabling the prioritization of important features that might be lost during the policy-driven transformation in Section~\cite{ADD}. A complete taxonomy of possible metrics is outside the scope of this paper and is left for future work.

 \subsection{NP-Hardness of the Max Quality Pipeline Instantiation Process}\label{sec:nphard}
 \hl{what if we define it formally as the problem of finding a valid instance, according to the definition of instance, such that no valid instance with a smaller loss exists?}
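As a quick reference for this hunk, a minimal LaTeX sketch of the two metrics named above in their standard textbook form, assuming that \origdataset\ and \transdataset\ are compared as element sets $X$ and $Y$ for the Jaccard coefficient and as discrete distributions $P$ and $Q$ for the Jensen-Shannon divergence; the formulations actually used in the unchanged parts of metrics.tex may differ.

% Sketch only: X = \origdataset and Y = \transdataset as element sets; P, Q the corresponding
% discrete distributions over a common support; 1 - J(X,Y) is one possible quantitative loss.
\begin{equation}
  J(X,Y) = \frac{|X \cap Y|}{|X \cup Y|}, \qquad \mathit{loss}_J = 1 - J(X,Y)
\end{equation}
\begin{equation}
  \mathit{JSD}(P \parallel Q) = \frac{1}{2} D_{\mathrm{KL}}(P \parallel M) + \frac{1}{2} D_{\mathrm{KL}}(Q \parallel M), \qquad M = \frac{1}{2}(P + Q)
\end{equation}

Both quantities lie in $[0,1]$ (the Jensen-Shannon divergence when computed with base-2 logarithms), so they can be weighted per feature as discussed in the added paragraph above.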
diff --git a/system_model.tex b/system_model.tex
index d9d090e..5b3d950 100644
--- a/system_model.tex
+++ b/system_model.tex
@@ -34,10 +34,10 @@ \subsection{Service Pipeline and Reference Scenario}\label{sec:service_definitio
 \begin{definition}[\pipeline]\label{def:pipeline}
 %	A \pipeline is as a direct acyclic graph G(\V,\E), where \V\ is a set of vertices and \E\ is a set of edges connecting two vertices \vi{i},\vi{k}$\in$\V. The graph has a root \vi{r}$\in$\V, a vertex \vi{i}$\in$\V$_S$ for each service $s_i$, two additional vertices \vi{c},\vi{m}$\in$\V$_{\timesOperator}$$\subset$\V\ for each alternative ($\timesOperator$) structure modeling the alternative execution (\emph{choice}) of operations and the retrieval (\emph{merge}) of the results, respectively, and one additional vertex \vi{f} $\in$\V$_{\plusOperator}$$\subset$\V\ for each parallel ($\plusOperator$) structure modeling the contemporary execution (\emph{fork}) of operations.
 	A \pipeline is a directed acyclic graph G(\V,\E), where \V\ is a set of vertices and \E\ is a set of edges connecting two vertices \vi{i},\vi{k}$\in$\V.
-	The graph has a root \vi{r}$\in$\V, a vertex \vi{i}$\in$\V$_S$ for each service $s_i$, two additional vertices \vi{c},\vi{f}$\subset$\V\ for each alternative ($\timesOperator$) structure modeling the alternative execution (\emph{choice}) and for each parallel ($\plusOperator$) structure modeling the contemporary execution (\emph{fork}) of services, respectively.
+	The graph has a root \vi{r}$\in$\V, a vertex \vi{i}$\in$\V$_S$ for each service $s_i$, and an additional vertex \vi{f}$\in$\V\ for each parallel ($\plusOperator$) structure modeling the concurrent execution (\emph{fork}) of services.
 \end{definition}

-We note that \{\vi{r},\vi{c},\vi{f}\}$\cup$\V$_S$$=$\V, vertices \vi{c} and \vi{f} model branching for alternative/parallel structures, and root \vi{r} possibly represents the orchestrator.
+We note that \{\vi{r},\vi{f}\}$\cup$\V$_S$$=$\V, vertices \vi{f} model branching for parallel structures, and root \vi{r} possibly represents the orchestrator. We also note that, for simplicity but with no loss of generality, alternative structures modeling the alternative execution of services are not specified within a single service pipeline, but are rather modeled as alternative service pipelines.
 %	A service pipeline is as a direct acyclic graph G(\V,\E), where \V\ is a set of vertices, one for each service $s_i$ in the pipeline, \E\ is a set of edges connecting two services $s_i$ and $s_j$, and \myLambda\ is an annotation function that assigns a label \myLambda(\vi{i}), corresponding to a data transformation \F\ implemented by the service $s_i$, for each vertex \vi{i}$\in$\V.

 Our reference scenario considers a service pipeline analyzing a dataset of individuals detained in Department of Correction facilities in the state of Connecticut while awaiting trial \cite{toadd}.
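Following the highlighted note in metrics.tex, a possible formal statement of the Max Quality Pipeline Instantiation problem is sketched below; the set $\mathcal{I}(G)$ of valid instances of a pipeline $G$ and the loss function $\mathit{qloss}$ are placeholder symbols, not notation introduced by this patch.

% Sketch only: \mathcal{I}(G) (valid instances of pipeline G) and qloss (quality loss of an
% instance, e.g., aggregated over its vertices) are hypothetical placeholders for the paper's own notation.
\begin{equation}
  I^{*} = \operatorname*{arg\,min}_{I \in \mathcal{I}(G)} \mathit{qloss}(I)
\end{equation}

That is, a maximum-quality instance is a valid instance such that no other valid instance has a smaller loss, which matches the formulation proposed in the highlighted note.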