
Commit

Claudio
cardagna committed Mar 20, 2024
1 parent 1dadeec commit 6dce8d5
Showing 2 changed files with 7 additions and 4 deletions.
7 changes: 5 additions & 2 deletions metrics.tex
@@ -8,10 +8,10 @@ \section{Maximizing the Pipeline Instance Quality}\label{sec:heuristics}
%Inspired by existing literature, these metrics, categorized as quantitative and statistical, play a pivotal role in quantifying the impact of policy-driven transformations on the original dataset.

\subsection{Quality Metrics}\label{sec:metrics}
Ensuring data quality is mandatory to implement data pipelines that provide high-quality results and decision-making along the whole pipeline execution. To this aim, we define a set of metrics evaluating the quality loss introduced by our policy-driven transformation in Section~\cite{ADD} on the input dataset \origdataset at each step of the data pipeline. Our metrics can be classified as \emph{quantitative} and \emph{statistical}~\cite{ADD}, and compare the input dataset \origdataset\ and dataset \transdataset\ generated by enforcing data protection requirements on \origdataset.
Ensuring data quality is mandatory to implement data pipelines that provide high-quality results and support decision-making along the whole pipeline execution. To this aim, we define two metrics evaluating the quality loss introduced by our policy-driven transformation in Section~\cite{ADD} on the input dataset \origdataset\ at each step of the data pipeline. Our metrics can be classified as \emph{quantitative} and \emph{qualitative}~\cite{ADD}, and compare the input dataset \origdataset\ with the dataset \transdataset\ generated by enforcing data protection requirements on \origdataset.

Quantitative metrics monitor the amount of data lost during data transformations, measured as the difference in quality between datasets \origdataset\ and \transdataset.
Statistical metrics take into consideration the changes in the statistical properties of datasets \origdataset\ and \transdataset. We note that these metrics can be applied either to the entire dataset or to specific features only. The features can be assigned with equal or varying importance, enabling the prioritization of important features that might be possibly lost during the policy-driven transformation in Section~\cite{ADD}.
Qualitative metrics consider the changes in the properties of datasets \origdataset\ and \transdataset; for instance, they can measure the changes in the statistical distributions of the two datasets.

\subsubsection{Jaccard coefficient}
The Jaccard coefficient can be used to measure the overlap between the elements of two datasets and, in turn, the difference between them.
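For reference, a standard formulation of the coefficient (not necessarily the exact one adopted in the lines omitted below), treating the two datasets as sets $X$ and $Y$ of elements, is
\[
J(X,Y)=\frac{|X \cap Y|}{|X \cup Y|}, \qquad 0 \leq J(X,Y) \leq 1,
\]
so that the quality loss introduced by the transformation can be quantified by the complementary Jaccard distance $1-J(X,Y)$.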
@@ -39,6 +39,9 @@ \subsubsection{Jensen-Shannon Divergence}
%
It provides a more comprehensive understanding of the dissimilarity between X and Y, taking into account the characteristics of both datasets.
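For reference, a standard formulation of the divergence, with $P$ and $Q$ denoting the probability distributions estimated from $X$ and $Y$, is
\[
JSD(P \,\|\, Q)=\frac{1}{2}\,KL(P \,\|\, M)+\frac{1}{2}\,KL(Q \,\|\, M), \qquad M=\frac{1}{2}(P+Q),
\]
where $KL$ denotes the Kullback--Leibler divergence; $JSD$ is symmetric and, using base-2 logarithms, bounded in $[0,1]$.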

\vspace{0.5em}

We note that our metrics can be applied either to the entire dataset or to specific features only. Features can be assigned equal or varying importance, enabling the prioritization of important features that might be lost during the policy-driven transformation in Section~\cite{ADD}. A complete taxonomy of possible metrics is outside the scope of this paper and is left to future work.
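As an illustration of feature weighting (a sketch only; the symbols $M_j$, $w_j$, and $f_j$ are not used elsewhere in the paper), a per-feature metric $M_j$ computed on feature $f_j$ of \origdataset\ and \transdataset\ can be aggregated as
\[
M(\origdataset,\transdataset)=\sum_{j} w_j\, M_j(\origdataset,\transdataset), \qquad w_j \geq 0, \quad \sum_{j} w_j = 1,
\]
where larger weights $w_j$ prioritize the features whose degradation is most critical for the downstream services.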

\subsection{NP-Hardness of the Max Quality Pipeline Instantiation Process}\label{sec:nphard}
\hl{what if we define it formally as the problem of finding a valid instance, according to the definition of instance, such that no valid instance with a smaller loss exists?}
4 changes: 2 additions & 2 deletions system_model.tex
@@ -34,10 +34,10 @@ \subsection{Service Pipeline and Reference Scenario}\label{sec:service_definitio
\begin{definition}[\pipeline]\label{def:pipeline}
% A \pipeline is as a direct acyclic graph G(\V,\E), where \V\ is a set of vertices and \E\ is a set of edges connecting two vertices \vi{i},\vi{k}$\in$\V. The graph has a root \vi{r}$\in$\V, a vertex \vi{i}$\in$\V$_S$ for each service $s_i$, two additional vertices \vi{c},\vi{m}$\in$\V$_{\timesOperator}$$\subset$\V\ for each alternative ($\timesOperator$) structure modeling the alternative execution (\emph{choice}) of operations and the retrieval (\emph{merge}) of the results, respectively, and one additional vertex \vi{f} $\in$\V$_{\plusOperator}$$\subset$\V\ for each parallel ($\plusOperator$) structure modeling the contemporary execution (\emph{fork}) of operations.
A \pipeline is a directed acyclic graph G(\V,\E), where \V\ is a set of vertices and \E\ is a set of edges connecting two vertices \vi{i},\vi{k}$\in$\V.
The graph has a root \vi{r}$\in$\V, a vertex \vi{i}$\in$\V$_S$ for each service $s_i$, two additional vertices \vi{c},\vi{f}$\subset$\V\ for each alternative ($\timesOperator$) structure modeling the alternative execution (\emph{choice}) and for each parallel ($\plusOperator$) structure modeling the contemporary execution (\emph{fork}) of services, respectively.
The graph has a root \vi{r}$\in$\V, a vertex \vi{i}$\in$\V$_S$ for each service $s_i$, and an additional vertex \vi{f}$\in$\V\ for each parallel ($\plusOperator$) structure modeling the concurrent execution (\emph{fork}) of services.
\end{definition}

We note that \{\vi{r},\vi{c},\vi{f}\}$\cup$\V$_S$$=$\V, vertices \vi{c} and \vi{f} model branching for alternative/parallel structures, and root \vi{r} possibly represents the orchestrator.
We note that \{\vi{r},\vi{f}\}$\cup$\V$_S$$=$\V, vertices \vi{f} model branching for parallel structures, and root \vi{r} possibly represents the orchestrator. We also note that, for simplicity but without loss of generality, alternative structures modeling the alternative execution of services are not specified within a single service pipeline, but rather modeled as alternative service pipelines.
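As an illustration of Definition~\ref{def:pipeline} (with illustrative services and vertices not taken from the reference scenario), a pipeline in which service $s_1$ is followed by the parallel execution of services $s_2$ and $s_3$ is modeled by the graph with \V$=$\{\vi{r},\vi{1},\vi{f},\vi{2},\vi{3}\} and \E$=$\{(\vi{r},\vi{1}),(\vi{1},\vi{f}),(\vi{f},\vi{2}),(\vi{f},\vi{3})\}, where \vi{f} is the fork vertex branching toward \vi{2} and \vi{3}.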

% A service pipeline is as a direct acyclic graph G(\V,\E), where \V\ is a set of vertices, one for each service $s_i$ in the pipeline, \E\ is a set of edges connecting two services $s_i$ and $s_j$, and \myLambda\ is an annotation function that assigns a label \myLambda(\vi{i}), corresponding to a data transformation \F\ implemented by the service $s_i$, for each vertex \vi{i}$\in$\V.
Our reference scenario considers a service pipeline analyzing a dataset of individuals detained in Department of Correction facilities in the state of Connecticut while awaiting trial \cite{toadd}.
