
Commit

Claudio
cardagna committed Mar 22, 2024
1 parent 6dce8d5 commit 91c2b01
Showing 2 changed files with 18 additions and 21 deletions.
19 changes: 10 additions & 9 deletions metrics.tex
@@ -1,20 +1,21 @@
\section{Maximizing the Pipeline Instance Quality}\label{sec:heuristics}
%
% %Obviously, choosing the best service for each vertex is not sufficient; this becomes a complex problem where all possible combinations of the available services must be computed/evaluated, among which the best one is chosen.
Our goal is to generate a pipeline instance with maximum quality, which addresses data protection requirements with the minimum amount of information loss across the pipeline. To this aim, we first discuss the crucial role of well-defined metrics (\cref{sec:metrics}) to specify and measure data quality, and describe the ones used in the paper.
Then, we prove that the problem of generating a pipeline instance with maximum quality is NP-hard (\cref{sec:nphard}). Finally, we present a parametric heuristic (\cref{subsec:heuristics}) tailored to address the computational complexities associated with enumerating all possible combinations within a given set. The primary aim of the heuristic is to approximate the optimal path for service interactions and transformations, particularly within the landscape of more complex pipelines composed of numerous nodes and candidate services.
Our focus extends beyond identifying optimal combinations, encompassing an understanding of the quality changes introduced during the transformation processes.
Our goal is to generate a pipeline instance with maximum quality, which addresses data protection requirements with the minimum amount of information loss across the pipeline execution. To this aim, we first discuss the crucial role of well-defined metrics (\cref{sec:metrics}) to specify and measure data quality, and describe the ones used in the paper.
Then, we prove that the problem of generating a pipeline instance with maximum quality is NP-hard (\cref{sec:nphard}). Finally, we present a parametric heuristic (\cref{subsec:heuristics}) that addresses the computational complexity of enumerating all possible combinations within a given set. The heuristic approximates the optimal path for service interactions and transformations, in particular for complex pipelines composed of numerous nodes and candidate services. Beyond identifying optimal combinations, our focus includes understanding the quality changes introduced by the transformation processes.

%Inspired by existing literature, these metrics, categorized as quantitative and statistical, play a pivotal role in quantifying the impact of policy-driven transformations on the original dataset.

\subsection{Quality Metrics}\label{sec:metrics}
Ensuring data quality is mandatory to implement data pipelines that provide high-quality results and decision-making along the whole pipeline execution. To this aim, we define two metrics evaluating the quality loss introduced by our policy-driven transformation in Section~\cite{ADD} on the input dataset \origdataset\ at each step of the data pipeline. Our metrics can be classified as \emph{quantitative} and \emph{qualitative}~\cite{ADD}, and compare the input dataset \origdataset\ and dataset \transdataset\ generated by enforcing data protection requirements on \origdataset.
%Ensuring data quality is mandatory to implement data pipelines that provide accurate results and decision-making along the whole pipeline execution. To this aim, we define two metrics evaluating the quality loss introduced by our policy-driven transformation in Section~\cite{ADD} on the input dataset \origdataset at each step of the data pipeline. Our metrics can be classified as \emph{quantitative} and \emph{qualitative}~\cite{ADD}, and compare the input dataset \origdataset\ and dataset \transdataset\ generated by enforcing data protection requirements on \origdataset.
Ensuring data quality is mandatory to implement data pipelines that provide accurate results and decision-making along the whole pipeline execution. To this aim, quality metrics evaluate the quality loss introduced at each step of the data pipeline, and can be classified as \emph{quantitative} or \emph{qualitative}~\cite{ADD}.
Quantitative metrics monitor the amount of data lost during data transformations as the quality difference between datasets \origdataset\ and \transdataset.
Qualitative metrics evaluate changes in the properties of datasets \origdataset\ and \transdataset. For instance, qualitative metrics can measure the changes in the statistical distribution of the two datasets.

Quantitative metrics monitor the amount of data lost during data transformations as the difference in quality between datasets \origdataset\ and \transdataset.
Qualitative metrics take into consideration the changes in the properties of datasets \origdataset\ and \transdataset. For instance, qualitative metrics can measure the changes in the statistical distribution of the datasets.
In this paper, we provide two metrics, one quantitative and one qualitative, that compare the input dataset \origdataset\ and dataset \transdataset\ generated by enforcing data protection requirements (i.e., our policy-driven transformation in Section~\cite{ADD}) on \origdataset\ at each step of the data pipeline.

\subsubsection{Jaccard coefficient}
The Jaccard coefficient can be used to measure the similarity between the elements of two datasets.
The Jaccard coefficient is a quantitative metric that can be used to measure the similarity between the elements of two datasets.
It is defined as:\[J(X,Y) = \frac{|X \cap Y|}{|X \cup Y|}\]
where $X$ and $Y$ are two datasets of the same size.
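
As a minimal worked example (the two datasets below are hypothetical and chosen only for illustration), consider $X=\{a,b,c,d\}$ and $Y=\{a,b,c,e\}$, where a transformation replaced element $d$ with $e$. Then
\[J(X,Y) = \frac{|\{a,b,c\}|}{|\{a,b,c,d,e\}|} = \frac{3}{5} = 0.6\]
A value of $1$ indicates that the two datasets contain the same elements, while a value of $0$ indicates that they have no element in common.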

@@ -24,7 +25,7 @@ \subsubsection{Jaccard coefficient}


\subsubsection{Jensen-Shannon Divergence}
The Jensen-Shannon divergence (JSD) is a symmetrized version of the KL divergence~\cite{ADD} and can be used to measure the dissimilarity between the probability distributions of two datasets.
The Jensen-Shannon divergence (JSD) is a qualitative metric that can be used to measure the dissimilarity between the probability distributions of two datasets. It is a symmetrized version of the KL divergence~\cite{ADD}.

The JSD between X and Y is defined as:

@@ -41,7 +42,7 @@ \subsubsection{Jensen-Shannon Divergence}

\vspace{0.5em}
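
As an illustrative sketch, assume the standard formulation $\mathrm{JSD}(P\|Q)=\frac{1}{2}\mathrm{KL}(P\|M)+\frac{1}{2}\mathrm{KL}(Q\|M)$, with $M=\frac{1}{2}(P+Q)$ and base-2 logarithms, where $P$ and $Q$ are the probability distributions of the two datasets; the symbols $P$, $Q$, $M$ and the values below are assumptions made only for this example. For $P=(\frac{1}{2},\frac{1}{2})$ and $Q=(1,0)$, we have $M=(\frac{3}{4},\frac{1}{4})$ and
\[\mathrm{KL}(P\|M) = \tfrac{1}{2}\log_2\tfrac{2}{3} + \tfrac{1}{2}\log_2 2 \approx 0.2075\]
\[\mathrm{KL}(Q\|M) = \log_2\tfrac{4}{3} \approx 0.4150\]
\[\mathrm{JSD}(P\|Q) \approx \tfrac{1}{2}(0.2075 + 0.4150) \approx 0.3113\]
where the term of $Q$ with zero probability contributes $0$ by convention. Under these assumptions, the JSD ranges in $[0,1]$, with $0$ indicating identical distributions.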

We note that our metrics can be applied either to the entire dataset or to specific features only. The features can be assigned equal or varying importance, enabling the prioritization of important features that might be lost during the policy-driven transformation in Section~\cite{ADD}. A complete taxonomy of possible metrics is outside the scope of this paper and will be the target of future work.
We note that our metrics can be applied either to the entire dataset or to specific features only. The features can be assigned equal or varying importance, providing a weighted version of the metrics and thus enabling the prioritization of important features that might be lost during the policy-driven transformation in Section~\cite{ADD}. A complete taxonomy of possible metrics is, however, outside the scope of this paper and will be the target of our future work.
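
As an illustration only, and as an assumption rather than a definition adopted in this paper, a weighted version of a metric over $n$ features could be obtained as a convex combination of per-feature scores, where $m_k$ and $w_k$ are hypothetical symbols denoting the score and the importance of feature $k$:
\[m_w = \sum_{k=1}^{n} w_k\, m_k, \qquad w_k \geq 0, \qquad \sum_{k=1}^{n} w_k = 1\]
For instance, with $n=2$, weights $(0.75, 0.25)$, and per-feature scores $(0.6, 1.0)$, we obtain $m_w = 0.75 \cdot 0.6 + 0.25 \cdot 1.0 = 0.7$, so that the loss on the more important feature dominates the overall score.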

\subsection{NP-Hardness of the Max Quality Pipeline Instantiation Process}\label{sec:nphard}
\hl{what if we define it formally as the problem of finding a valid instance, according to the definition of instance, such that no other instance with a smaller loss exists?}
20 changes: 8 additions & 12 deletions system_model.tex
@@ -37,25 +37,26 @@ \subsection{Service Pipeline and Reference Scenario}\label{sec:service_definitio
The graph has a root \vi{r}$\in$\V, a vertex \vi{i}$\in$\V$_S$ for each service $s_i$, and an additional vertex \vi{f}$\in$\V\ for each parallel ($\plusOperator$) structure modeling the concurrent execution (\emph{fork}) of services.
\end{definition}

We note that \{\vi{r},\vi{f}\}$\cup$\V$_S$$=$\V, vertices \vi{f} model branching for parallel structures, and root \vi{r} possibly represents the orchestrator. We also note that, for simplicity but without loss of generality, alternative structures modeling the alternative execution of services are not specified in a single service pipeline, but rather modeled as alternative service pipelines.
We note that \{\vi{r},\vi{f}\}$\cup$\V$_S$$=$\V, vertices \vi{f} model branching for parallel structures, and root \vi{r} possibly represents the orchestrator. We also note that, for simplicity but without loss of generality, alternative structures modeling the alternative execution of services are specified as alternative service pipelines, that is, there is no alternative structure in a single service pipeline.

% A service pipeline is a directed acyclic graph G(\V,\E), where \V\ is a set of vertices, one for each service $s_i$ in the pipeline, \E\ is a set of edges connecting two services $s_i$ and $s_j$, and \myLambda\ is an annotation function that assigns a label \myLambda(\vi{i}), corresponding to a data transformation \F\ implemented by the service $s_i$, for each vertex \vi{i}$\in$\V.
Our reference scenario considers a service pipeline analyzing a dataset of individuals detained in Department of Correction facilities in the state of Connecticut while awaiting trial \cite{toadd}.
In particular, the user, a member of the Connecticut Department of Correction (DOC), seeks to compare admission trends in Connecticut prisons with DOCs in New York and New Hampshire. We assume DOCs to be partners and share data with relaxed privacy policies.
In particular, the user, a member of the Connecticut Department of Correction (DOC), seeks to compare admission trends in Connecticut prisons with DOCs in New York and New Hampshire. We assume DOCs to be partners and share data according to their privacy policies.
The user's preferences align with a predefined pipeline template that orchestrates the following sequence of operations:
\begin{enumerate*}[label=(\roman*)]
\item \emph{Data fetching}, including the download of the dataset from other states;
\item \emph{Data preparation}, including data merging, cleaning and anonymization;
\item \emph{Data preparation}, including data merging, cleaning, and anonymization;
% \hl{THIS IS MERGE (M). I THOUGHT IT WOULD BECOME A NODE $v_i$. IF SO, BY CHANGING DEFINITION 3.1, WHERE MERGE AND JOIN NODES NO LONGER EXIST.}
\item \emph{Data analysis}, including statistical measures like averages, medians, and clustering-based statistics;
\item \emph{Data analysis}, including statistical measures like average, median, and clustering-based statistics;
\item \emph{Machine learning task}, including training and inference;
\item \emph{Data storage}, including the storage of the results;
\item \emph{Data visualization}, including the visualization of the results.
\end{enumerate*}

We note that the template requires the execution of the entire service within the Connecticut Department of Correction.
If the data needs to be transmitted beyond the boundaries of Connecticut, data protection measures must be implemented.
A visual representation of the flow is presented in Figure \ref{fig:reference_scenario}.
We note that the template requires the execution of the entire service within the Connecticut Department of Correction. If the data needs to be transmitted beyond the boundaries of Connecticut, data protection measures must be implemented. A visual representation of the flow is presented in Figure \ref{fig:reference_scenario}.
%
\cref{tab:dataset} presents a sample of the adopted dataset.\footnote{https://data.ct.gov/Public-Safety/Accused-Pre-Trial-Inmates-in-Correctional-Faciliti/b674-jy6w} Each row represents an inmate; the columns report the following attributes: date of download, a unique identifier, last entry date, race, gender, age of the individual, the bond value, offense, entry facility, and detainer. To serve the objectives of our study, we have extended this dataset by introducing randomly generated first and last names.

\begin{table*}[ht!]
\caption{Dataset sample}
\label{tab:dataset}
@@ -140,11 +141,6 @@ \subsection{Service Pipeline and Reference Scenario}\label{sec:service_definitio
\label{fig:reference_scenario}
\end{figure}

\cref{tab:dataset} presents a sample of the adopted dataset.\footnote{https://data.ct.gov/Public-Safety/Accused-Pre-Trial-Inmates-in-Correctional-Faciliti/b674-jy6w}
Each row represents an inmate; the columns report the following attributes: date of download, a unique identifier, last entry date, race, gender, age of the individual, the bond value, offense, entry facility, and detainer.
To serve the objectives of our study, we have extended this dataset by introducing randomly generated first and last names.


% We download three datasets, no anonymization, merge node, anonymize and clean everything,
%alternative nodes ML and analysis, merge, storage, visualization
%add final node