diff --git a/macro.tex b/macro.tex index 9cbd271..309751b 100644 --- a/macro.tex +++ b/macro.tex @@ -63,8 +63,8 @@ \newcommand{\pipeline}{Pipeline\xspace} \newcommand{\pipelineTemplate}{Pipeline Template\xspace} \newcommand{\pipelineInstance}{Pipeline Instance\xspace} -\newcommand{\quality}{quality\xspace} -\newcommand{\Quality}{Quality\xspace} +\newcommand{\quality}{\textit{quality}\xspace} +\newcommand{\Quality}{\textit{Quality}\xspace} \newcommand{\q}{$q$\xspace} \newcommand{\pone}{$(service\_owner=dataset\_owner)$} \newcommand{\ptwo}{$(service\_owner=partner(dataset\_owner))$} diff --git a/metrics.tex b/metrics.tex index 8df0499..b3aa3ce 100644 --- a/metrics.tex +++ b/metrics.tex @@ -37,23 +37,24 @@ \subsubsection{Qualitative Metric} where X and Y are two distributions of the same size, and M$=$0.5*(X+Y) is the average distribution. JSD incorporates both the KL divergence from X to M and from Y to M. -To make JSD applicable to datasets, where each feature in the dataset has its own statistical distribution, metric $M_{JDS}$ applies JSD to each column of the dataset. The obtained results are then aggregated using a weighted average, thus enabling the prioritization of important features that can be lost during the policy-driven transformation in \cref{sec:heuristics}, as follows: \[M_{JDS} = \sum_{i=1}^n w_i \cdot \text{JSD}(x_i,y_i)\] +To make JSD applicable to datasets, where each feature in the dataset has its own statistical distribution, metric $M_{JDS}$ applies JSD to each column of the dataset. 
The obtained results are then aggregated using a weighted average, thus enabling the prioritization of important features that can be lost during the policy-driven transformation in \cref{sec:heuristics}, as follows: \[M_{JDS} = 1 - \sum_{i=1}^n w_i \cdot \text{JSD}(x_i,y_i)\] where \(w_i = \frac{n_i}{N}\) represents the weight for the \(i\)-th column, with \(n_i\) being the number of distinct elements in the $i$-th feature and \(N\) the total number of elements in the dataset. Each \(\text{JSD}(x_i,y_i)\) accounts for the Jensen-Shannon Divergence computed for the \(i\)-th feature in datasets X and Y. +We note that the one-minus term has been added to the formula to transform the metric into a similarity metric, where 1 indicates complete similarity and 0 indicates no similarity. -$M_{JDS}$ provides a weighted measure of dissimilarity, which is symmetric and accounts for the contribution from both datasets and specific features. It can compare the dissimilarity of the two datasets, providing a symmetric and normalized measure that considers the overall data distributions. +$M_{JDS}$ provides a weighted measure of similarity, which is symmetric and accounts for the contribution from both datasets and specific features. It can compare the similarity of the two datasets, providing a symmetric and normalized measure that considers the overall data distributions. -\subsubsection{Information Loss} +\subsubsection{\Quality (\q) Definition} %We note that our metrics can be applied either to the entire dataset or to specific features only. The features can be assigned with equal or varying importance, providing a weighted version of the metrics, thus enabling the prioritization of important features that might be possibly lost during the policy-driven transformation in Section~\cite{ADD}. -Metrics $M_J$ and $M_{JDS}$ contribute to the calculation of the information loss \textit{dloss} throughout the pipeline execution as follows. 
%Information loss is calculated as the average \emph{AVG} of data at each vertex \vi{i}$\in$$\V_S$ of the service pipeline $G(V,E)$ as follows. +Metrics $M_J$ and $M_{JDS}$ contribute to the calculation of the information \quality \textit{\q} throughout the pipeline execution as follows. %Information loss is calculated as the average \emph{AVG} of data at each vertex \vi{i}$\in$$\V_S$ of the service pipeline $G(V,E)$ as follows. -\begin{definition}[\emph{dloss}] - Given a metrics M$\in$$\{M_J,M_{JDS}$\} modeling the data quality, information loss \textit{dloss} is calculated as 1$-$\emph{AVG}($M_{ij}$), with $M_{ij}$ the value of the quality metric retrieved at each vertex \vii{i}$\in$$\V'_S$ of the pipeline instance $G'$ according to service \sii{j}. +\begin{definition}[\emph{\quality}] + Given a metric M$\in$$\{M_J,M_{JDS}\}$ modeling the data quality, \quality is calculated as \emph{AVG}($M_{ij}$), with $M_{ij}$ the value of the quality metric retrieved at each vertex \vii{i}$\in$$\V'_S$ of the pipeline instance $G'$ according to service \sii{j}. \end{definition} -We note that \emph{AVG}($M_{ij}$) models the average data quality preserved within the pipeline instance $G'$ -We also note that \textit{dloss}$_{ij}$$=$1$-$$M_i$ models the quality loss at vertex \vii{i}$\in$$\V'_S$ of $G'$ for service \sii{j}. +We note that \emph{AVG}($M_{ij}$) models the average data quality preserved within the pipeline instance $G'$. +We also note that $q_{ij}$$=$$M_{ij}$ models the \quality at vertex \vii{i}$\in$$\V'_S$ of $G'$ for service \sii{j}. %We also note that information loss \textit{dloss} is used to generate the Max-Quality pipeline instance in the remaining of this section. 
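For concreteness, the metric $M_{JDS}$ and the \quality aggregation above can be sketched in a few lines of Python. This is a minimal sketch under our own assumptions: function names, list-based distributions, and caller-supplied weights $w_i$ are illustrative, not part of the paper's implementation.

```python
import math

def jsd(x, y):
    # Jensen-Shannon divergence (base 2) between two distributions of the
    # same size; M = 0.5*(X+Y) is the average distribution, as in the text.
    m = [0.5 * (xi + yi) for xi, yi in zip(x, y)]
    kl = lambda p, q: sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return 0.5 * kl(x, m) + 0.5 * kl(y, m)

def m_jds(columns_x, columns_y, weights):
    # M_JDS = 1 - sum_i w_i * JSD(x_i, y_i): 1 = identical, 0 = no similarity.
    return 1 - sum(w * jsd(x, y) for w, x, y in zip(weights, columns_x, columns_y))

def quality(m_values):
    # q = AVG(M_ij) over the instantiated vertices of the pipeline instance G'.
    return sum(m_values) / len(m_values)
```

With base-2 logarithms, JSD lies in $[0,1]$, so $M_{JDS}$ stays in $[0,1]$ whenever the weights sum to at most 1.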
\subsection{NP-Hardness of the Max-Quality Pipeline Instantiation Problem}\label{sec:nphard} @@ -64,7 +65,7 @@ \subsection{NP-Hardness of the Max-Quality Pipeline Instantiation Problem}\label Given a pipeline template $G^{\myLambda,\myGamma}$ and a set $S^c$ of candidate services, find a max-quality pipeline instance $G'$ such that: \begin{itemize} \item $G'$ satisfies conditions in \cref{def:instance}, - \item $\nexists$ a pipeline instance $G''$ that satisfies conditions in \cref{def:instance} and such that information loss \textit{dloss}($G''$)$<$\textit{dloss}($G'$), where \textit{dloss}($\cdot$) is the information loss throughout the pipeline execution. + \item $\nexists$ a pipeline instance $G''$ that satisfies conditions in \cref{def:instance} and such that \quality \textit{\q}($G''$)$>$\textit{\q}($G'$), where \textit{\q}($\cdot$) is the \quality throughout the pipeline execution. %computed after applying the transformation of the policy matching the service selected to instantiate vertex \vi{i}$\in$$\V_S$, . \end{itemize} \end{definition} @@ -77,13 +78,13 @@ \subsection{NP-Hardness of the Max-Quality Pipeline Instantiation Problem}\label \emph{Proof: } The proof is a reduction from the multiple-choice knapsack problem (MCKP), a classic NP-hard combinatorial optimization problem, which is a generalization of the simple knapsack problem (KP) \cite{}. In the MCKP problem, there are $t$ mutually disjoint classes $N_1,N_2,\ldots,N_t$ of items to pack in some knapsack of capacity $C$, class $N_i$ having size $n_i$. Each item $j$$\in$$N_i$ has a profit $p_{ij}$ and a weight $w_{ij}$; the problem is to choose one item from each class such that the profit sum is maximized without the weight sum exceeding $C$. -The MCKP can be reduced to the Max quality \problem in polynomial time, with $N_1,N_2,\ldots,N_t$ corresponding to $S^c_{1}, S^c_{1}, \ldots, S^c_{u},$, $t$$=$$u$ and $n_i$ the size of $S^c_{i}$. 
The profit $p_{ij}$ of item $j$$\in$$N_i$ corresponds to \textit{dloss}$_{ij}$ computed for each candidate service $s_j$$\in$$S^c_{i}$, while $w_{ij}$ is uniformly 1 (thus, C is always equal to the cardinality of $V_C$). +The MCKP can be reduced to the Max quality \problem in polynomial time, with $N_1,N_2,\ldots,N_t$ corresponding to $S^c_{1}, S^c_{2}, \ldots, S^c_{u}$, $t$$=$$u$ and $n_i$ the size of $S^c_{i}$. The profit $p_{ij}$ of item $j$$\in$$N_i$ corresponds to \textit{\q}$_{ij}$ computed for each candidate service $s_j$$\in$$S^c_{i}$, while $w_{ij}$ is uniformly 1 (thus, $C$ is always equal to the cardinality of $V_C$). Since the reduction can be done in polynomial time, our problem is also NP-hard. (this is not sufficient; we must also prove that a solution to one problem is a solution to the other) \begin{example}[Max-Quality Pipeline Instance] - Let us start from \cref{ex:instance} and extend it with the comparison algorithm in \cref{sec:instance} built on \emph{dloss}. The comparison algorithm is applied to the set of services $S'_*$ and returns three service rankings one for each vertex \vi{4}, \vi{5}, \vi{6} according to the amount of data anonymized. + Let us start from \cref{ex:instance} and extend it with the comparison algorithm in \cref{sec:instance} built on \quality. The comparison algorithm is applied to the set of services $S'_*$ and returns three service rankings, one for each vertex \vi{4}, \vi{5}, \vi{6}, according to the amount of data anonymized. The ranking is listed in \cref{tab:instance_example_maxquality}(b) and based on the transformation function in the policies. We assume that the more restrictive the transformation function (i.e., it anonymizes more data), the lower the service position in the ranking. For example, \s{11} is ranked first because it anonymizes less data than \s{12} and \s{13}. The ranking of \s{22} and \s{23} is based on the same logic. 
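Under the MCKP view of the proof (one item per class, unit weights, capacity equal to the number of vertices, profit $p_{ij}$$=$$q_{ij}$), a brute-force solver makes the correspondence concrete. This is an illustrative sketch only; the function and data names are our own assumptions.

```python
from itertools import product

def max_quality_instance(candidates):
    # candidates: one class per vertex to instantiate, i.e. a list of the
    # sets S^c_i, each holding (service, q_ij) pairs. With unit weights and
    # capacity equal to the number of vertices, the only MCKP constraint
    # left is "pick exactly one item per class"; maximize the profit sum.
    best = max(product(*candidates), key=lambda pick: sum(q for _, q in pick))
    return [s for s, _ in best], sum(q for _, q in best)
```

Being exhaustive, this enumerates $\prod_i n_i$ combinations; that blow-up is exactly what motivates the heuristic in \cref{subsec:heuristics}.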
@@ -103,9 +104,9 @@ \subsection{Heuristic}\label{subsec:heuristics} %The exhaustive exploration of such combinations swiftly becomes impractical in terms of computational time and resources, particularly when dealing with the analysis of complex pipelines. %In response to this computational complexity, the incorporation of heuristic emerges as a strategy to try to efficiently address the problem. %\hl{I REVISED THE PARAGRAPH QUICKLY JUST TO GIVE AN INDICATION. WE MUST USE THE FORMALIZATION AND PERHAPS ALSO FORMALIZE THE PSEUDOCODE.} -We design and implement a heuristic algorithm for computing the pipeline instance maximizing data quality. Our heuristic is built on a \emph{sliding window} and aims to minimize information loss \emph{dloss} according to quality metrics. At each step, a set of vertices in the pipeline template $\tChartFunction$ is selected according to a specific window size \windowsize, that select a subset of the pipeline template starting at depth $i$ and ending at depth \windowsize+i-1. Service filtering and selection in \cref{sec:instance} are then executed to minimize \emph{dloss} in window $w$. The heuristic returns as output the list of services instantiating all vertices at depth $i$. The sliding window $w$ is then shifted by 1 (i.e., $i$=$i$+1) and the filtering and selection process executed until \windowsize+i-1 is equal to length $l$ (max depth) of $\tChartFunction$, that is, the sliding window reaches the end of the template. +We design and implement a heuristic algorithm for computing the pipeline instance maximizing data quality. Our heuristic is built on a \emph{sliding window} and aims to maximize information \quality \emph{\q} according to quality metrics. At each step, a set of vertices in the pipeline template $\tChartFunction$ is selected according to a specific window size \windowsize, which selects a subset of the pipeline template starting at depth $i$ and ending at depth \windowsize+i-1. 
Service filtering and selection in \cref{sec:instance} are then executed to maximize \emph{\quality} in window $w$. The heuristic returns as output the list of services instantiating all vertices at depth $i$. The sliding window $w$ is then shifted by 1 (i.e., $i$=$i$+1) and the filtering and selection process is executed until \windowsize+i-1 is equal to length $l$ (max depth) of $\tChartFunction$, that is, the sliding window reaches the end of the template. %For example, in our service selection problem where the quantity of information lost needs to be minimized, the sliding window algorithm can be used to select services composition that have the lowest information loss within a fixed-size window. -This strategy ensures that only services with low information loss are selected at each step, minimizing the information loss \emph{dloss}. The pseudocode of the heuristic algorithm is presented in \cref{lst:slidingwindowfirstservice}. +This strategy ensures that only services with high \quality are selected at each step, maximizing the information \quality \emph{\q}. The pseudocode of the heuristic algorithm is presented in \cref{lst:slidingwindowfirstservice}. \lstset{ % backgroundcolor=\color{white}, % choose the background color; you must add \usepackage{color} or \usepackage{xcolor} diff --git a/pipeline_instance.tex b/pipeline_instance.tex index c90042a..235ec6d 100644 --- a/pipeline_instance.tex +++ b/pipeline_instance.tex @@ -26,7 +26,7 @@ \section{Pipeline Instance}\label{sec:instance} \begin{enumerate} \item \textit{Filtering Algorithm} -- The filtering algorithm checks whether profile \profile$_j$ of each candidate service $\si{j}$$\in$$S^c_{i}$ satisfies at least one policy in \P{i}. If yes, service $\si{j}$ is compatible, otherwise it is discarded. The filtering algorithm finally returns a subset $S'_{i}$$\subseteq$$S^c_{i}$ of compatible services for each vertex \vi{i}$\in$$\V_S$. 
- \item \textit{Selection Algorithm} -- The selection algorithm selects one service $s'_i$ for each set $S'_{i}$ of compatible services and instantiates the corresponding vertex $\vii{i}$$\in$$\Vp$ with it. There are many ways of choosing $s'_i$, we present our approach based on the minimization of information loss \emph{dloss} in Section \ref{sec:heuristics}. + \item \textit{Selection Algorithm} -- The selection algorithm selects one service $s'_i$ for each set $S'_{i}$ of compatible services and instantiates the corresponding vertex $\vii{i}$$\in$$\Vp$ with it. There are many ways of choosing $s'_i$; we present our approach based on the maximization of information \quality \emph{\q} in Section \ref{sec:heuristics}. \end{enumerate} When all vertices $\vi{i}$$\in$$V$ in $G^{\myLambda,\myGamma}$ have been visited, the \pipelineInstance G' is finalized, with a service instance $s'_i$ for each \vii{i}$\in$\Vp. Vertex \vii{i} is still annotated with policies in \P{i} according to \myLambda, because policies in \P{i} are evaluated and enforced only when the pipeline instance is triggered, before any service is executed. In case policy evaluation returns \emph{true}, data transformation \TP$\in$\P{i} is applied, otherwise a default transformation that removes all data is applied.
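As a companion to the pseudocode referenced in \cref{lst:slidingwindowfirstservice}, the sliding-window heuristic of \cref{subsec:heuristics} can be sketched as follows. This is a reading of the textual description under our own assumptions: the names, the caller-supplied quality function over a window of service choices, and the commit-all policy at the final window are illustrative, not the paper's code.

```python
from itertools import product

def sliding_window_heuristic(template, window_size, window_quality):
    # template: compatible services per depth of the pipeline template
    #           (i.e. the sets S'_i returned by the filtering algorithm).
    # window_quality: maps a tuple of services (one per depth in the
    #           current window) to its aggregated quality q.
    instance, depth = [], len(template)
    for i in range(depth - window_size + 1):
        window = template[i:i + window_size]
        best = max(product(*window), key=window_quality)
        if i < depth - window_size:
            instance.append(best[0])   # commit depth i only, then slide by 1
        else:
            instance.extend(best)      # last window: commit remaining depths
    return instance
```

Each step explores only $\prod_{k=i}^{i+\windowsize-1} |S'_k|$ combinations instead of the full product over all depths, trading optimality for tractability.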