diff --git a/metrics.tex b/metrics.tex
index 34f2414..24c7a02 100644
--- a/metrics.tex
+++ b/metrics.tex
@@ -11,7 +11,7 @@ \subsection{Quality Metrics}\label{subsec:metrics}
 Quantitative metrics monitor the amount of data lost during data transformations to model the quality difference between datasets \origdataset\ and \transdataset. Qualitative metrics evaluate changes in the properties of datasets \origdataset\ and \transdataset. For instance, qualitative metrics can measure the changes in the statistical distribution of the two datasets.
-In this paper, we use two metrics, one quantitative and one qualitative, to compare the input dataset \origdataset\ and dataset \transdataset\ generated by enforcing data protection requirements (i.e., our policy-driven transformation in Section~\ref{sec:instance}) on \origdataset\ at each step of the data pipeline. We note that a complete taxonomy of possible metrics is outside the scope of this paper and will be the target of our future work.
+In this paper, we use two metrics, one quantitative and one qualitative, to compare the input dataset \origdataset\ and the dataset \transdataset\ generated by enforcing data protection requirements (i.e., our policy-driven transformation in \cref{sec:instance}) on \origdataset\ at each step of the data pipeline. We note that a complete taxonomy of possible metrics is outside the scope of this paper and will be the target of our future work.
 \subsubsection{Quantitative metric}
 %We propose a metric that measures the similarity between two datasets; for this purpose, we use the Jaccard coefficient.
@@ -27,48 +27,48 @@ \subsubsection{Quantitative metric}
 %The weighted Jaccard coefficient can then account for element importance and provide a more accurate measure of similarity.
 \subsubsection{Qualitative Metric}
-%We propose a metric that enables the measurement of the distance of two distributions.
+%We propose a metric that enables the measurement of the distance of two distributions.
 We propose a qualitative metric $M_{JSD}$ based on the Jensen-Shannon Divergence (JSD) that measures the dissimilarity (distance) between the probability distributions of two datasets.
 JSD is a symmetrized version of the KL divergence~\cite{Fuglede} and is applicable to a pair of statistical distributions only. It is defined as follows: \[JSD(X, Y) = \frac{1}{2} \left( KL(X \| M) + KL(Y \| M) \right)\]
 where X and Y are two distributions of the same size, and M$=$0.5$\cdot$(X$+$Y) is their average (mixture) distribution.
-JSD incorporates both the KL divergence from X to M and from Y to M.
+JSD incorporates both the KL divergence from X to M and from Y to M.
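+For concreteness, the JSD computation can be sketched in a few lines of Python. The sketch below is only illustrative (the function names are ours); it uses base-2 logarithms, so that JSD is normalized in $[0,1]$:
+\begin{lstlisting}[frame=single, language=Python, caption={Illustrative sketch of the JSD computation}]
+import numpy as np
+
+def kl(p, q):
+    # Kullback-Leibler divergence KL(p || q) between two discrete
+    # distributions; terms with p_i = 0 contribute nothing.
+    p, q = np.asarray(p, float), np.asarray(q, float)
+    mask = p > 0
+    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))
+
+def jsd(x, y):
+    # Jensen-Shannon Divergence: average KL divergence of x and y
+    # from their mixture m = 0.5*(x + y).
+    x, y = np.asarray(x, float), np.asarray(y, float)
+    m = 0.5 * (x + y)
+    return 0.5 * kl(x, m) + 0.5 * kl(y, m)
+
+# A binary feature before and after a transformation:
+print(jsd([0.5, 0.5], [0.9, 0.1]))  # ~0.147
+\end{lstlisting}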
-To make JSD applicable to datasets, where each feature in the dataset has its own statistical distribution, metric $M_{JDS}$ applies JSD to each column of the dataset. The obtained results are then aggregated using a weighted average, thus enabling the prioritization of important features that can be lost during the policy-driven transformation in Section~\ref{sec:heuristics}, as follows: \[M_{JDS} = \sum_{i=1}^n w_i \cdot \text{JSD}(x_i,y_i)\]
+To make JSD applicable to datasets, where each feature in the dataset has its own statistical distribution, metric $M_{JSD}$ applies JSD to each column of the dataset. The obtained results are then aggregated using a weighted average, thus enabling the prioritization of important features that can be lost during the policy-driven transformation in \cref{sec:heuristics}, as follows: \[M_{JSD} = \sum_{i=1}^n w_i \cdot \text{JSD}(x_i,y_i)\]
 where \(w_i = \frac{n_i}{N}\) is the weight of the \(i\)-th column, with \(n_i\) the number of distinct elements in the $i$-th feature and \(N\) the total number of elements in the dataset. Each \(\text{JSD}(x_i,y_i)\) is the Jensen-Shannon Divergence computed for the \(i\)-th feature of datasets X and Y.
 $M_{JSD}$ thus provides a symmetric and normalized measure of dissimilarity that accounts for the contribution of both datasets and weighs the contribution of specific features.
\subsubsection{Information Loss}
-%We note that our metrics can be applied either to the entire dataset or to specific features only. The features can be assigned with equal or varying importance, providing a weighted version of the metrics, thus enabling the prioritization of important features that might be possibly lost during the policy-driven transformation in Section~\cite{ADD}.
+%We note that our metrics can be applied either to the entire dataset or to specific features only. The features can be assigned with equal or varying importance, providing a weighted version of the metrics, thus enabling the prioritization of important features that might be possibly lost during the policy-driven transformation in Section~\cite{ADD}.
-Metrics $M_J$ and $M_{JDS}$ contribute to the calculation of the information loss \textit{dloss} throughout the pipeline execution. It is calculated as the average \emph{AVG} of the information loss at each vertex \vi{i}$\in$$\V_S$ of the service pipeline $G(V,E)$ as follows.
+Metrics $M_J$ and $M_{JSD}$ contribute to the calculation of the information loss \textit{dloss} throughout the pipeline execution. It is calculated as the average \emph{AVG} of the information loss at each vertex \vi{i}$\in$$\V_S$ of the service pipeline $G(V,E)$, as follows.
\begin{definition}[\emph{dloss}]
- Given a metrics M$\in$$\{M_J,M_{JDS}$\}, information loss \textit{dloss} is calculated as 1$-$\emph{AVG}($M_ij$), with $M_{ij}$ the value of the quality metric retrieved at each vertex \vi{i}$\in$$\V_S$ of the service pipeline $G(V,E)$ according to service \si{j}.
+ Given a metric M$\in$$\{M_J,M_{JSD}\}$, information loss \textit{dloss} is calculated as 1$-$\emph{AVG}($M_{ij}$), with $M_{ij}$ the value of the quality metric retrieved at each vertex \vi{i}$\in$$\V_S$ of the service pipeline $G(V,E)$ according to service \si{j}.
\end{definition}
-We note that \textit{dloss}$_{ij}$$=$1$-$$M_i$ models the quality loss at vertex \vi{i}$\in$$\V_S$ of the service pipeline $G(V,E)$ for service \si{j}.
+We note that \textit{dloss}$_{ij}$$=$1$-$$M_{ij}$ models the quality loss at vertex \vi{i}$\in$$\V_S$ of the service pipeline $G(V,E)$ for service \si{j}.
 %We also note that information loss \textit{dloss} is used to generate the Max-Quality pipeline instance in the remainder of this section.
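+As an illustration of how $M_{JSD}$ and \textit{dloss} could be computed in practice, consider the following Python sketch (the helper names are ours and purely illustrative; jsd() is the function sketched above):
+\begin{lstlisting}[frame=single, language=Python, caption={Illustrative computation of $M_{JSD}$ and \textit{dloss}}]
+from collections import Counter
+
+def feature_distributions(col_x, col_y):
+    # Empirical distributions of one feature over the union of the
+    # values observed in the two datasets (so both have the same size).
+    support = sorted(set(col_x) | set(col_y))
+    cx, cy = Counter(col_x), Counter(col_y)
+    px = [cx[v] / len(col_x) for v in support]
+    py = [cy[v] / len(col_y) for v in support]
+    return px, py
+
+def m_jsd(X, Y):
+    # X, Y: datasets given as lists of columns (features).
+    N = sum(len(col) for col in X)   # total elements in the dataset
+    total = 0.0
+    for col_x, col_y in zip(X, Y):
+        n_i = len(set(col_x))        # distinct elements of feature i
+        px, py = feature_distributions(col_x, col_y)
+        total += (n_i / N) * jsd(px, py)
+    return total
+
+def dloss(metric_values):
+    # Information loss: 1 - AVG(M_ij) over the vertices of the pipeline.
+    return 1 - sum(metric_values) / len(metric_values)
+\end{lstlisting}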
\subsection{NP-Hardness of the Max-Quality Pipeline Instantiation Problem}\label{sec:nphard}
%\hl{what if we define it formally as the problem of finding a valid instance, according to the definition of instance, such that no valid instance with a smaller loss exists?}
-The problem of computing a pipeline instance (Definition~\ref{def:instance}) with maximum quality (minimum information loss) can be formally defined as follows.
+The problem of computing a pipeline instance (\cref{def:instance}) with maximum quality (minimum information loss) can be formally defined as follows.
\begin{definition}[Max-Quality Problem]\label{def:MaXQualityInstance}
- Given a pipeline template $G^{\myLambda,\myGamma}$ and a set $S^c$ of candidate services, find a max-quality pipeline instance $G'$ such that:
+ Given a pipeline template $G^{\myLambda,\myGamma}$ and a set $S^c$ of candidate services, find a max-quality pipeline instance $G'$ such that:
 \begin{itemize}
- \item $G'$ satisfies conditions in Definition~\ref{def:instance},
- \item $\nexists$ a pipeline instance $G''$ that satisfies conditions in Definition~\ref{def:instance} and such that information loss \textit{dtloss}($G''$)$<$\textit{dtloss}($G'$), where \textit{dtloss}($\cdot$) is the information loss throughout the pipeline execution.
+ \item $G'$ satisfies the conditions in \cref{def:instance};
+ \item $\nexists$ a pipeline instance $G''$ that satisfies the conditions in \cref{def:instance} and such that information loss \textit{dloss}($G''$)$<$\textit{dloss}($G'$), where \textit{dloss}($\cdot$) is the information loss throughout the pipeline execution.
 %computed after applying the transformation of the policy matching the service selected to instantiate vertex \vi{i}$\in$$\V_S$, .
 \end{itemize}
\end{definition}
-The Max Quality \problem is a combinatorial selection problem and is NP-hard, as stated by Theorem \ref{theorem:NP}. However, while the overall problem is NP-hard, there is a component of the problem that is solvable in polynomial time: matching the profile of each service with the corresponding vertex policy. This can be done by iterating over each vertex and each service, checking if the service matches the vertex policy. This process would take $O(|N|*|S|)$ time. This is polynomial time complexity.
+The Max-Quality \problem is a combinatorial selection problem and is NP-hard, as stated in \cref{theorem:NP}. One component of the problem is however solvable in polynomial time: matching the profile of each service with the corresponding vertex policy. This can be done by iterating over each vertex and each service, checking whether the service profile matches the vertex policy, which takes $O(|N|\cdot|S|)$ time, with $|N|$ the number of vertices and $|S|$ the number of candidate services.
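+As an illustration (not part of our formal model), this polynomial-time filtering step can be sketched in Python as follows, where the predicate matches() abstracts the policy evaluation of \cref{sec:instance}:
+\begin{lstlisting}[frame=single, language=Python, caption={Illustrative polynomial-time service filtering}]
+def match_candidates(vertices, services, matches):
+    # For each vertex, keep the services whose profile satisfies the
+    # vertex policy: O(|N| * |S|) evaluations of the predicate.
+    return {v: [s for s in services if matches(s, v)]
+            for v in vertices}
+\end{lstlisting}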
\begin{theorem}\label{theorem:NP} The Max-Quality \problem is NP-Hard.
@@ -82,7 +82,7 @@ \subsection{NP-Hardness of the Max-Quality Pipeline Instantiation Problem}\label
\begin{example}[Max-Quality Pipeline Instance]
- Let us start from Example~\ref{ex:instance} and extend it with the comparison algorithm in Section~\ref{sec:instance} built on \emph{dloss}. The comparison algorithm is applied to the set of services $S'_*$ and returns three service rankings one for each vertex \vi{4}, \vi{5}, \vi{6} according to the amount of data anonymized.
+ Let us start from \cref{ex:instance} and extend it with the comparison algorithm in \cref{sec:instance} built on \emph{dloss}. The comparison algorithm is applied to the set of services $S'_*$ and returns three service rankings, one for each vertex \vi{4}, \vi{5}, \vi{6}, according to the amount of data anonymized.
 The ranking is listed in \cref{tab:instance_example_maxquality}(b) and is based on the transformation function in the policies. We assume that the more restrictive the transformation function (i.e., the more data it anonymizes), the lower the position of the service in the ranking. For example, \s{11} is ranked first because it anonymizes less data than \s{12} and \s{13}. The ranking of \s{22} and \s{23} is based on the same logic.
@@ -101,10 +101,10 @@ \subsection{Heuristic}\label{subsec:heuristics}
 %The computational challenge posed by the enumeration of all possible combinations within a given set is a well-established NP-hard problem.
 %The exhaustive exploration of such combinations swiftly becomes impractical in terms of computational time and resources, particularly when dealing with the analysis of complex pipelines.
 %In response to this computational complexity, the incorporation of heuristics emerges as a strategy to address the problem efficiently.
-%\hl{HO RIVISTO IL PARAGRAFO VELOCEMENTE GIUSTO PER DARE UN'INDICAZIONE. DOBBIAMO USARE LA FORMALIZZAZIONE E MAGARI FORMALIZZARE ANCHE LO PSEUDOCODICE.}
+%\hl{I REVISED THE PARAGRAPH QUICKLY, JUST TO GIVE AN INDICATION. WE SHOULD USE THE FORMALIZATION AND PERHAPS ALSO FORMALIZE THE PSEUDOCODE.}
+We design and implement a heuristic algorithm for computing the pipeline instance that maximizes data quality. Our heuristic is built on a \emph{sliding window} and aims to minimize information loss according to the quality metrics. At each step, a set of vertices of the pipeline template $\tChartFunction$ is selected according to a window w$=$[i,j], where $i$ and $j$ are the starting and ending depths of the window. Service filtering and selection in \cref{sec:instance} are then executed to minimize the information loss within window w. The heuristic returns as output the list of services instantiating the vertices at depth $i$. A new window w$=$[i+1,j+1] is then considered, until the window reaches the end of the template, that is, until $j$+1 equals the maximum depth of $\tChartFunction$.
 %For example, in our service selection problem, where the quantity of information lost needs to be minimized, the sliding window algorithm can be used to select the service composition with the lowest information loss within a fixed-size window.
-This strategy ensures that only services with low information loss are selected at each step, minimizing the \hl{overall o average?} information loss. Pseudo-code for the sliding window algorithm is presented in Algorithm 1.
+This strategy ensures that only services with low information loss are selected at each step, minimizing the average information loss. Pseudo-code for the sliding window algorithm is presented in \cref{lst:slidingwindowfirstservice}.
 \lstset{
 % backgroundcolor=\color{white},   % choose the background color; you must add \usepackage{color} or \usepackage{xcolor}
 }
\begin{lstlisting}[frame=single,mathescape, caption={Sliding Window Heuristic with Selection of First Service from Optimal Combination},label={lst:slidingwindowfirstservice}]
-  selectedServices = empty
-  for i from 0 to length(serviceCombinations):
-    minMetricCombination = None
-    minMetric = $+\infty$
-    M = JSD or J //JSD or Jaccard coefficient
-
-    for j from i to i + windowSize:
-      totalMetric = 0
-      for service in serviceCombinations[j]:
-        totalMetric += M(service)
-      currentMetric = totalMetric / length(serviceCombinations[j])
-      if currentMetric < minMetric:
+  function SlidingWindowHeuristic(verticesList, w){
+    $\text{G'}$ = []
+    for i from 0 to length(verticesList) - w {
+      minMetric = $\infty$
+      minMetricCombination = []
+      for windowIndex from i to i + w - 1 {
+        currentCombination = verticesList[windowIndex].services
+        totalMetric = 0
+        for service in currentCombination {
+          totalMetric += M(service) // M is the quality metric ($M_J$ or $M_{JSD}$)
+        }
+        currentMetric = totalMetric / length(currentCombination)
+        if currentMetric < minMetric {
          minMetric = currentMetric
-        minMetricCombination = serviceCombinations[j]
-    firstService = serviceCombinations[j][0]
-    add firstService to instance
-  return instance
+          minMetricCombination = currentCombination
+        }
+      }
+      if isLastWindowFrame() {
+        $\text{G'}$.append(minMetricCombination)
+      } else if length(minMetricCombination) > 0 {
+        $\text{G'}$.append(minMetricCombination[0])
+      }
+    }
+    return $\text{G'}$
+  }
\end{lstlisting}
-\hl{NON CHIARA, COSA SONO NODE?} The pseudocode implements function {\em SlidingWindowHeuristic}, which takes a sequence of vertices and a window size as input and returns a set of selected vertices as output. The function starts by initializing an empty set of selected vertices (line 3). Then, for each node in the sequence (lines 4--12), the algorithm iterates over the vertices in the window (lines 7--11) and selects the node with the lowest metric value (lines 9-11). The selected node is then added to the set of selected vertices (line 12). Finally, the set of selected vertices is returned as output (line 13).
+Function {\em SlidingWindowHeuristic} takes as input a list of vertices (verticesList), each associated with the list of its candidate services, and a window size w, and returns the list $\text{G'}$ of selected services.
-We note that a window of size 1 corresponds to the \emph{greedy} approach, while a window of size N, where N represents the total number of vertices, corresponds to the \emph{exhaustive} method.
+Initially, the function initializes $\text{G'}$, which stores the services selected during the process (line 2). It then iterates over all feasible starting depths, so that each window of the vertex list is evaluated (line 3). For each window, it initializes minMetric to infinity and minMetricCombination to the empty list; the latter stores the best service combination found within the current window (lines 4-5).
-The heuristic for service selection can be enhanced through the integration of other optimization algorithms, such as Ant Colony Optimization or Tabu Search. By integrating these approaches, it becomes feasible to achieve a more effective and efficient selection of services, with a specific focus on deleting paths previously deemed unfavorable.
+Within each window, the function iterates over the vertices (line 6), computing the total metric of the services associated with the current vertex (lines 8-11). It then computes the average metric of these services (line 12) and checks whether this average is the lowest encountered so far within the current window (line 13); if so, it updates minMetric and records the current combination as the best one for this window (lines 14-15).
+The function then checks whether it is processing the last window frame using isLastWindowFrame() (line 18). If so, the whole best combination of the window is appended to the result (line 19); otherwise, only the first service of the best combination is selected (lines 20-21). The function concludes by returning $\text{G'}$ (line 24), which contains the services selected across all windows.
 %\AG{It is imperative to bear in mind that the merging operations subsequent to the selection process and the joining operations subsequent to the branching process are executed with distinct objectives. In the former case, the primary aim is to optimize quality, whereas in the latter, the foremost objective is to minimize it.}
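+For reference, a direct Python transcription of \cref{lst:slidingwindowfirstservice} could look as follows. The sketch is illustrative only: vertices are modeled as plain lists of candidate services, the metric M as a callable to be minimized, and isLastWindowFrame() as an index check:
+\begin{lstlisting}[frame=single, language=Python, caption={Illustrative Python transcription of the sliding window heuristic}]
+def sliding_window_heuristic(vertices_list, w, metric):
+    # vertices_list: per vertex (in depth order), its candidate services;
+    # metric(s): dissimilarity score of service s, lower is better.
+    g_prime = []
+    last_start = len(vertices_list) - w   # start of the last window
+    for i in range(last_start + 1):
+        min_metric = float("inf")
+        min_combination = []
+        for window_index in range(i, i + w):
+            combination = vertices_list[window_index]
+            avg = sum(metric(s) for s in combination) / len(combination)
+            if avg < min_metric:
+                min_metric = avg
+                min_combination = combination
+        if i == last_start:               # last window frame
+            g_prime.extend(min_combination)
+        elif min_combination:
+            g_prime.append(min_combination[0])
+    return g_prime
+
+# Toy usage: three vertices with candidate services and a loss table.
+loss = {"s11": 0.1, "s12": 0.3, "s21": 0.2, "s22": 0.5, "s31": 0.25}
+pipeline = [["s11", "s12"], ["s21", "s22"], ["s31"]]
+print(sliding_window_heuristic(pipeline, 2, loss.get))
+\end{lstlisting}
+We also note that the window size trades computation for quality: a window of size 1 reduces the heuristic to a greedy choice, while a window spanning all the vertices approaches the exhaustive search.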