diff --git a/experiment.tex b/experiment.tex
index ee04f65..28729fe 100644
--- a/experiment.tex
+++ b/experiment.tex
@@ -1,7 +1,6 @@
\section{Experiments}\label{sec:experiment}
We experimentally evaluated the performance and quality of our methodology (heuristic algorithm in \cref{subsec:heuristics}), and compared it against the exhaustive approach in~\cref{sec:nphard}. In the following, \cref{subsec:experiments_infrastructure} presents the simulator and experimental settings used in our experiments;
-%, as well as the complete experimental settings;
\cref{subsec:experiments_performance} analyses the performance of our solution in terms of execution time; \cref{subsec:experiments_quality} discusses the quality of the best pipeline instance generated by our solution according to the metrics $M_J$ and $M_{JSD}$ in \cref{subsec:metrics}.
\subsection{Testing Infrastructure and Experimental Settings}\label{subsec:experiments_infrastructure}
@@ -30,7 +29,6 @@ \subsection{Testing Infrastructure and Experimental Settings}\label{subsec:exper
\resizebox{0.7\columnwidth}{!}{%
\begin{tikzpicture}[framed]
-
\node[draw, circle, fill=gray!40,minimum width=0.7cm] (v1) at (1,5.2) {$\vi{1}$};
\node[draw, circle, fill=gray!40,minimum width=0.7cm] (v2) at (3,5.2) {$\vi{2}$};
\node[draw, circle, fill=gray!40,minimum width=0.7cm] (v3) at (5,5.2) {$\vi{3}$};
@@ -58,24 +56,11 @@ \subsection{Testing Infrastructure and Experimental Settings}\label{subsec:exper
\node[draw, rectangle] (s42) at (9,1.7) {$\sii{42}$};
\node[draw, rectangle] (s43) at (9,0) {$\sii{43}$};
-
-
-
- % \draw[->] (node2) -- (node3);
- % \draw[->] (s1) -- (s11);
- %\draw[->] (s2) -- (s12);
- % \draw[->] (s3) -- (s13);
-
- % \draw[->] (s1) -- (s11);
- % \draw[->] (s1) -- (s12);
- % \draw[->] (s1) -- (s13);
-
\draw[->,line width= 1.2pt] (s2) -- (s11);
\draw[->,dashdotted] (s2) -- (s12);
\draw[->,dashdotted] (s2) -- (s13);
\draw[->,line width= 1pt] (s11) -- (s22);
- % \draw[->,dashdotted] (s2) -- (s12);
- % \draw[->,dashdotted] (s2) -- (s13);
\draw[->,dashdotted] (s11) -- (s21);
\draw[->,dashdotted] (s11) -- (s23);
@@ -88,7 +73,6 @@ \subsection{Testing Infrastructure and Experimental Settings}\label{subsec:exper
\draw[->,dashdotted] (s13) -- (s22);
\draw[->,dashdotted] (s13) -- (s23);
-
\draw[->,dashdotted] (s21) -- (s31);
\draw[->,dashdotted] (s21) -- (s32);
\draw[->,dashdotted] (s21) -- (s33);
@@ -97,7 +81,6 @@ \subsection{Testing Infrastructure and Experimental Settings}\label{subsec:exper
\draw[->,dashdotted] (s22) -- (s32);
\draw[->,dashdotted] (s22) -- (s33);
-
\draw[->,dashdotted] (s23) -- (s31);
\draw[->,dashdotted] (s23) -- (s32);
\draw[->,dashdotted] (s23) -- (s33);
@@ -107,14 +90,10 @@ \subsection{Testing Infrastructure and Experimental Settings}\label{subsec:exper
\draw[->] (v3) -- (v4);
\draw[->] (v4) -- (v5);
-
-
\begin{scope}[on background layer]
\draw[thick, dashed, fill=red!10, opacity=0.5]
([shift={(-0.5,0.5)}]s11.north west) rectangle ([shift={(0.5,-0.5)}]s33.south east);
-
-
\end{scope}
\begin{scope}[on background layer]
\draw[thick, dashed, fill=red!10, opacity=0.5]
@@ -122,8 +101,6 @@ \subsection{Testing Infrastructure and Experimental Settings}\label{subsec:exper
\end{scope}
- % \node[align = center, below,yshift=-20pt ] at (s23.south) {\ttfamily \scriptsize vertices=5 services=3 \windowsize=3 i=1};
-
\end{tikzpicture}
}
\caption{Execution example of the sliding window heuristic with $v$=5 vertices, $s$=3 services per vertex, and \windowsize=3, at step $i$=1.}
@@ -133,7 +110,7 @@ \subsection{Testing Infrastructure and Experimental Settings}\label{subsec:exper
\subsection{Performance}\label{subsec:experiments_performance}
We first measured the performance (execution time) of our exhaustive and heuristic solutions by varying the number of vertices in the pipeline template from 2 to 7 and the number of services per vertex from 2 to 7. \cref{fig:time_window_perce_average} presents our results for both the exhaustive and heuristic solutions. The exhaustive approach is able to provide the optimal solution for all configurations, but its execution time grows exponentially with the number of vertices and services, making it impractical for large instances. For \windowsize from 1 to 3 (step 1), we observed a substantial reduction in execution time, with the heuristic always able to produce an instance in less than $\approx2.7\times10^5ms$. The worst heuristic performance (7 vertices, 7 services, \windowsize=6), $\approx3.8\times10^7ms$, is still one order of magnitude lower than the best exhaustive performance (7 vertices, 7 services, \windowsize=7), $\approx1.35\times10^8ms$.
- \begin{figure}[!htb]
+ \begin{figure}[!t]
\centering
\begin{subfigure}{0.45\textwidth}
\includegraphics[width=\textwidth]{Images/graphs/window_time_performance_qualitative_n7_s7_50_80_n3}
@@ -165,57 +142,11 @@ \subsection{Testing Infrastructure and Experimental Settings}\label{subsec:exper
\end{subfigure}
\label{fig:time_window_perce_average}
\end{figure}
- % % \hfill
- % % \begin{subfigure}{0.33\textwidth}
- % % \includegraphics[width=\textwidth]{Images/graphs/quality_plot_average_n7.eps}
- % % \caption{7 vertices}
- % % \label{fig:third}
- % % \end{subfigure}
- % \caption{Evaluation of Execution Time Using the \emph{Qaulitative} Metric in a \average Profile Configuration.} \label{fig:time_window_perce_wide}
- % \end{figure}
- % \begin{figure}[!htb]
- % \centering
- % \begin{subfigure}{0.45\textwidth}
- % \includegraphics[width=\textwidth]{Images/graphs/window_time_performance_n7_s7_20_100_n3}
- % \caption{3 vertices}
- % \label{fig:time_window_perce_average_3n}
- % \end{subfigure}
- % \hfill
- % \begin{subfigure}{0.45\textwidth}
- % \includegraphics[width=\textwidth]{Images/graphs/window_time_performance_n7_s7_20_100_n4}
- % \caption{4 vertices}
- % \label{fig:time_window_perce_average_4n}
- % \end{subfigure}
- % \hfill
- % \begin{subfigure}{0.45\textwidth}
- % \includegraphics[width=\textwidth]{Images/graphs/window_time_performance_n7_s7_20_100_n5}
- % \caption{5 vertices}
- % \label{fig:time_window_perce_average_5n}
- % \end{subfigure}
- % \hfill
- % \begin{subfigure}{0.45\textwidth}
- % \includegraphics[width=\textwidth]{Images/graphs/window_time_performance_n7_s7_20_100_n6}
- % \caption{6 vertices}
- % \label{fig:time_window_perce_average_6n}
- % \end{subfigure}
- % \begin{subfigure}{0.45\textwidth}
- % \includegraphics[width=\textwidth]{Images/graphs/window_time_performance_n7_s7_20_100_n7}
- % \caption{7 vertices}
- % \label{fig:time_window_perce_average_7n}polo
- % \end{subfigure}
- % % \hfill
- % % \begin{subfigure}{0.33\textwidth}
- % % \includegraphics[width=\textwidth]{Images/graphs/quality_plot_average_n7.eps}
- % % \caption{7 vertices}
- % % \label{fig:third}
- % % \end{subfigure}
- % \caption{Evaluation of Execution Time Using the \emph{Quantitative} Metric in a \average Profile Configuration.}
-
+
\subsection{Quality}\label{subsec:experiments_quality}
- We finally evaluated the quality of our heuristic algorithm with different \windowsize\ comparing, where possible, its results with the optimal solution retrieved by executing the exhaustive approach.
%The latter is executed with window size equals to the number of vertices in the pipeline template, and provides the best, among all possible, solutions.
+ We finally evaluated the quality of our heuristic algorithm with different \windowsize\ values, comparing, where possible, its results with the optimal solution retrieved by executing the exhaustive approach. The quality $Q$ of the heuristic has been normalized between 0 and 1 by dividing it by the quality $Q^*$ retrieved by the exhaustive approach.
-
We ran our experiments varying: \emph{i)} the length $l$ of the pipeline template in [3,7], that is, the depth of the pipeline template as the number of vertices composed in a sequence, \emph{ii)} the window size \windowsize\ in [1,$l$], and \emph{iii)} the number of candidate services for each vertex in the pipeline template in [2,7]. Each vertex is associated with a policy (or set of policies) that applies a filtering transformation removing a percentage of data either in $[0.5,0.8]$ (\average) or in $[0.2,1]$ (\wide).
\cref{fig:quality_window_perce} presents our quality results using metric $M_J$ in \cref{subsec:metrics} for settings \wide and \average.
@@ -224,21 +155,15 @@ \subsection{Testing Infrastructure and Experimental Settings}\label{subsec:exper
When considering setting \wide, the greedy approach (\windowsize=1) provides good results on average (from 0.71 to 0.90), while showing substantial quality oscillations in specific runs: between 0.882 and 0.970 for 3 vertices, 0.810 and 0.942 for 4 vertices, 0.580 and 0.853 for 5 vertices, 0.682 and 0.943 for 6 vertices, 0.596 and 0.821 for 7 vertices. This same trend emerges when the window size is $<$$l$/2, while it starts approaching the optimum when the window size is $\geq$$l$/2. For instance, when \windowsize=$l$-1, the quality varies between 0.957 and 1.0 for 3 vertices, 0.982 and 1.0 for 4 vertices, 0.986 and 0.998 for 5 vertices, 0.977 and 1.0 for 6 vertices, 0.996 and 1.0 for 7 vertices.
When considering setting \average, the heuristic algorithm still provides good results, limiting the quality oscillations observed for setting \wide\ and approaching the quality of the exhaustive approach also for lower window sizes. The greedy approach (\windowsize=1) provides good results on average (from 0.842 to 0.944), as well as in specific runs: between 0.927 and 0.978 for 3 vertices, 0.903 and 0.962 for 4 vertices, 0.840 and 0.915 for 5 vertices, 0.815 and 0.934 for 6 vertices, 0.721 and 0.935 for 7 vertices.
- %This same trend emerges when the window size is less than $l$/2, while it starts approaching the optimum when the window size is higher than $l$/2. For instance,
When \windowsize=$l$-1, the quality varies between 0.980 and 1.0 for 3 vertices, 0.978 and 1.0 for 4 vertices, 0.954 and 1.0 for 5 vertices, 0.987 and 1.0 for 6 vertices, 0.990 and 1.0 for 7 vertices.
-
\cref{fig:quality_window_qualitative} presents our quality results using metric $M_{JSD}$ in \cref{subsec:metrics} for settings \wide and \average, respectively.
- % In general, \hl{ANTONGIACOMO}
- When considering setting \wide, the greedy approach (\windowsize=1) provides good results on average (0.92, 0.97), limiting oscillations observed with metric $M_J$; for instance, the quality varies between 0.951 and 0.989 for 3 vertices, 0.941 and 0.988 for 4 vertices, 0.919 and 0.974 for 5 vertices, 0.911 and 0.971 for 6 vertices, 0.877 and 0.924 for 7 vertices. %In this case the quality oscillations are more stable than the ones observed for the metric $M_J$.
+ When considering setting \wide, the greedy approach (\windowsize=1) provides good results on average (from 0.92 to 0.97), limiting the oscillations observed with metric $M_J$; for instance, the quality varies between 0.951 and 0.989 for 3 vertices, 0.941 and 0.988 for 4 vertices, 0.919 and 0.974 for 5 vertices, 0.911 and 0.971 for 6 vertices, 0.877 and 0.924 for 7 vertices. The worst quality results are obtained with window size equal to 1, while the oscillations are negligible when the window size is $>$2. For instance, when \windowsize=$l$-2, the quality varies between 0.982 and 0.996 for 4 vertices, 0.981 and 0.998 for 5 vertices, 0.988 and 1.0 for 6 vertices, 0.976 and 0.999 for 7 vertices. When \windowsize=$l$-1, the quality varies between 0.987 and 0.998 for 3 vertices, 0.993 and 1.0 for 4 vertices, 0.985 and 0.999 for 5 vertices, 0.997 and 1.0 for 6 vertices, 0.995 and 1.0 for 7 vertices.
When considering setting \average, the greedy approach (\windowsize=1) provides results similar to setting \wide. On average, quality varies from 0.920 to 0.969, limiting oscillations; for instance, the quality varies between 0.951 and 0.989 for 3 vertices, 0.942 and 0.988 for 4 vertices, 0.919 and 0.975 for 5 vertices, 0.912 and 0.972 for 6 vertices, 0.878 and 0.925 for 7 vertices. The \average configuration provides even tighter quality oscillations than the \wide configuration. Notably, the poorest quality outcomes are observed when the window size is set to 1. Conversely, these oscillations become negligible when the window size exceeds 1 in configurations with three and four vertices, and when it exceeds 2 in configurations involving five, six, and seven vertices. For instance, when \windowsize=3, the quality varies between 0.993 and 1.0 for 4 vertices, 0.981 and 0.998 for 5 vertices, 0.982 and 0.997 for 6 vertices, 0.960 and 0.991 for 7 vertices.
-
-
-
Our results suggest that the proposed heuristic closely approximates the results obtained by the exhaustive approach. While larger window sizes generally lead to better quality, there exists a breakpoint where the balance between quality and execution time is optimized. Beyond this point, the incremental gains in metric values may not justify the additional computational burden introduced by larger windows. It is worth noting that lower window sizes are more unstable, especially with setting \wide, meaning that the quality varies significantly among different configurations. This effect stabilizes with higher window sizes (e.g., \windowsize$\geq$$l$/2).
\begin{figure}[H]
\centering
diff --git a/metrics.tex b/metrics.tex
index c775d81..ab0d28d 100644
--- a/metrics.tex
+++ b/metrics.tex
@@ -1,20 +1,14 @@
\section{Maximizing the Pipeline Instance Quality}\label{sec:heuristics}
-%
-% %Ovviamente non è sufficiente scegliere il best service per ogni vertice, ma diventa un problema complesso dove si devono calcolare/valutare tutte le possibili combinazioni dei servizi disponibili, tra le quali scegliere la migliore.
-Our goal is to generate a pipeline instance with maximum quality \q, addressing data protection requirements throughout the pipeline execution. To this aim, we first discuss the quality metrics used to measure and monitor data quality \q, which guide the generation of the pipeline instance with maximum \q. Then, we prove that the problem of generating a pipeline instance with maximum \q\ is NP-hard (\cref{sec:nphard}).
Finally, we introduce a parametric heuristic (\cref{subsec:heuristics}) designed to tackle the computational complexity associated with enumerating all possible combinations within a given set. The main objective of the heuristic is to approximate the optimal path for service interactions and transformations, especially within the realm of complex pipelines consisting of numerous vertices and candidate services. Our focus extends beyond identifying optimal combinations to include an understanding of the quality changes introduced during the transformation processes. - -%Inspired by existing literature, these metrics, categorized as quantitative and statistical, play a pivotal role in quantifying the impact of policy-driven transformations on the original dataset. +Our goal is to generate a pipeline instance with maximum quality, addressing data protection requirements throughout the pipeline execution. To this aim, we first discuss the quality metrics used to measure and monitor data quality, which guide the generation of the pipeline instance with maximum quality. Then, we prove that the problem of generating a pipeline instance with maximum quality is NP-hard (\cref{sec:nphard}). Finally, we introduce a parametric heuristic (\cref{subsec:heuristics}) designed to tackle the computational complexity associated with enumerating all possible combinations within a given set. The main objective of the heuristic is to approximate the optimal path for service interactions and transformations, especially within the realm of complex pipelines consisting of numerous vertices and candidate services. Our focus extends beyond identifying optimal combinations to include an understanding of the quality changes introduced during the transformation processes. \subsection{Quality Metrics}\label{subsec:metrics} -%Ensuring data quality is mandatory to implement data pipelines that provide accurate results and decision-making along the whole pipeline execution. To this aim, we define two metrics evaluating the quality loss introduced by our policy-driven transformation in Section~\cite{ADD} on the input dataset \origdataset at each step of the data pipeline. Our metrics can be classified as \emph{quantitative} and \emph{qualitative}~\cite{ADD}, and compare the input dataset \origdataset\ and dataset \transdataset\ generated by enforcing data protection requirements on \origdataset. -Ensuring data quality is mandatory to implement data pipelines that provide accurate results and decision-making along the whole pipeline execution. Quality metrics measure the data quality preserved at each step of the data pipeline, and can be classified as \emph{quantitative} or \emph{qualitative}. %~\cite{ADD}\hl{CITE}. +Ensuring data quality is mandatory to implement data pipelines that provide accurate results and decision-making along the whole pipeline execution. Quality metrics measure the data quality preserved at each step of the data pipeline, and can be classified as \emph{quantitative} or \emph{qualitative}. Quantitative metrics monitor the amount of data lost during data transformations to model the quality difference between datasets \origdataset\ and \transdataset. Qualitative metrics evaluate changes in the properties of datasets \origdataset\ and \transdataset. For instance, qualitative metrics can measure the changes in the statistical distribution of the two datasets. 
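+As a toy illustration of the difference (the numbers are ours, for exposition only, and not drawn from our experiments), consider a dataset \origdataset\ with 100 records. A transformation that deletes 30 records preserves a quantitative quality of $70/100$$=$$0.7$, regardless of which records are deleted. Conversely, a transformation that generalizes the age of every record leaves the number of records unchanged, so a purely quantitative metric reports no loss, while a qualitative metric captures the divergence between the age distributions of \origdataset\ and \transdataset.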
-In this paper, we use two metrics, one quantitative and one qualitative, to compare the input dataset \origdataset\ and dataset \transdataset\ generated by enforcing data protection requirements (i.e., our policy-driven transformation in \cref{sec:instance}) on \origdataset\ at each step of the pipeline. We note that a complete taxonomy of possible metrics is outside the scope of this paper and will be the target of our future work.
+In this paper, we use two metrics, one quantitative and one qualitative, to compare the input dataset \origdataset\ and the dataset \transdataset\ obtained by enforcing data protection requirements (i.e., our policy-driven transformation described in \cref{sec:instance}) on \origdataset\ at each step of the pipeline. We note that a complete taxonomy of possible metrics is outside the scope of this paper and will be the target of our future work.
\subsubsection{Quantitative metric}
-%We propose a metric that measures the similarity between two datasets, for this purpose, we use the Jaccard coefficient.
We propose a quantitative metric $M_J$ based on the Jaccard coefficient that assesses the similarity between two datasets. The Jaccard coefficient is defined as follows \cite{RAHMAN20102707}: \[J(X,Y) = \frac{|X \cap Y|}{|X \cup Y|}\] where X and Y are two datasets of the same size.
@@ -23,11 +17,9 @@ \subsubsection{Quantitative metric}
Metric $M_J$ extends the Jaccard coefficient with weights that model the importance of each element in the dataset as follows:\[M_J(X,Y) = \frac{\sum_{i=1}^{n}w_i|x_i \cap y_i|}{\sum_{i=1}^{n}w_i|x_i \cup y_i|}\] where $x_i$$\in$X ($y_i$$\in$Y, resp.) is the $i$-th feature of dataset X (Y, resp.), and $w_i$ the weight modeling the importance of the $i$-th feature.
-It is computed by dividing the cardinality of the intersection of two datasets by the cardinality of their union, weighted by the importance of each feature in the datasets. It provides a more accurate measure of similarity. %Weights prioritize certain elements (e.g., a specific feature) in the datasets.
-%The Weighted Jaccard coefficent can then account for element importance and provide a more accurate measure of similarity.
+It is computed by dividing the cardinality of the intersection of two datasets by the cardinality of their union, weighted by the importance of each feature in the datasets. Weights prioritize specific features, providing a more accurate measure of similarity than the unweighted coefficient.
\subsubsection{Qualitative Metric}
-%We propose a metric that enables the measurement of the distance of two distributions.
We propose a qualitative metric $M_{JSD}$ based on the Jensen-Shannon Divergence (JSD) that assesses the similarity (distance) between the probability distributions of two datasets. JSD is a symmetrized version of the KL divergence~\cite{Fuglede} and is applicable to a pair of statistical distributions only. It is defined as follows:
@@ -38,18 +30,13 @@ \subsubsection{Qualitative Metric}
JSD incorporates both the KL divergence from X to M and from Y to M. To make JSD applicable to datasets, where each feature in the dataset has its own statistical distribution, metric $M_{JSD}$ applies JSD to each column of the dataset.
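+For concreteness, the following Python sketch shows one way of computing both metrics feature by feature (a minimal illustration, not our implementation: the encoding of features as value sets, the binned distributions, and the weights are assumptions made for the example):
+\begin{verbatim}
+import numpy as np
+
+def m_j(X, Y, w):
+    # Weighted Jaccard similarity M_J: X[f] and Y[f] are the value sets
+    # of feature f in the original/transformed dataset, w[f] its weight.
+    num = sum(w[f] * len(X[f] & Y[f]) for f in X)
+    den = sum(w[f] * len(X[f] | Y[f]) for f in X)
+    return num / den
+
+def jsd(p, q):
+    # Jensen-Shannon Divergence between two discrete distributions;
+    # log base 2 keeps the result in [0, 1].
+    p, q = np.asarray(p, float), np.asarray(q, float)
+    m = 0.5 * (p + q)
+    def kl(a, b):
+        nz = a > 0
+        return float(np.sum(a[nz] * np.log2(a[nz] / b[nz])))
+    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
+
+# toy usage with hypothetical features and weights
+X = {"age": {30, 35, 40}, "gender": {"F", "M"}}
+Y = {"age": {30, 35}, "gender": {"F", "M"}}
+print(m_j(X, Y, {"age": 0.7, "gender": 0.3}))  # ~0.74
+print(jsd([0.5, 0.5], [0.9, 0.1]))             # ~0.15
+\end{verbatim}
+In this view, \texttt{jsd} is computed column by column; the resulting per-feature values are the inputs of the aggregation below.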
The obtained results are then aggregated using a weighted average, thus enabling the prioritization of important features that can be lost during the policy-driven transformation in \cref{sec:heuristics}, as follows: \[M_{JSD} = 1 - \sum_{i=1}^n w_i \cdot \text{JSD}(x_i,y_i)\]
-%where \(w_i = \frac{n_i}{N}\) represents the weight for the \(i\)-th column, with \(n_i\) being the number of distinct elements in the $i$-th feature and \(N\) the total number of elements in the dataset.
where $\sum_{i=1}^n w_i$$=$1 and each \(\text{JSD}(x_i,y_i)\) accounts for the Jensen-Shannon Divergence computed for the \(i\)-th feature in datasets X and Y. It ranges from 0 to 1, with 0 indicating no similarity (minimum quality) and 1 indicating complete similarity (maximum quality) between the datasets.
-%Must be noted that the one minus has been added to the formula to transfrom the metric into a similarity metric, where 1 indicates complete similarity and 0 indicates no similarity.
$M_{JSD}$ thus provides a weighted, symmetric, and normalized measure of similarity that accounts for both the overall data distributions and the contribution of specific features.
\subsubsection{Pipeline Quality}
-%We note that our metrics can be applied either to the entire dataset or to specific features only. The features can be assigned with equal or varying importance, providing a weighted version of the metrics, thus enabling the prioritization of important features that might be possibly lost during the policy-driven transformation in Section~\cite{ADD}.
-
-Metrics $M_J$ and $M_{JSD}$ contribute to the calculation of the pipeline quality \q\ as follows. %Information loss is calculated as the average \emph{AVG} of data at each vertex \vi{i}$\in$$\V_S$ of the service pipeline $G(V,E)$ as follows.
-
+Metrics $M_J$ and $M_{JSD}$ contribute to the calculation of the pipeline quality \q\ as follows.
\vspace{0.5em}
\begin{definition}[\emph{\PipQuality}]\label{def:quality}
@@ -57,12 +44,9 @@ \subsubsection{Pipeline Quality}
\end{definition}
\vspace{0.5em}
-%We note that $M_{ij}$ models the average data quality preserved within the pipeline instance $G'$.
-We also use the notation $\q_{ij}$, with $\q_{ij} = M_{ij}$, to specify the \quality at vertex \vii{i}$\in$$\V'_S$ of $G'$ for service \sii{j}.
-%We also note that information loss \textit{dloss} is used to generate the Max-Quality pipeline instance in the remaining of this section.
+We also use the notation $\q_{ij}$, with $\q_{ij}$$=$$M_{ij}$, to specify the \quality at vertex \vii{i}$\in$$\V'_S$ of $G'$ for service \sii{j}.
\subsection{NP-Hardness of the Max-Quality Pipeline Instantiation Problem}\label{sec:nphard}
-%The problem of computing a pipeline instance (\cref{def:instance}) with maximum quality \q\ can be formally defined as follows.
The process of computing a pipeline instance (\cref{def:instance}) with maximum quality \q\ can be formally defined as follows.
\vspace{0.5em}
@@ -71,15 +55,13 @@ \subsection{NP-Hardness of the Max-Quality Pipeline Instantiation Problem}\label
\begin{itemize}
\item $G'$ satisfies conditions in \cref{def:instance},
\item $\nexists$ a pipeline instance $G''$ that satisfies conditions in \cref{def:instance} and such that quality \q($G''$)$>$\q($G'$), where \q($\cdot$) is the pipeline quality in Definition~\ref{def:quality}.
- %computed after applying the transformation of the policy matching the service selected to instantiate vertex \vi{i}$\in$$\V_S$, .
\end{itemize}
\end{definition}
\vspace{0.5em}
The Max-Quality problem is a combinatorial selection problem and is NP-hard, as stated by \cref{theorem:NP}. However, while the overall problem is NP-hard, the filtering step of the process is solvable in polynomial time.
-%However, while the overall problem is NP-hard, there is a component of the problem, i.e., matching the profile of each service with the corresponding vertex policy, that is solvable in polynomial time.
-It can be done by iterating over each vertex and each service, checking if the service matches the vertex policy. This process takes polynomial time complexity $O(|N|*|S|)$.
+It can be done by iterating over each vertex and each candidate service, checking whether the service profile matches the vertex policy. This process takes polynomial time $O(|V_S| \cdot |S|)$.
\vspace{0.5em}
@@ -89,7 +71,7 @@ \subsection{NP-Hardness of the Max-Quality Pipeline Instantiation Problem}\label
\emph{Proof: } The proof is a reduction from the multiple-choice knapsack problem (MCKP), a classic NP-hard combinatorial optimization problem, which is a generalization of the simple knapsack problem (KP) \cite{Kellerer2004}. In the MCKP problem, there are $t$ mutually disjoint classes $N_1,N_2,\ldots,N_t$ of items to pack in some knapsack of capacity $C$, class $N_i$ having size $n_i$. Each item $j$$\in$$N_i$ has a profit $p_{ij}$ and a weight $w_{ij}$; the problem is to choose one item from each class such that the profit sum is maximized without the weight sum exceeding $C$.
-The MCKP can be reduced to the Max quality \problem in polynomial time, with $N_1,N_2,\ldots,N_t$ corresponding to the sets of compatible services $S^c_{1}, S^c_{2}, \ldots, S^c_{u}$, with $t$$=$$u$ and $n_i$ also the size of each set $S^c_{i}$. The profit $p_{ij}$ of item $j$$\in$$N_i$ corresponds to quality \textit{\q}$_{ij}$ computed for each candidate service $s_j$$\in$$S^c_{i}$, while $w_{ij}$ is uniformly 1 (thus, C is always equal to the cardinality of $V_C$). It is evident that the solution to one problem is also the solution to the other (and vice versa). Since the reduction can be done in polynomial time, the Max-Quality problem is also NP-hard.
+The MCKP can be reduced to the Max-Quality \problem in polynomial time, with $N_1,N_2,\ldots,N_t$ corresponding to the sets of compatible services $S^c_{1}, S^c_{2}, \ldots, S^c_{u}$, with $t$$=$$u$ and $n_i$ also the size of each set $S^c_{i}$. The profit $p_{ij}$ of item $j$$\in$$N_i$ corresponds to quality \textit{\q}$_{ij}$ computed for each candidate service $s_j$$\in$$S^c_{i}$, while $w_{ij}$ is uniformly 1 (thus, $C$ is always equal to the cardinality of $V_C$). It is evident that the solution to one problem is also the solution to the other (and vice versa). Since the reduction can be done in polynomial time, the Max-Quality problem is also NP-hard.
\vspace{0.5em}
@@ -100,29 +82,13 @@ \subsection{NP-Hardness of the Max-Quality Pipeline Instantiation Problem}\label
The ranking of \s{31} and \s{32} is affected by the environment state at the time of the ranking. For example, if the environment where the visualization is performed is a CT facility, then \s{31} is ranked first and \s{32} second because the facility is considered less risky than the cloud, and $Q_{31}$$>$$Q_{32}$.
\end{example}
-% The metrics established will enable the quantification of data loss pre- and post-transformations.
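+To make the reduction concrete, consider a small illustrative instance (ours, for exposition only): a template with two vertices whose compatible service sets are $S^c_{1}$$=$$\{s_{11},s_{12}\}$ and $S^c_{2}$$=$$\{s_{21},s_{22},s_{23}\}$ maps to an MCKP with classes $N_1$ and $N_2$ of sizes $n_1$$=$2 and $n_2$$=$3, profits $p_{ij}$$=$$\q_{ij}$, unit weights, and capacity $C$$=$2; choosing the profit-maximizing item from each class is exactly choosing the quality-maximizing service for each vertex.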
-% In the event of multiple service interactions, each with its respective transformation, -% efforts will be made to minimize the loss of information while upholding privacy and security standards. -% Due to the exponential increase in complexity as the number of services and transformations grow, -% identifying the optimal path is inherently an NP-hard problem. -% As such, we propose some heuristics to approximate the optimal path as closely as possible. -%To evaluate their efficacy, the heuristically generated paths will be compared against the optimal solution. - \subsection{Heuristic}\label{subsec:heuristics} -%The computational challenge posed by the enumeration of all possible combinations within a given set is a well-established NP-hard problem.} -%The exhaustive exploration of such combinations swiftly becomes impractical in terms of computational time and resources, particularly when dealing with the analysis of complex pipelines. -%In response to this computational complexity, the incorporation of heuristic emerges as a strategy to try to efficiently address the problem. -%\hl{HO RIVISTO IL PARAGRAFO VELOCEMENTE GIUSTO PER DARE UN'INDICAZIONE. DOBBIAMO USARE LA FORMALIZZAZIONE E MAGARI FORMALIZZARE ANCHE LO PSEUDOCODICE.} We design and implement a heuristic algorithm built on a \emph{sliding window} for computing the pipeline instance maximizing quality \q. -%Our heuristic is built on a \emph{sliding window} and aims to maximize information \quality \emph{\q} according to quality metrics. -%At each step, a set of vertices in the pipeline template $\tChartFunction$ is selected according to a window of size \windowsize, which select a subset of the pipeline template starting at depth $i$ and ending at depth \windowsize+i-1. At each iteration $i$, a window of size \windowsize\ selects a subset of vertices in the pipeline template $\tChartFunction$, from vertices at depth $i$ to vertices at depth \windowsize$+$$i$$-$1. Service filtering and selection in \cref{sec:instance} are then executed to maximize quality $Q_w$ in window $w$. The heuristic returns as output the list of services instantiating all vertices at depth $i$. The sliding window $w$ is then shifted by 1 (i.e., $i$$=$$i$+1) and the filtering and selection process executed until \windowsize$+$$i$$-$1 is equal to length $l$ (max depth) of $\tChartFunction$, that is, the sliding window reaches the end of the template. In the latter case, the heuristic instantiates all remaining vertices and returns the pipeline instance $G'$. -%For example, in our service selection problem where the quantity of information lost needs to be minimized, the sliding window algorithm can be used to select services composition that have the lowest information loss within a fixed-size window. This strategy ensures that only services with low information loss are selected at each step, maximizing the pipeline quality \q. \newenvironment{redtext}{\footnotesize \color{gray}}{~~} \begin{figure}[!t] - % \begin{} \hrule\vspace{3pt} \begin{tabbing} \INPUT\\ @@ -170,19 +136,15 @@ \subsection{Heuristic}\label{subsec:heuristics} \hrule \vspace{10pt} \caption{\label{fig:slidingwindow-pseudocode} Pseudocode of the sliding window heuristic algorithm.} - % \end{footnotesize} \end{figure} The pseudocode of the heuristic algorithm is presented in \cref{fig:slidingwindow-pseudocode}. 
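+Before walking through the pseudocode, the following Python sketch gives a compact, executable rendering of the same scheme (a simplification under assumptions of ours: the template is linear, and \texttt{score} is a hypothetical stand-in for the window quality $Q_w$ obtained by service filtering and selection):
+\begin{verbatim}
+from itertools import product
+
+def sliding_window_heuristic(template, w_size, score):
+    # template[i]: list of (policy-filtered) candidate services for the
+    # vertex at depth i; score: quality of a combination of services.
+    instance, i = [], 0
+    while i + w_size < len(template):
+        # exhaustively evaluate the current window...
+        best = max(product(*template[i:i + w_size]), key=score)
+        # ...but commit only the service at depth i, then slide by one
+        instance.append(best[0])
+        i += 1
+    # the window reaches the end of the template:
+    # instantiate all remaining vertices at once
+    best = max(product(*template[i:]), key=score)
+    instance.extend(best)
+    return instance, score(tuple(instance))
+\end{verbatim}
+With \windowsize=1 the sketch degenerates to a greedy pass and with \windowsize\ equal to the template length it becomes exhaustive; intermediate values trade execution time for quality, as evaluated in \cref{sec:experiment}.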
Function \textbf{SlidingWindowHeuristic} implements our heuristic; it takes the pipeline template $\tChartFunction$ and the window size \windowsize\ as input and returns the pipeline instance $G'$ and corresponding metric $M$ as output. Function \textbf{SlidingWindowHeuristic} retrieves the optimal service combination composing $G'$, considering the candidate services associated with each vertex in $\tChartFunction$ and the constraints (policies) in \emph{verticesList}.
-%Initially, the function initializes $G'$ to store the pipeline instance (line 1).
It iterates over all sliding windows $w$ with step 1 until the end of the pipeline template is reached (\textbf{for cycle} in line 2), adding the service(s) selected at step $i$ to $G'$ by function \textbf{SelectService} (defined in line 10). Function \textbf{SelectService} takes as input index $i$ representing the starting depth of the window and the corresponding window size \windowsize. It initializes the best combination of services to \textit{empty} (line 11). It iterates through all possible combinations of services in the window using the Cartesian product of the service lists (\textbf{for cycle} in lines 13-16). If the current combination has a quality metric M($G'_w$) higher than the best quality metric M($G^*_w$), the current combination $G'_w$ replaces the best combination $G^*_w$ (lines 14-15). Function \textbf{SelectService} then checks whether it is processing the last window (line 18). If yes, it returns the best combination $G^*_w$ (line 19). Otherwise, it returns the first service in the best combination $G^*_w$ (line 21).
-Within each window, function \textbf{SlidingWindowHeuristic} finally iterates through the selected services to calculate the total quality metric $M$ (\textbf{for cycle} in lines 6-8). This metric is updated by summing the quality metrics of the selected services. The function concludes by returning the best pipeline instance $G'$ and the corresponding quality metric $M$ (line 9).
-
-
+Within each window, function \textbf{SlidingWindowHeuristic} finally iterates through the selected services to calculate the total quality metric $M$ (\textbf{for cycle} in lines 6-8). This metric is updated by summing the quality metrics of the selected services. The function concludes by returning the best pipeline instance $G'$ and the corresponding quality metric $M$ (line 9).
\ No newline at end of file
diff --git a/pipeline_instance.tex b/pipeline_instance.tex
index 359d622..35c1582 100644
--- a/pipeline_instance.tex
+++ b/pipeline_instance.tex
@@ -18,7 +18,7 @@ \section{Pipeline Instance}\label{sec:instance}
Condition 1 requires that each selected service \sii{i} satisfies the policy requirements \P{i} of the corresponding vertex \vi{i} in the \pipelineTemplate, whereas Condition 2 is needed to preserve the process functionality, as it simply states that each service \sii{i} must satisfy the functional requirements \F{i} of the corresponding vertex \vi{i} in the \pipelineTemplate.
-We then define a \emph{pipeline instantiation} function that takes as input a \pipelineTemplate \tChartFunction and a set $S^c$ of candidate services, split in a specific set of services $S^c_{i}$ for each vertex \vi{i}$\in$$\V_S$, and returns as output a \pipelineInstance \iChartFunction. Recall from Section~\ref{sec:funcannotation} that all candidate services meet the functional annotation in the template, meaning that Condition 2 in Definition~\ref{def:instance} is satisfied for all candidate services.
+We then define a \emph{pipeline instantiation} function that takes as input a \pipelineTemplate \tChartFunction and a set $S^c$ of candidate services, and returns as output a \pipelineInstance \iChartFunction. We note that $S^c$ is partitioned into disjoint sets of services $S^c_{i}$, one for each vertex \vi{i}$\in$$\V_S$. Recall from Section~\ref{sec:funcannotation} that all candidate services meet the functional annotation in the template, meaning that Condition 2 in Definition~\ref{def:instance} is satisfied for all candidate services.
The \pipelineInstance is generated by traversing the \pipelineTemplate with a breadth-first search algorithm, starting from the root vertex \vi{r}. Then, for each vertex $\vi{f}$ in the pipeline template, the corresponding vertex $\vii{f}$ is generated.
@@ -29,9 +29,9 @@ \section{Pipeline Instance}\label{sec:instance}
\item \textit{Selection Algorithm} -- The selection algorithm selects one service $s'_i$ for each set $S'_{i}$ of compatible services, which instantiates the corresponding vertex $\vii{i}$$\in$$\Vp$. There are many ways of choosing $s'_i$; we present our approach, based on the maximization of data \quality \emph{\q}, in Section \ref{sec:heuristics}.
\end{enumerate}
-When all vertices $\vi{i}$$\in$$V$ in $G^{\myLambda,\myGamma}$ have been visited, the \pipelineInstance G' is generated, with a service instance $s'_i$ for each \vii{i}$\in$\Vp. Vertex \vii{i} is still annotated with policies in \P{i} according to \myLambda, because policies in \P{i} are evaluated and enforced only when the pipeline instance is triggered before any service is executed. In the case of policy evaluation returns \emph{true}, data transformation \TP$\in$\P{i} is applied, otherwise a default transformation that removes all data is applied.
+When all vertices $\vi{i}$$\in$$V$ in $G^{\myLambda,\myGamma}$ have been visited, the \pipelineInstance $G'$ is generated, with a service instance $s'_i$ for each \vii{i}$\in$\Vp. Vertex \vii{i} is still annotated with policies in \P{i} according to \myLambda, because policies in \P{i} are evaluated and enforced at runtime, only when the pipeline instance is triggered and before any service is executed. If the policy evaluation returns \emph{true}, data transformation \TP$\in$\P{i} is applied; otherwise, a default transformation that removes all data is applied.
-\begin{figure}[ht!]
+\begin{figure}[!t]
\centering
\newcommand{\function}[1]{$\instanceChartAnnotation{}_{#1}$}
\begin{tikzpicture}[scale=0.7]
diff --git a/pipeline_instance_example.tex b/pipeline_instance_example.tex
index 220061f..9aedbef 100644
--- a/pipeline_instance_example.tex
+++ b/pipeline_instance_example.tex
@@ -1,18 +1,15 @@
-%\subsection{Example}\label{sec:example_instace}
-
\begin{example}[\bf \pipelineInstance]\label{ex:instance}
Let us consider a subset \{\vi{5}, \vi{6}, \vi{7}\} of the pipeline template $G^{\myLambda,\myGamma}$ in Example~\ref{ex:template}.
-As presented in Table~\ref{tab:exisnt}(a), each vertex is labeled with policies (column \emph{candidate--$>$policy}), and then associated with different candidate services (column \emph{candidate}) and corresponding profile (column \emph{profile}). The filtering algorithm matches each candidate service profile with the policies in Table~\ref{tab:anonymization} annotating the corresponding vertex.
It returns the set of services whose profile matches a policy (column \emph{filtering}):
+As presented in Table~\ref{tab:exisnt}(a), each vertex is labeled with policies (column \emph{Vertex$\rightarrow$Policy}), and is associated with different candidate services (column \emph{Candidate}) and corresponding profile (column \emph{Profile}). The filtering algorithm matches each candidate service profile against the policies annotating the corresponding vertex (Table~\ref{tab:anonymization}). It returns the set of services whose profile satisfies a policy (column \emph{Filtering}):
\begin{enumerate*}[label=\textit{\roman*})]
- \item vertex \vi{5}, the filtering algorithm produces the set $S_{1}=\{s_{51},s_{52}\}$. Assuming that the dataset owner is ``CT'', the service profile of \s{51} matches \p{1} and the one of $\s{52}$ matches \p{2}. For $\s{53}$, there is no policy match and, thus, it is discarded;
- \item vertex \vi{6}, the filtering algorithm returns the set $S'_2=\{s_{62},s_{63}\}$. Assuming that the dataset region is ``CT'', the service profile of $\s{62}$ matches \p{5} and the one of $\s{63}$ matches \p{6}. For $\s{61}$, there is no policy match and, thus, it is discarded;
- \item vertex \vi{7}, the filtering algorithm returns the set $S'_3=\{s_{71},s_{72}\}$. Since policy \p{7} matches with any subject, the filtering algorithm keeps all services.
+ \item for vertex \vi{5}, the filtering algorithm produces the set $S'_{1}=\{s_{51},s_{52}\}$. Assuming that the dataset owner is ``CT'', the service profile of \s{51} matches \p{1} and the one of $\s{52}$ matches \p{2}. For $\s{53}$, there is no policy match and, thus, it is discarded;
+ \item for vertex \vi{6}, the filtering algorithm returns the set $S'_2=\{s_{62},s_{63}\}$. Assuming that the dataset region is ``CT'', the service profile of $\s{62}$ matches \p{5} and the one of $\s{63}$ matches \p{6}. For $\s{61}$, there is no policy match and, thus, it is discarded;
+ \item for vertex \vi{7}, the filtering algorithm returns the set $S'_3=\{s_{71},s_{72}\}$. Since policy \p{7} matches against any subject, the filtering algorithm keeps all services.
\end{enumerate*}
-For each vertex \vii{i}, we select the matching service \sii{j} from $S'_i$ and incorporate it into a valid instance. For instance, we select $\s{51}$ for \vi{5}; $\s{62}$ for \vi{6}, and $\s{71}$ for \vi{7}
-as depicted in \cref{tab:instance_example_valid}(a) (column \emph{instance}). We note that to move from a valid to an optimal instance, it is mandatory to evaluate candidate services based on specific quality metrics that reflect their impact on data quality, as discussed in the following of this paper.
+For each vertex \vii{i}, we select a matching service \sii{j} from $S'_i$ and incorporate it into a valid instance. For instance, we select $\s{51}$ for \vi{5}, $\s{62}$ for \vi{6}, and $\s{71}$ for \vi{7}, as depicted in \cref{tab:instance_example_valid}(a) (column \emph{Instance}). We note that to move from a valid to an optimal instance, it is mandatory to evaluate candidate services based on specific quality metrics that reflect their impact on data quality, as discussed in the remainder of this paper.
\begin{table*}
\def\arraystretch{1.5}
diff --git a/pipeline_template_example.tex b/pipeline_template_example.tex
index e514762..34330fc 100644
--- a/pipeline_template_example.tex
+++ b/pipeline_template_example.tex
@@ -1,4 +1,4 @@
-\begin{table*}[ht!]
+\begin{table*}[!t]
\def\arraystretch{1.5}
\centering
\caption{Anonymization policies (a) and data transformations (b)}\label{tab:anonymization}
diff --git a/system_model.tex b/system_model.tex
index 4278a98..313605b 100644
--- a/system_model.tex
+++ b/system_model.tex
@@ -38,7 +38,7 @@ \subsection{Reference Scenario}\label{sec:service_definition}
\cref{tab:dataset} presents a sample of the adopted dataset. Each row represents an inmate; the columns include the following attributes: date of download, a unique identifier, last entry date, race, gender, age of the individual, the bond value, offense, entry facility, and detainer. To serve the objectives of our study, we extended this dataset by introducing randomly generated first and last names.
-\begin{table*}[ht!]
+\begin{table*}[!t]
\caption{Dataset sample}
\label{tab:dataset}
\centering
@@ -74,7 +74,7 @@ \subsection{Reference Scenario}\label{sec:service_definition}
\item \emph{Data visualization}, including the visualization of the results.
\end{enumerate*}
-\begin{figure}[ht!]
+\begin{figure}[!t]
\centering
\begin{tikzpicture}[scale=0.9,y=-1cm]
% vertices