diff --git a/experiment.tex b/experiment.tex index 348b115..0c43764 100644 --- a/experiment.tex +++ b/experiment.tex @@ -8,14 +8,11 @@ \subsection{Testing Infrastructure and Experimental Settings}\label{subsec:exper The simulator first defines the pipeline template as a sequence of vertexes in the range $3-7$. We recall that alternative vertexes are modeled in different pipeline templates, while parallel vertexes only add a fixed execution time that is negligible and do not affect the quality of our approach. -Each vertex is associated with a (set of) policy with transformations varying in three classes: - -\begin{itemize*}[label=roman*] - \item \textit{Confident}: Adjusts data removal to a percentage within $[0.8,1]$. - \item \textit{Diffident}: Sets data removal percentage to $[0.33,0.5]$. - \item \textit{Average}: Modifies data removal percentage within $[0.33,1]$. +Each vertex is associated with a policy (or set of policies) whose transformations vary in two classes: +\begin{itemize*} + \item \average: data removal percentage within $[0.5,0.8]$. + \item \wide: data removal percentage within $[0.20,1]$. \end{itemize*} -set of functionally-equivalent candidate services is randomly generated. Upon setting the sliding window size, the simulator selects a subset of vertexes along with their corresponding candidate services. It then generates all possible service combinations for the chosen vertexes. @@ -128,14 +125,23 @@ \subsection{Perfomance}\label{subsec:experiments_performance} \subsection{Quality}\label{subsec:experiments_quality} We finally evaluated the quality of our heuristic by comparing, where possible, its results with the optimal solution retrieved by executing the exhaustive approach. The latter executes with a window size equal to the number of vertexes in the pipeline template, and provides the best solution among all possible ones. -We run our experiments in the three settings in Section \cref{subsec:experiments_infrastructure}, namely, confident, diffident, average, and varied: \emph{i)} the number of vertexes in the pipeline template in [3,6], \emph{ii)} the window size in [1,$|$max$_v$$|$], where max$_v$ is the number of vertexes in the pipeline template, and \emph{iii)} the number of candidate services for each vertex in the pipeline template in [2, 6]. +We ran our experiments in the two settings of \cref{subsec:experiments_infrastructure}, namely \average and \wide, varying: \emph{i)} the number of vertexes in the pipeline template in [3,7], \emph{ii)} the window size in [1,$|$max$_v$$|$], where max$_v$ is the number of vertexes in the pipeline template, and \emph{iii)} the number of candidate services for each vertex in the pipeline template in [2, 7]. -\cref{fig:quality_window_bad,fig:quality_window_average,fig:quality_window_good} presents our results using metric Jensen-Shannon Divergence. +\cref{fig:quality_window_average_perce,fig:quality_window_perce_wide} present our results according to the quantitative metrics in \cref{subsec:metrics} for the \average and \wide settings, respectively. +Values are normalized to the optimal solution retrieved by the exhaustive approach. +% +When the \wide setting is used (\cref{fig:quality_window_perce_wide}), the quality ratio ranges from 0.6 to 1, with the highest quality retrieved for the pipeline template with 3 vertices and the lowest with 7 vertices.
+% +In particular, the quality ratio ranges from 0.88 (greedy approach) to 1 (exhaustive approach) for a 3-vertex pipeline with a loss of 12\% in the worst case, from 0.81 to 0.92 for a 4-vertex pipeline with a loss of 8.7\%, from 0.84 to 0.89 for a 5-vertex pipeline with a loss of 5.61\%, from 0.8 to 0.89 for a 6-vertex pipeline with a loss of 10.11\%, and from 0.72 to 0.88 for a 7-vertex pipeline with a loss of 18.18\%. % -When a diffident setting is used, \cref{fig:quality_window_bad}, the quality range from 0.7 to 0.9, with the highest quality retrieved for the pipeline template with 3 vertices and the lowest with 7 vertices. -In particular, the quality ranges from 0.88 (greedy approach) to 0.93 (exhaustive approach) for a 3-vertex pipeline with a loss of 5,38\% in the worst case, from 0.84 to 0.92 for a 4-vertex pipeline with a loss of 8,7\%, from 0.84 to 0.89 for a 5-vertex pipeline with a loss of 5,61\%, from 0.8 to 0.89 for a 6-vertex pipeline with a loss of 10,11\%, and from 0.72 to 0.88 for a 7-vertex pipeline with a loss of 18,18\%. We note that the benefit of an increasing window size can be appreciated with lower numbers, reaching a sort of saturation around the average length (e.g., window of length 4 with a 7-vertex pipeline) where the quality with different length almost overlaps. The only exception is for 6-vertex pipeline where the overapping starts with window size 2. However, this might be due to the specific setting and therefore does not generalize. +We note that the benefit of an increasing window size is mostly appreciable at smaller window sizes, reaching a sort of saturation around the average length (e.g., a window of length 6 with a 7-vertex pipeline), where the quality ratios for different window sizes almost overlap. The only exception is the 6-vertex pipeline, where the overlap starts at window size 2. However, this might be due to the specific setting and therefore does not generalize. %Thus because the heuristic has more services to choose from and can find a better combination. We also observe that, as the window size increases, the quality increases as well. This suggests that the heuristic performs better when it has a broader perspective of the data it is governing. +It is worth noting that smaller window sizes are less stable, with the quality ratio varying significantly across configurations, while larger window sizes tend to stabilize the quality ratio across configurations. + +When the \average setting is used (\cref{fig:quality_window_average_perce}), the quality ratio in the worst case ranges from 0.72 to 0.96, with the highest quality retrieved for the pipeline template with 3 vertices and the lowest with 7 vertices. + + \hl{THIS FITS BETTER IN THE FINAL CONCLUSIONS.} Finally, the data suggest that while larger window sizes generally lead to better performance, there might exist a point where the balance between window size and performance is optimized. Beyond this point, the incremental gains in metric values may not justify the additional computational resources or the complexity introduced by larger windows.
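To make the comparison between the heuristic and the exhaustive approach concrete, the following Python sketch (illustrative only, and not the simulator used in the experiments above) builds a toy pipeline template, computes the optimal instance by exhaustive enumeration, and then applies a sliding-window selection for increasing window sizes, printing the resulting quality ratio. The column-removal quality model, the candidate generation, and all identifiers are assumptions made for illustration; the windowed selection only loosely mirrors the heuristic of Section~\ref{subsec:heuristics}.
\begin{verbatim}
import itertools
import random

random.seed(1)

COLUMNS = frozenset(range(20))   # toy dataset schema (20 columns)

def make_service(lo, hi):
    # A candidate service keeps a random subset of columns; the removal
    # percentage is drawn from [lo, hi] (cf. the average/wide classes).
    removal = random.uniform(lo, hi)
    n_keep = round(len(COLUMNS) * (1 - removal))
    return frozenset(random.sample(sorted(COLUMNS), k=n_keep))

# 6-vertex template with 4 candidate services per vertex (arbitrary sizes).
pipeline = [[make_service(0.2, 0.5) for _ in range(4)] for _ in range(6)]

def quality(services):
    # Fraction of columns surviving the whole chain: a crude stand-in
    # for the data quality metrics, not the metric used in the paper.
    kept = COLUMNS
    for s in services:
        kept &= s
    return len(kept) / len(COLUMNS)

def exhaustive(pipeline):
    # Optimal instance: enumerate every combination of candidate services.
    return max(quality(c) for c in itertools.product(*pipeline))

def sliding_window(pipeline, w):
    # Heuristic: optimize the combination inside window [i, i+w) given the
    # services already committed, then commit only the service at depth i.
    chosen = []
    for i in range(len(pipeline)):
        window = pipeline[i:i + w]
        best = max(itertools.product(*window),
                   key=lambda c: quality(chosen + list(c)))
        chosen.append(best[0])
    return quality(chosen)

opt = exhaustive(pipeline)
for w in range(1, len(pipeline) + 1):
    ratio = sliding_window(pipeline, w) / opt if opt else 0.0
    print(f"window={w}  quality ratio={ratio:.3f}")
\end{verbatim}
In this toy model a larger window simply gives the selection a broader view of downstream removals before committing a service, which is the qualitative effect observed in the figures.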
@@ -145,30 +151,30 @@ \subsection{Quality}\label{subsec:experiments_quality} \begin{subfigure}{0.33\textwidth} \includegraphics[width=\textwidth]{Images/graphs/newwindow_quality_performance_diff_perce_n7_s7_20_100_n3.png} \caption{3 vertices} - \label{fig:quality_window_average_3n} + \label{fig:quality_window_perce_wide_3n} \end{subfigure} \hfill \begin{subfigure}{0.33\textwidth} \includegraphics[width=\textwidth]{Images/graphs/newwindow_quality_performance_diff_perce_n7_s7_20_100_n4} \caption{4 vertices} - \label{fig:quality_window_average_4n} + \label{fig:quality_window_perce_wide_4n} \end{subfigure} \hfill \begin{subfigure}{0.33\textwidth} \includegraphics[width=\textwidth]{Images/graphs/newwindow_quality_performance_diff_perce_n7_s7_20_100_n5} \caption{5 vertices} - \label{fig:quality_window_average_5n} + \label{fig:quality_window_perce_wide_5n} \end{subfigure} \hfill \begin{subfigure}{0.33\textwidth} \includegraphics[width=\textwidth]{Images/graphs/newwindow_quality_performance_diff_perce_n7_s7_20_100_n6} \caption{6 vertices} - \label{fig:quality_window_average_6n} + \label{fig:quality_window_perce_wide_6n} \end{subfigure} \begin{subfigure}{0.33\textwidth} \includegraphics[width=\textwidth]{Images/graphs/newwindow_quality_performance_diff_perce_n7_s7_20_100_n7} \caption{6 vertices} - \label{fig:quality_window_average_7n} + \label{fig:quality_window_perce_wide_7n} \end{subfigure} % \hfill % \begin{subfigure}{0.33\textwidth} @@ -176,40 +182,40 @@ \subsection{Quality}\label{subsec:experiments_quality} % \caption{7 vertices} % \label{fig:third} % \end{subfigure} - \caption{ Quality evaluation with \textit{Confident} profile.} - \label{fig:quality_window_average} + \caption{ Quality evaluation with \wide profile.} + \label{fig:quality_window_perce_wide} \end{figure*} -\begin{figure*}[ht] +\begin{figure*}[h] \centering \begin{subfigure}{0.33\textwidth} \includegraphics[width=\textwidth]{Images/graphs/newwindow_quality_performance_diff_perce_n7_s7_50_89_n3} \caption{3 vertices} - \label{fig:quality_window_good_3n} + \label{fig:quality_window_average_perce_3n} \end{subfigure} \hfill \begin{subfigure}{0.33\textwidth} \includegraphics[width=\textwidth]{Images/graphs/newwindow_quality_performance_diff_perce_n7_s7_50_89_n4} \caption{4 vertices} - \label{fig:quality_window_good_4n} + \label{fig:quality_window_average_perce_4n} \end{subfigure} \hfill \begin{subfigure}{0.33\textwidth} \includegraphics[width=\textwidth]{Images/graphs/newwindow_quality_performance_diff_perce_n7_s7_50_89_n5} \caption{5 vertices} - \label{fig:quality_window_good_5n} + \label{fig:quality_window_average_perce_5n} \end{subfigure} \hfill \begin{subfigure}{0.33\textwidth} \includegraphics[width=\textwidth]{Images/graphs/newwindow_quality_performance_diff_perce_n7_s7_50_89_n6} \caption{6 vertices} - \label{fig:quality_window_good_6n} + \label{fig:quality_window_average_perce_6n} \end{subfigure} \begin{subfigure}{0.33\textwidth} \includegraphics[width=\textwidth]{Images/graphs/newwindow_quality_performance_diff_perce_n7_s7_50_89_n7} \caption{6 vertices} - \label{fig:quality_window_good_7n} + \label{fig:quality_window_average_perce_7n} \end{subfigure} % \hfill % \begin{subfigure}{0.33\textwidth} @@ -217,23 +223,23 @@ \subsection{Quality}\label{subsec:experiments_quality} % \caption{7 vertices} % \label{fig:third} % \end{subfigure} - \caption{ Quality evaluation with \textit{Confident} profile.} - \label{fig:quality_window_good} + \caption{ Quality evaluation with \average profile.} + \label{fig:quality_window_average_perce} 
\end{figure*} -\begin{figure*}[ht] +\begin{figure*}[h] \centering \begin{subfigure}{0.33\textwidth} \includegraphics[width=\textwidth]{Images/graphs/window_quality_performance_diff_qual_n7_s7_20_100_n3} \caption{3 vertices} - \label{fig:quality_window_average_qualitative_n3} + \label{fig:quality_window_wide_qualitative_n3} \end{subfigure} \hfill \begin{subfigure}{0.33\textwidth} \includegraphics[width=\textwidth]{Images/graphs/window_quality_performance_diff_qual_n7_s7_20_100_n4} \caption{4 vertices} - \label{fig:quality_window_average_qualitative_n4} + \label{fig:quality_window_wide_qualitative_n4} \end{subfigure} \hfill \begin{subfigure}{0.33\textwidth} @@ -245,81 +251,17 @@ \subsection{Quality}\label{subsec:experiments_quality} \begin{subfigure}{0.33\textwidth} \includegraphics[width=\textwidth]{Images/graphs/window_quality_performance_diff_qual_n7_s7_20_100_n6} \caption{6 vertices} - \label{fig:quality_window_average_qualitative_n6} + \label{fig:quality_window_wide_qualitative_n6} \end{subfigure} \begin{subfigure}{0.33\textwidth} \includegraphics[width=\textwidth]{Images/graphs/window_quality_performance_diff_qual_n7_s7_20_100_n7} \caption{6 vertices} - \label{fig:quality_window_average_qualitative_n7} + \label{fig:quality_window_wide_qualitative_n7} \end{subfigure} - % \hfill - % \begin{subfigure}{0.33\textwidth} - % \includegraphics[width=\textwidth]{Images/graphs/quality_plot_average_n7.eps} - % \caption{7 vertices} - % \label{fig:third} - % \end{subfigure} - \caption{ Quality evaluation with \textit{Confident} profile.} - \label{fig:quality_window_average_qualitative} + + \caption{ Quality evaluation with \wide profile.} + \label{fig:quality_window_wide_qualitative} \end{figure*} -% \begin{figure*}[ht!] -% \centering -% \begin{subfigure}{0.33\textwidth} -% \includegraphics[width=\textwidth]{Images/graphs/quality_plot_bad_percentage_n3.eps} -% \caption{3 vertices} -% \label{fig:quality_window_bad_percentage_a} -% \end{subfigure} -% \hfill -% \begin{subfigure}{0.33\textwidth} -% \includegraphics[width=\textwidth]{Images/graphs/quality_plot_bad_percentage_n4.eps} -% \caption{4 vertices} -% \label{fig:quality_window_bad_percentage_b} -% \end{subfigure} -% \hfill -% \begin{subfigure}{0.33\textwidth} -% \includegraphics[width=\textwidth]{Images/graphs/quality_plot_bad_percentage_n5.eps} -% \caption{5 vertices} -% \label{fig:quality_window_bad_percentage_c} -% \end{subfigure} -% \hfill -% \begin{subfigure}{0.33\textwidth} -% \includegraphics[width=\textwidth]{Images/graphs/quality_plot_bad_percentage_n6.eps} -% \caption{6 vertices} -% \label{fig:quality_window_bad_percentage_d} -% \end{subfigure} -% % \hspace{0.04\textwidth} -% % \begin{subfigure}{0.33\textwidth} -% % \includegraphics[width=\textwidth]{Images/graphs/quality_plot_bad_n7.eps} -% % \caption{7 vertices} -% % \label{fig:quality_window_bad_percentage_e} -% % \end{subfigure} - -% \caption{Quality Percentage evaluation with \textit{Diffident} profile.} -% \label{fig:quality_window_bad_percentage} -% \end{figure*} - -% % We recall that we considered three different setting, confident, diffident, average, varying the policy transformations, that is, the amount of data removal at each vertex. Setting confident assigns to each policy a transformation that changes the amount of data removal in the interval [x,y] (Jaccard coefficient) or decreases the probability distribution dissimilarity in the interval [x,y] (Jensen-Shannon Divergence). 
Setting diffident assigns to each policy a transformation that changes the amount of data removal in the interval [x,y] (Jaccard coefficient) or decreases the probability distribution dissimilarity in the interval [x,y] (Jensen-Shannon Divergence). Setting average assigns to each policy a transformation that changes the amount of data removal in the interval [x,y] (Jaccard coefficient) or decreases the probability distribution dissimilarity in the interval [x,y] (Jensen-Shannon Divergence). -% We finally evaluated the quality of our heuristic comparing, where possible, -% its results with the optimal solution retrieved by executing the exhaustive approach. -% The latter executes with window size equals to the number of services per vertex and provides the best, -% among all possible, solution. - -% The number of vertexes has been varied from 3 to 7, while the number of services per vertex has been set from 2 to 6. -% The experiments have been conducted with different service data pruning profiles. - -% \hl{DOBBIAMO SPIEGARE COSA ABBIAMO VARIATO NEGLI ESPERIMENTI E COME, WINDOW SIZE, NODI, ETC. - -% LE IMMAGINI CHE ABBIAMO SONO SOLO QUELLE 5? POSSIAMO ANCHE INVERTIRE GLI ASSI E AGGIUNGERE VISUALI DIVERSE} - -% <<<<<<< HEAD -% \cref{fig:quality_window} presents our results with setting \hl{confident} and metric Jaccard coefficient. \cref{fig:quality_window}(a)--(e) \hl{aggiungere le lettere e uniformare l'asse y} present the retrieved quality varying the number of vertexes in [3, 7], respectively. Each figure in \cref{fig:quality_window}(a)--(e) varies the number of candidate services at each node in the range [2, 6] and the window size W in the range [1, $|$vertexes$|$]. -% \hl{aggiungiamo i numeri piu significativi (asse y).} -% From the results, some clear trends emerge. As the number of vertexes increases, the metric values tend to decrease (better data quality) as the window size increases across different node configurations. -% This suggests that the heuristic performs better when it has a broader perspective of the data and services. The trend is consistent across all node cardinalities (from three to seven), indicating that the heuristic's enhanced performance with larger window sizes is not confined to a specific setup but rather a general characteristic of its behavior. -% Finally, the data suggest that while larger window sizes generally lead to better performance, -% there might exist a point where the balance between window size and performance is optimized. \hl{For instance, ...} -% Beyond this point, the incremental gains in metric values may not justify the additional computational resources or the complexity introduced by larger windows. - -% \hl{RIPETERE PER TUTTI I SETTINGS} \begin{figure} @@ -328,19 +270,3 @@ \subsection{Quality}\label{subsec:experiments_quality} \label{fig:perf_exhaustive} \end{figure} -% \begin{figure}[!t] -% \includegraphics[width=0.95\columnwidth]{graphs/window_performance.eps} -% \caption{Preliminary performance evaluation.\hl{METTERE LE 4 IMG NON UN'UNICA EPS}} -% \label{fig:perf_window} -% \end{figure} - - -% \begin{figure}[!t] -% \includegraphics[width=0.95\columnwidth]{graphs/window_quality.eps} -% \caption{Quality evaluation.\hl{METTERE LE 4 IMG NON UN'UNICA EPS}} -% \label{fig:quality_window} -% \end{figure} -%======= - - -%In the figures each chart represents a configuration with a specific number of vertexes, ranging from 3 to 7. On the x-axis, the number of services is plotted, which ranges from 2 to 7. 
The y-axis represents the metric value. Each chart shows different window sizes, labeled as W Size 1, W Size 2, and so on, up to the maximum window size. diff --git a/macro.tex b/macro.tex index 130ff1a..26618ac 100644 --- a/macro.tex +++ b/macro.tex @@ -69,7 +69,8 @@ \newcommand{\pthree}{$\langle service\_owner \neq dataset\_owner AND owner \neq partner(dataset\_owner)$} % \newcommand{\function}{$\instanceChartAnnotation{}$} % \newcommand{\function}{$\templateChartAnnotation$} - +\newcommand{\average}{\textit{average}\xspace} +\newcommand{\wide}{\textit{wide}\xspace} \newcommand{\problem}{Pipeline Instantiation Process } \newcommand{\xmark}{\ding{55}}% \newcommand{\cmark}{\ding{51}}% diff --git a/main.pdf b/main.pdf index e722215..855f263 100644 Binary files a/main.pdf and b/main.pdf differ diff --git a/metrics.tex b/metrics.tex index 4fd90bc..7565ba1 100644 --- a/metrics.tex +++ b/metrics.tex @@ -25,22 +25,20 @@ \subsubsection{Jaccard coefficient} \subsubsection{Qualitative Metric} -We propose a metric that enables the measurement of differences in statistical terms between two distributions. The suggested metric is based on the well-known Jensen-Shannon Divergence, which is defined as follows: +We propose a metric that enables the measurement of the distance between two distributions. The suggested metric is based on the well-known Jensen-Shannon Divergence, which is defined as follows: The Jensen-Shannon divergence (JSD) is a qualitative metric that can be used to measure the dissimilarity between the probability distributions of two datasets. -It is a symmetrized version of the KL divergence~\cite{Fuglede}. - -The JSD between X and Y is defined as: - +It is a symmetrized version of the KL divergence~\cite{Fuglede} and is defined as: \[JSD(X, Y) = \frac{1}{2} \left( KL(X || M) + KL(Y || M) \right)\] - -where X and Y are two datasets of the same size, and M$=$0.5*(X+Y) is the average distribution. - +% +where X and Y are two distributions defined over the same support, and $M=\frac{1}{2}(X+Y)$ is the average distribution. JSD incorporates both the KL divergence from X to M and from Y to M. It provides a balanced measure of dissimilarity that is symmetric and accounts for the contribution from both datasets. % JSD can compare the dissimilarity of the two datasets, providing a symmetric and normalized measure that considers the overall data distribution. % -It provides a more comprehensive understanding of the dissimilarity between X and Y, taking into account the characteristics of both datasets. +However, the JSD is applicable solely to statistical distributions and not directly to datasets. Therefore, our metric is computed by applying the JSD to each column of the dataset. The results obtained are then aggregated using a weighted average. The weights are determined by the ratio of distinct elements to the total number of elements in the column, using the following formula: +\[\text{Weighted JSD} = \sum_{i=1}^n w_i \cdot \text{JSD}_i\] +where \(w_i = \frac{n_i}{N}\) represents the weight for the \(i\)-th column, with \(n_i\) being the number of distinct elements in that column and \(N\) the total number of elements in the dataset. Each \(\text{JSD}_i\) is the Jensen-Shannon Divergence computed for the \(i\)-th column.
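For concreteness, the sketch below (Python) computes this column-wise weighted JSD on two datasets represented as column dictionaries. It assumes categorical (or suitably discretized) columns and takes the distinct-element counts $n_i$ on the first dataset; the representation and the function names are illustrative choices, not part of our implementation.
\begin{verbatim}
from collections import Counter
import math

def js_divergence(col_x, col_y):
    # Jensen-Shannon divergence between the empirical distributions of two
    # columns (base-2 logarithm, so the value lies in [0, 1]).
    support = set(col_x) | set(col_y)
    cx, cy = Counter(col_x), Counter(col_y)
    P = {v: cx[v] / len(col_x) for v in support}
    Q = {v: cy[v] / len(col_y) for v in support}
    M = {v: 0.5 * (P[v] + Q[v]) for v in support}
    def kl(a, b):
        return sum(a[v] * math.log2(a[v] / b[v]) for v in support if a[v] > 0)
    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)

def weighted_jsd(X, Y):
    # X, Y: dicts mapping column name -> list of values (same schema).
    # w_i = n_i / N, with n_i the distinct values of column i and N the
    # total number of elements in X (an assumption on where counts are taken).
    N = sum(len(col) for col in X.values())
    return sum((len(set(col)) / N) * js_divergence(col, Y[name])
               for name, col in X.items())

X = {"age": [30, 30, 41, 52], "city": ["a", "b", "b", "c"]}
Y = {"age": [30, 41, 41, 52], "city": ["a", "a", "b", "c"]}
print(weighted_jsd(X, Y))  # 0.0 would indicate identical column distributions
\end{verbatim}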
\vspace{0.5em} @@ -61,40 +59,40 @@ \subsection{NP-Hardness of the Max Quality Pipeline Instantiation Process}\label \emph{Proof: } The proof is a reduction from the multiple-choice knapsack problem (MCKP), a classified NP-hard combinatorial optimization problem, which is a generalization of the simple knapsack problem (KP) \cite{}. In the MCKP problem, there are $t$ mutually disjoint classes $N_1,N_2,\ldots,N_t$ of items to pack in some knapsack of capacity $C$, class $N_i$ having size $n_i$. Each item $j$$\in$$N_i$ has a profit $p_{ij}$ and a weight $w_{ij}$; the problem is to choose one item from each class such that the profit sum is maximized without having the weight sum to exceed C. - The MCKP can be reduced to the Max quality \problem in polynomial time, with $N_1,N_2,\ldots,N_t$ corresponding to $S^c_{1}, S^c_{1}, \ldots, S^c_{u},$, $t$$=$$u$ and $n_i$ the size of $S^c_{i}$. The profit $p_{ij}$ of item $j$$\in$$N_i$ corresponds to \textit{dtloss}$_{ij}$ computed for each candidate service $s_j$$\in$$S^c_{i}$, while $w_{ij}$ is uniformly 1 (thus, C is always equal to the cardinality of $V_C$). +The MCKP can be reduced to the Max quality \problem in polynomial time, with $N_1,N_2,\ldots,N_t$ corresponding to $S^c_{1}, S^c_{2}, \ldots, S^c_{u}$, $t$$=$$u$ and $n_i$ the size of $S^c_{i}$. The profit $p_{ij}$ of item $j$$\in$$N_i$ corresponds to \textit{dtloss}$_{ij}$ computed for each candidate service $s_j$$\in$$S^c_{i}$, while $w_{ij}$ is uniformly 1 (thus, C is always equal to the cardinality of $V_C$). - Since the reduction can be done in polynomial time, our problem is also NP-hard. (non è sufficiente, bisogna provare che la soluzione di uno e' anche soluzione dell'altro) +Since the reduction can be done in polynomial time, our problem is also NP-hard. (This is not sufficient; we must also prove that a solution of one problem is also a solution of the other.) - \begin{example}[Max-Quality Pipeline Instance] - Let us consider a subset \{\vi{5}, \vi{6}, \vi{7}\} of the pipeline template \tChartFunction in \cref{sec:example_instace}. - Each vertex is associated with three candidate services, each having a profile. The filtering algorithm matches each candidate service's profile with the policies annotating the corresponding vertex. It returns the set of services whose profile matches a policy. +\begin{example}[Max-Quality Pipeline Instance] + Let us consider a subset \{\vi{5}, \vi{6}, \vi{7}\} of the pipeline template \tChartFunction in \cref{sec:example_instace}. + Each vertex is associated with three candidate services, each having a profile. The filtering algorithm matches each candidate service's profile with the policies annotating the corresponding vertex. It returns the set of services whose profile matches a policy. - The comparison algorithm is then applied to the set of services $S'_*$ and it returns a ranking of the services. - The ranking is based on the amount of data that is anonymized by the service. - The ranking is listed in \cref{tab:instance_example_maxquality} and it is based on the transformation function of the policies, - assuming that a more restrictive transformation function anonymizes more data affecting negatively the position in the ranking. - For example, \s{11} is ranked first because it anonymizes less data than \s{12} and \s{13}. - The ranking of \s{22} and \s{23} is based on the same logic. - Finally, the ranking of \s{31}, \s{32} is influenced by the environment state at the time of the ranking.
- For example, if the environment in which the visualization is performed is a CT facility, then \s{31} is ranked first and \s{32} second; - thus because the facility is considered a less risky environment than the cloud. + The comparison algorithm is then applied to the set of services $S'_*$ and it returns a ranking of the services. + The ranking is based on the amount of data that is anonymized by the service. + The ranking is listed in \cref{tab:instance_example_maxquality} and it is based on the transformation function of the policies, + assuming that a more restrictive transformation function anonymizes more data affecting negatively the position in the ranking. + For example, \s{11} is ranked first because it anonymizes less data than \s{12} and \s{13}. + The ranking of \s{22} and \s{23} is based on the same logic. + Finally, the ranking of \s{31}, \s{32} is influenced by the environment state at the time of the ranking. + For example, if the environment in which the visualization is performed is a CT facility, then \s{31} is ranked first and \s{32} second; + thus because the facility is considered a less risky environment than the cloud. - \end{example} +\end{example} - % The metrics established will enable the quantification of data loss pre- and post-transformations. - % In the event of multiple service interactions, each with its respective transformation, - % efforts will be made to minimize the loss of information while upholding privacy and security standards. - % Due to the exponential increase in complexity as the number of services and transformations grow, - % identifying the optimal path is inherently an NP-hard problem. - % As such, we propose some heuristics to approximate the optimal path as closely as possible. - %To evaluate their efficacy, the heuristically generated paths will be compared against the optimal solution. +% The metrics established will enable the quantification of data loss pre- and post-transformations. +% In the event of multiple service interactions, each with its respective transformation, +% efforts will be made to minimize the loss of information while upholding privacy and security standards. +% Due to the exponential increase in complexity as the number of services and transformations grow, +% identifying the optimal path is inherently an NP-hard problem. +% As such, we propose some heuristics to approximate the optimal path as closely as possible. +%To evaluate their efficacy, the heuristically generated paths will be compared against the optimal solution. - \subsection{Heuristic}\label{subsec:heuristics} - %The computational challenge posed by the enumeration of all possible combinations within a given set is a well-established NP-hard problem.} - %The exhaustive exploration of such combinations swiftly becomes impractical in terms of computational time and resources, particularly when dealing with the analysis of complex pipelines. - %In response to this computational complexity, the incorporation of heuristic emerges as a strategy to try to efficiently address the problem. - \hl{HO RIVISTO IL PARAGRAFO VELOCEMENTE GIUSTO PER DARE UN'INDICAZIONE. DOBBIAMO USARE LA FORMALIZZAZIONE E MAGARI FORMALIZZARE ANCHE LO PSEUDOCODICE.} We design and implement a heuristic algorithm for computing the pipeline instance maximizing data quality. Our heuristic is built on a \emph{sliding window} and aims to minimize information loss according to quality metrics. 
At each step, a set of vertexes in the pipeline template $\tChartFunction$ is selected according to a specific window w=[i,j], where $i$ and $j$ are the starting and ending depth of window w. Service filtering and selection in Section~\ref{sec:instance} are then executed to minimize information loss in window w. The heuristic returns as output the list of services instantiating vertexes at depth $i$. A new window w=[i+1,j+1] is considered until $j$+1 is equal to the max depth of $\tChartFunction$, that is the window reaches the end of the template. +\subsection{Heuristic}\label{subsec:heuristics} +%The computational challenge posed by the enumeration of all possible combinations within a given set is a well-established NP-hard problem.} +%The exhaustive exploration of such combinations swiftly becomes impractical in terms of computational time and resources, particularly when dealing with the analysis of complex pipelines. +%In response to this computational complexity, the incorporation of heuristic emerges as a strategy to try to efficiently address the problem. +\hl{I REVISED THIS PARAGRAPH QUICKLY JUST TO GIVE AN INDICATION. WE SHOULD USE THE FORMALIZATION AND PERHAPS ALSO FORMALIZE THE PSEUDOCODE.} We design and implement a heuristic algorithm for computing the pipeline instance maximizing data quality. Our heuristic is built on a \emph{sliding window} and aims to minimize information loss according to quality metrics. At each step, a set of vertexes in the pipeline template $\tChartFunction$ is selected according to a specific window w=[i,j], where $i$ and $j$ are the starting and ending depth of window w. Service filtering and selection in Section~\ref{sec:instance} are then executed to minimize information loss in window w. The heuristic returns as output the list of services instantiating vertexes at depth $i$. A new window w=[i+1,j+1] is considered until $j$+1 is equal to the max depth of $\tChartFunction$, that is, until the window reaches the end of the template. %For example, in our service selection problem where the quantity of information lost needs to be minimized, the sliding window algorithm can be used to select services composition that have the lowest information loss within a fixed-size window. This strategy ensures that only services with low information loss are selected at each step, minimizing the overall information loss. Pseudo-code for the sliding window algorithm is presented in Algorithm 1. diff --git a/system_model.tex b/system_model.tex index 8a77995..e64d5cc 100644 --- a/system_model.tex +++ b/system_model.tex @@ -1,17 +1,17 @@ \section{System Model and Service Pipeline}\label{sec:requirements} \st{Big data is highly dependent on cloud-edge computing, which makes extensive use of multitenancy. -Multitenancy permits sharing one instance of infrastructures, platforms or applications by multiple tenants to optimize costs. This leads to common scenarios where a service provider offers subscription-based analytics capabilities in the cloud, or a single data lake is accessed by multiple customers. Big data pipelines then mix data and services which belong to various organizations, posing a serious risk of potential privacy and security violations. -We propose a data governance framework tailored to contemporary data-driven pipelines, which aims to limit the privacy and security risks.
The primary objective of this framework is to facilitate the assembly of data processing services, with a central focus on the selection of those services that optimize data quality, while upholding privacy and security requirements.} + Multitenancy permits sharing one instance of infrastructures, platforms or applications by multiple tenants to optimize costs. This leads to common scenarios where a service provider offers subscription-based analytics capabilities in the cloud, or a single data lake is accessed by multiple customers. Big data pipelines then mix data and services which belong to various organizations, posing a serious risk of potential privacy and security violations. + We propose a data governance framework tailored to contemporary data-driven pipelines, which aims to limit the privacy and security risks. The primary objective of this framework is to facilitate the assembly of data processing services, with a central focus on the selection of those services that optimize data quality, while upholding privacy and security requirements.} In the following of this section, we present our system model (Section \ref{sec:systemmodel}) and our reference scenario (Section \ref{sec:service_definition}). \subsection{System Model}\label{sec:systemmodel} \st{In today's data landscape, the coexistence of data quality and data privacy is critical to support high-value services and pipelines. The increase in data production, collection, and usage has led to a split in scientific research priorities. -%This has resulted in two main focus areas. -First, researchers are exploring methods to optimize the usage of valuable data. Here, ensuring data quality is vital, and requires accuracy, reliability, and soundness for analytical purposes. -Second, there is a need to prioritize data privacy and security. This involves safeguarding confidential information and complying with strict privacy regulations. These two research directions are happening at the same time, though there are not many solutions that find a good balance between them. + %This has resulted in two main focus areas. + First, researchers are exploring methods to optimize the usage of valuable data. Here, ensuring data quality is vital, and requires accuracy, reliability, and soundness for analytical purposes. + Second, there is a need to prioritize data privacy and security. This involves safeguarding confidential information and complying with strict privacy regulations. These two research directions are happening at the same time, though there are not many solutions that find a good balance between them. -Our approach seeks to harmonize these objectives by establishing a data governance framework that balances privacy and data quality. } + Our approach seeks to harmonize these objectives by establishing a data governance framework that balances privacy and data quality. } Our system model is derived by a generic big-data framework and is composed of the following parties: \begin{description} \item[Service,] a software distributed by a \textbf{service provider} that performs a specific task \st{according to access control privileges on data}; %, a service can be tagged with some policies %, a service is characterized by two function: the service function and the policy function.