updated section 6 - intro and section 6.2

cb-unimi · cb-unimi · commit 85dc5a07c134 · 2024-02-23T15:04:33.000+01:00
diff --git a/metrics.tex b/metrics.tex
@@ -1,8 +1,13 @@
 \section{Maximizing the Pipeline Instance Quality}\label{sec:heuristics}
-The goal of this paper is to produce a pipeline instance in Definition~\ref{def:instance} with maximum quality. To this aim, we first highlight the crucial role of well-defined metrics (\cref{sec:metrics}) in assessing data quality across the pipeline. Inspired by existing literature, these metrics, categorized as quantitative and statistical, play a pivotal role in quantifying the impact of policy-driven transformations on the original dataset.
-We then prove that the problem of maximizing the pipeline instance quality is NP-hard (\cref{sec:nphard}). We finally present a parametric heuristic (\cref{subsec:heuristics}) tailored to address the computational complexities associated with enumerating all possible combinations within a given set. The primary aim of the heuristic is to approximate the optimal path for service interactions and transformations, particularly within the landscape of more complex pipelines composed numerous nodes and candidate services.
+    
+  %
+  % %Obviously, choosing the best service for each vertex is not sufficient; it becomes a complex problem in which all possible combinations of the available services must be computed/evaluated, among which the best one is chosen.    
+The goal of this paper is to produce a pipeline instance with maximum quality, i.e., one that guarantees a high level of data protection while minimizing the amount of information lost across the pipeline. To this aim, we first discuss the crucial role of well-defined metrics (\cref{sec:metrics}) in specifying and measuring data quality, and describe the ones used in the paper. 
+Then, we prove that the problem of maximizing the pipeline instance quality is NP-hard (\cref{sec:nphard}). Finally, we present a parametric heuristic (\cref{subsec:heuristics}) tailored to address the computational complexity of enumerating all possible combinations of candidate services. The primary aim of the heuristic is to approximate the optimal path for service interactions and transformations, particularly in complex pipelines composed of numerous nodes and candidate services.
 Our focus extends beyond identifying optimal combinations, encompassing an understanding of the quality changes introduced during the transformation processes.
 
+%Inspired by existing literature, these metrics, categorized as quantitative and statistical, play a pivotal role in quantifying the impact of policy-driven transformations on the original dataset.
+
 \subsection{Quality Metrics}\label{sec:metrics}
 Ensuring data quality is essential for implementing data pipelines that provide high-quality results and support decision-making along the whole pipeline execution. To this aim, we define a set of metrics evaluating the quality loss introduced by our policy-driven transformation in Section~\cite{ADD} on the input dataset \origdataset\ at each step of the data pipeline. Our metrics can be classified as \emph{quantitative} and \emph{statistical}~\cite{ADD}, and compare the input dataset \origdataset\ and dataset \transdataset\ generated by enforcing data protection requirements on \origdataset. 
 
@@ -55,32 +60,25 @@ \subsubsection{Weighted Jensen-Shannon Divergence}
 
 By incorporating weights into the JSD calculation, WJSD provides a more accurate measure of dissimilarity between X and Y, considering the importance of individual elements based on the assigned weights. This approach is particularly useful when elements in the datasets have varying levels of significance, enabling a more tailored analysis of dissimilarity.
 
-\subsection{NP-Hardness of the Pipeline Instantiation Process}\label{sec:nphard}
-We need to\\
-1 - define the quality metric\\
-2 - define the problem as a max-instance problem, i.e., the definition of an instance with max quality\\
-3 - state that it is NP-Hard
-
-\begin{problem}
-
-\end{problem}
+\subsection{NP-Hardness of the Max Quality Pipeline Instantiation Process}\label{sec:nphard}
 
-Note that while the overall problem is NP-hard, there is a component of the problem that is solvable in polynomial time: matching the profile of each service with the node policy.
-This can be done by iterating over each node and each service, checking if the service matches the node’s policy.
-This process would take $O(|N|*|S|)$ time. This is polynomial time complexity.
+ \begin{definition}[Max Quality Pipeline Instantiation Process]\label{def:MaXQualityInstance}
+Let \textit{dtloss}$_i$ be the value of the quality metrics computed after applying the transformation of the policy matching the service selected to instantiate vertex \vi{i}$\in$$\V_S$. The Max Quality \problem is the case in which the \emph{pipeline instantiation} function returns a \pipelineInstance where the sum of \textit{dtloss}$_i$ over all vertices is maximized.
+\end{definition}
+ 
+The Max Quality \problem is a combinatorial selection problem and is NP-hard, as stated by the following theorem.
+Note that, while the overall problem is NP-hard, one component of the problem is solvable in polynomial time: matching the profile of each service against the node policy. This can be done by iterating over each node and each service and checking whether the service matches the node's policy, which takes $O(|N|\cdot|S|)$ time.
 
-The \problem is NP-hard, as stated by the following theorem
 \begin{theorem}
-  The \problem is NP-Hard
+  The Max Quality \problem is NP-hard.
 \end{theorem}
-\emph{Proof: }
-
-The proof is a reduction from the NP-hard problem. We map each service s in S to an item in the Knapsack Problem.
-The value of the item is equivalent to the calculated metric, and the weight of the item is uniformly 1, as we can choose each service once. The capacity of the knapsack is set to the number of nodes.
-Our problem can now be viewed as a variant of the Knapsack Problem: find the subset of items(services)
-that maximizes the total value (score) without exceeding the capacity of the knapsack (number of nodes).
-The Knapsack Problem is NP-hard.
-Since our problem can be reduced to the Knapsack Problem in polynomial time, our problem is also NP-hard.
+\emph{Proof: } 
+The proof is a reduction from the multiple-choice knapsack problem (MCKP), a well-known NP-hard combinatorial optimization problem, which is a generalization of the simple knapsack problem (KP) \cite{}. In the MCKP, there are $t$ mutually disjoint classes $N_1,N_2,\ldots,N_t$ of items to pack in a knapsack of capacity $C$, class $N_i$ having size $n_i$. Each item $j \in N_i$ has a profit $p_{ij}$ and a weight $w_{ij}$; the problem is to choose exactly one item from each class such that the profit sum is maximized without the weight sum exceeding $C$. 
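+
+For reference, the MCKP described above is commonly written as the following integer program. This is only a sketch of the standard formulation: the binary variable $x_{ij}$ is a symbol introduced here for illustration (it is not part of the notation used elsewhere in the paper) and equals 1 when item $j$ of class $N_i$ is selected; the \texttt{align*} environment assumes that \texttt{amsmath} is loaded.
+\begin{align*}
+\max\quad & \sum_{i=1}^{t}\sum_{j \in N_i} p_{ij}\,x_{ij}\\
+\text{s.t.}\quad & \sum_{i=1}^{t}\sum_{j \in N_i} w_{ij}\,x_{ij} \le C,\\
+& \sum_{j \in N_i} x_{ij} = 1, \quad i=1,\ldots,t,\\
+& x_{ij} \in \{0,1\}, \quad i=1,\ldots,t,\ j \in N_i.
+\end{align*}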
+
+The MCKP can be reduced to the Max Quality \problem in polynomial time, with $N_1,N_2,\ldots,N_t$ corresponding to $S^c_{1}, S^c_{2}, \ldots, S^c_{u}$, $t=u$, and $n_i$ the size of $S^c_{i}$. The profit $p_{ij}$ of item $j \in N_i$ corresponds to \textit{dtloss}$_{ij}$ computed for each candidate service $s_j \in S^c_{i}$, while $w_{ij}$ is uniformly 1 (thus, $C$ is always equal to the cardinality of $V_C$).
+
+Since the reduction can be done in polynomial time, our problem is also NP-hard. (This is not sufficient: we must also prove that a solution of one problem is also a solution of the other.)
+
 
 \begin{example}[Max-Quality Pipeline Instance]
 \end{example}