Commit d2ac76d
antongiacomo committed Apr 26, 2024
1 parent 0b8b25b
Showing 4 changed files with 24 additions and 23 deletions.
4 changes: 2 additions & 2 deletions experiment.tex
@@ -9,10 +9,10 @@ \subsection{Testing Infrastructure and Experimental Settings}\label{subsec:exper
We recall that alternative vertexes are modeled in different pipeline templates,
while parallel vertexes only add a fixed execution time that is negligible and do not affect the quality of our approach.
Each vertex is associated with a (set of) policies whose transformations vary in two classes:
\begin{enumerate*}[label=\textit{\roman*})]
\item \average: data removal percentage within $[0.5,0.8]$;
\item \wide: data removal percentage within $[0.20,1]$.
\end{enumerate*}

Upon setting the sliding window size, the simulator selects a subset of vertexes along with their corresponding candidate services.
It then generates all possible service combinations for the chosen vertexes.
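As a sketch of this enumeration step (vertex and service identifiers below are illustrative, not the simulator's actual data structures), the combinations can be generated as the Cartesian product of the candidate sets:

\begin{lstlisting}[language=Python]
from itertools import product

# Hypothetical candidate services for the vertexes in the current window.
candidates = {
    "v5": ["s51", "s52"],
    "v6": ["s61", "s62"],
    "v7": ["s71", "s72", "s73"],
}

# One combination assigns exactly one candidate service to each vertex.
combinations = list(product(*candidates.values()))
print(len(combinations))  # 2 * 2 * 3 = 12 combinations
\end{lstlisting}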
27 changes: 13 additions & 14 deletions metrics.tex
@@ -14,22 +14,23 @@ \subsection{Quality Metrics}\label{subsec:metrics}

In this paper, we provide two metrics, one quantitative and one qualitative, that compare the input dataset \origdataset\ and the dataset \transdataset\ generated by enforcing data protection requirements (i.e., our policy-driven transformation in Section~\cite{ADD}) on \origdataset\ at each step of the data pipeline.

\subsubsection{Quantitative Metric}
We propose a metric that measures the similarity between two datasets; to this end, we use the Jaccard coefficient.
The Jaccard coefficient is a quantitative metric that can be employed to assess the similarity between the elements of two datasets.
It is defined as follows:\[J(X,Y) = \frac{|X \cap Y|}{|X \cup Y|}\]
where $X$ and $Y$ are two datasets of the same size.

The coefficient is calculated by dividing the cardinality of the intersection of two sets by the cardinality of their union. It ranges from 0 to 1, with 0 indicating no similarity and 1 indicating complete similarity between the datasets.
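The computation can be sketched in a few lines of Python (dataset contents here are illustrative):

\begin{lstlisting}[language=Python]
def jaccard(x: set, y: set) -> float:
    # Two empty datasets are treated as identical by convention.
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)

original = {"alice", "bob", "carol", "dave"}
transformed = {"alice", "bob", "carol"}  # one record removed by a policy
print(jaccard(original, transformed))  # 0.75
\end{lstlisting}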

This metric has several advantages. Unlike other similarity measures, such as Euclidean distance, it is not affected by the magnitude of the values in the dataset. It is suitable for datasets with categorical variables or nominal data, where the values do not have a meaningful numerical interpretation.

\subsubsection{Weighted Jaccard coefficient}
The Jaccard coefficient can be extended with weights that model the importance of each element in the dataset.
It is defined as follows:\[\text{Weighted }J(X,Y) = \frac{\sum_{i=1}^{n}w_i(x_i \cap y_i)}{\sum_{i=1}^{n}w_i(x_i \cup y_i)}\]
where $X$ and $Y$ are two datasets of the same size and $w_i$ is the weight of the $i$-th element.

It is computed by dividing the weighted cardinality of the intersection of two datasets by the weighted cardinality of their union, where weights capture the importance of each element. Weights prioritize certain elements (e.g., a specific feature) in the datasets.
The Weighted Jaccard coefficient can then account for element importance and provide a more accurate measure of similarity.
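One possible reading of the formula, assuming per-element weights with a default of 1 (weights and dataset contents below are illustrative), is:

\begin{lstlisting}[language=Python]
def weighted_jaccard(x: set, y: set, w: dict) -> float:
    # Sum the weights of shared elements over the weights of all elements.
    inter = sum(w.get(e, 1.0) for e in x & y)
    union = sum(w.get(e, 1.0) for e in x | y)
    return inter / union if union else 1.0

weights = {"ssn": 3.0, "zip": 1.0, "age": 1.0}
print(weighted_jaccard({"ssn", "zip", "age"}, {"zip", "age"}, weights))
# (1 + 1) / (3 + 1 + 1) = 0.4: losing the highly weighted "ssn"
# lowers the similarity more than losing a low-weight element would.
\end{lstlisting}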

\subsubsection{Qualitative Metric}
We propose a metric that measures the distance between two distributions. The suggested metric is based on the well-known Jensen-Shannon Divergence (JSD), a qualitative metric that can be used to measure the dissimilarity between the probability distributions of two datasets.
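For reference, the standard definition, in terms of the Kullback-Leibler divergence $D_{KL}$, is:
\[JSD(P \parallel Q) = \frac{1}{2}D_{KL}(P \parallel M) + \frac{1}{2}D_{KL}(Q \parallel M), \qquad M = \frac{1}{2}(P + Q)\]
where $P$ and $Q$ are the probability distributions of the two datasets.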
@@ -76,7 +77,7 @@ \subsection{NP-Hardness of the Max Quality Pipeline Instantiation Process}\label

The comparison algorithm is then applied to the set of services $S'_*$ and it returns a ranking of the services.
The ranking is based on the amount of data that is anonymized by the service.
The ranking is listed in \cref{tab:instance_example_maxquality}(b) and is based on the transformation function of the policies,
assuming that a more restrictive transformation function anonymizes more data, which negatively affects the position in the ranking.
For example, \s{11} is ranked first because it anonymizes less data than \s{12} and \s{13}.
The ranking of \s{22} and \s{23} is based on the same logic.
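A sketch of this ranking logic, assuming each service exposes the data removal percentage of its policy's transformation (identifiers and percentages below are illustrative):

\begin{lstlisting}[language=Python]
# Lower removal percentage = less data anonymized = better rank.
removal = {"s11": 0.2, "s12": 0.5, "s13": 0.8}
ranking = sorted(removal, key=removal.get)
print(ranking)  # ['s11', 's12', 's13'] -> s11 is ranked first
\end{lstlisting}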
@@ -98,7 +99,7 @@ \subsection{Heuristic}\label{subsec:heuristics}
%The computational challenge posed by the enumeration of all possible combinations within a given set is a well-established NP-hard problem.}
%The exhaustive exploration of such combinations swiftly becomes impractical in terms of computational time and resources, particularly when dealing with the analysis of complex pipelines.
%In response to this computational complexity, the incorporation of heuristic emerges as a strategy to try to efficiently address the problem.
\hl{I REVISED THIS PARAGRAPH QUICKLY JUST TO GIVE AN INDICATION. WE NEED TO USE THE FORMALIZATION AND PERHAPS ALSO FORMALIZE THE PSEUDOCODE.} We design and implement a heuristic algorithm for computing the pipeline instance maximizing data quality. Our heuristic is built on a \emph{sliding window} and aims to minimize information loss according to quality metrics. At each step, a set of vertexes in the pipeline template $\tChartFunction$ is selected according to a specific window $w=[i,j]$, where $i$ and $j$ are the starting and ending depths of window $w$. Service filtering and selection in Section~\ref{sec:instance} are then executed to minimize information loss in window $w$. The heuristic returns as output the list of services instantiating vertexes at depth $i$. A new window $w=[i+1,j+1]$ is then considered, until $j+1$ equals the maximum depth of $\tChartFunction$, that is, until the window reaches the end of the template.
%For example, in our service selection problem where the quantity of information lost needs to be minimized, the sliding window algorithm can be used to select services composition that have the lowest information loss within a fixed-size window.
This strategy ensures that only services with low information loss are selected at each step, minimizing the overall information loss. Pseudo-code for the sliding window algorithm is presented in Listing~\ref{lst:slidingwindowfirstservice}.

@@ -116,7 +117,7 @@ \subsection{Heuristic}\label{subsec:heuristics}
keepspaces=true, % keeps spaces in text, useful for keeping indentation of code (possibly needs columns=flexible)
keywordstyle=\color{keywordsColor}\bfseries, % keyword style
language=Python, % the language of the code (can be overridden per snippet)
otherkeywords={*,function, Seq, add,empty}, % if you want to add more keywords to the set
numbers=left, % where to put the line-numbers; possible values are (none, left, right)
numbersep=5pt, % how far the line-numbers are from the code
numberstyle=\tiny\color{commentsColor}, % the style that is used for the line-numbers
@@ -131,7 +132,6 @@ \subsection{Heuristic}\label{subsec:heuristics}
columns=fixed % Using fixed column width (for e.g. nice alignment)
}

\begin{lstlisting}[frame=single,mathescape, caption={Sliding Window Heuristic with Selection of First Service from Optimal Combination},label={lst:slidingwindowfirstservice}]
instance = empty
for i from 0 to length(serviceCombinations):
@@ -150,7 +150,6 @@
firstService = serviceCombinations[j][0]
add firstService to instance
return instance
\end{lstlisting}

The pseudocode implements function {\em SlidingWindowHeuristic}, which takes a sequence of vertexes and a window size as input and returns a set of selected vertexes as output. The function starts by initializing an empty set of selected vertexes (line 3). Then, for each vertex in the sequence (lines 4--12), the algorithm iterates over the vertexes in the window (lines 7--11) and selects the vertex with the lowest metric value (lines 9--11). The selected vertex is then added to the set of selected vertexes (line 12). Finally, the set of selected vertexes is returned as output (line 13).
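A compact, runnable sketch of the heuristic described above follows; the data structures and the information-loss function are assumptions for illustration, not the paper's formalization:

\begin{lstlisting}[language=Python]
from itertools import product

def sliding_window_heuristic(vertexes, candidates, loss, window_size):
    # vertexes: vertex ids ordered by depth (assumes len >= window_size);
    # candidates: dict mapping each vertex to its candidate services;
    # loss: function mapping a tuple of services to its information loss.
    instance = []
    for i in range(len(vertexes) - window_size + 1):
        window = vertexes[i:i + window_size]
        # Enumerate all service combinations for the current window
        # and keep the one minimizing information loss.
        best = min(product(*(candidates[v] for v in window)), key=loss)
        # Commit only the service chosen for the window's first vertex.
        instance.append(best[0])
    # When the window reaches the end of the template, keep its tail too.
    instance.extend(best[1:])
    return instance
\end{lstlisting}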
1 change: 0 additions & 1 deletion pipeline_instance.tex
@@ -87,7 +87,6 @@ \section{Pipeline Instance}\label{sec:instance}
\label{fig:service_composition_instance}
\end{figure}

\subsection{Example}\label{sec:example_instace} TBD (give an example in which the optimum is not chosen; we will then look for the optimum and use it as the example in the next section)

% \subsection{Pipeline Instance Definition}\label{sec:instancedefinition}
% The goal of our approach is to generate an instance of the \pipelineTemplate starting from the \pipelineTemplate in Section~\ref{sec:template}. In the following, we first define the pipeline instance and the corresponding pipeline instantiation process (Section \ref{sec:instancedefinition}). We then prove that the pipeline instantiation process is NP-hard (Section \ref{sec:funcannotation}).
15 changes: 9 additions & 6 deletions pipeline_instance_example.tex
@@ -1,3 +1,5 @@
\subsection{Example}\label{sec:example_instace}

\begin{example}[\bf \pipelineInstance]\label{ex:instance}

Let us consider a subset \{\vi{5}, \vi{6}, \vi{7}\} of the pipeline template \tChartFunction in \cref{sec:example_instace}.
@@ -13,14 +15,12 @@

For each vertex, we could select the first matching service from each set $S'_*$ and incorporate it into the instance.
For instance, for \vi{6}, we select \s{61}; for \vi{7}, \s{72} is chosen; and for \vi{8}, \s{81} is the preferred option.
The instance thus formulated is depicted in \cref{tab:instance_example_valid}(a).
Note that this instance is valid, as it satisfies all the policies in the pipeline template.
However, it does not represent the optimal instance achievable.
To determine the optimal instance, it is essential to evaluate services based on specific quality metrics that reflect their impact on data quality.
In the next sections, we will introduce the metrics that we use to evaluate the quality of services and the results of the experiments conducted to evaluate the performance of our approach.

% \begin{table*}
% \def\arraystretch{1.5}
% \caption{Instance example}\label{tab:instance_example}
@@ -64,7 +64,7 @@
\multirow{ 3}{*}{\vi{6} $\rightarrow$ \p{7},\p{8} } & $\s{61}$ & visualization\_location = "CT\_FACILITY" & \cmark & \cmark \\
& $\s{62}$ & visualization\_location = "CLOUD" & \cmark & \xmark \\
\end{tabular}
&

\begin{tabular}{c|c}\label{tab:instance_example_maxquality}

@@ -80,7 +80,10 @@
$\s{61}$ & 1 \\
$\s{62}$ & 2 \\
\end{tabular}
\\
(a) Valid Instance example & (b) Best Quality Instance example
\end{tabular}

}
\end{table*}

Expand Down

0 comments on commit d2ac76d

Please sign in to comment.