Commit
First round
antongiacomo committed Nov 25, 2024
1 parent 40f261a commit e0299c0
Showing 4 changed files with 14 additions and 11 deletions.
9 changes: 5 additions & 4 deletions major review2/experiment.tex
Original file line number Diff line number Diff line change
@@ -8,7 +8,7 @@ \subsection{Testing Infrastructure and Experimental Settings}\label{subsec:exper
The simulator first defines the pipeline template as a sequence of vertices, with $l$ the length of the pipeline template, and defines the size \windowsize\ of the sliding window, such that \windowsize$\leq$$l$. We recall that alternative vertices are modeled in different pipeline templates, while parallel vertices are not considered in our experiments, since they only add a fixed execution time that is negligible and does not affect the performance and quality of our solution. Each vertex is associated with a (set of) policies applying a filtering transformation that removes a given percentage of data.


The simulator then starts the instantiation process. At each step $i$, it selects the subset \{\vi{i},$\ldots$,$v_{\windowsize+i-1}$\} of vertices with their corresponding candidate services, and generates all possible service combinations. The simulator calculates quality $Q$ for all combinations and instantiates \vi{i} with service \sii{i} from the optimal combination with maximum $Q$. The window is shifted by 1 (i.e., $i$=$i$+1) and the instantiation process restarts. When the sliding window reaches the end of the pipeline template, that is, $v_{\windowsize+i-1}$$=$$\vi{l}$, the simulator computes the optimal service combination and instantiates the remaining vertices with the corresponding services. Figure~\ref{fig:execution_example} shows an example of a simulator execution with $i$$=$2 and \windowsize$=$3. Subset \{\vi{2},\vi{3},\vi{4}\} is selected, all combinations generated, and corresponding quality $Q$ calculated. Optimal service combination \{\sii{11},\sii{22},\sii{31}\} is retrieved and \vii{2} in the pipeline instance instantiated with \sii{11}.
The simulator then starts the instantiation process. At each step $i$, it selects the subset \{\vi{i},$\ldots$,$v_{\windowsize+i-1}$\} of vertices with their corresponding candidate services, and generates all possible service combinations. The simulator calculates quality $Q$ for all combinations and instantiates \vi{i} with service \sii{i} from the optimal combination with maximum $Q$. The window is shifted by 1 (i.e., $i$=$i$+1) and the instantiation process restarts. When the sliding window reaches the end of the pipeline template, that is, $v_{\windowsize+i-1}$$=$$\vi{l}$, the simulator computes the optimal service combination and instantiates the remaining vertices with the corresponding services. \cref{fig:execution_example} shows an example of a simulator execution with $i$$=$2 and \windowsize$=$3. Subset \{\vi{2},\vi{3},\vi{4}\} is selected, all combinations are generated, and the corresponding quality $Q$ is calculated. The optimal service combination \{\sii{11},\sii{22},\sii{31}\} is retrieved, and \vii{2} in the pipeline instance is instantiated with \sii{11}.
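The window-by-window selection described above can be sketched as follows. This is a minimal illustration rather than the authors' simulator: the candidate services, the product-of-retained-fractions quality function `Q`, and all identifiers are hypothetical stand-ins for the paper's setting.

```python
from itertools import product

def sliding_window_instantiate(candidates, w, quality):
    """Instantiate one service per vertex using a sliding window of size w.

    candidates: list of lists; candidates[i] holds the candidate services
                for vertex v_i.
    quality:    function mapping a tuple of services (one per vertex in
                the window) to a quality score Q.
    """
    l = len(candidates)
    assert 1 <= w <= l, "window size must satisfy w <= l"
    chosen = []
    i = 0
    while i < l:
        window = candidates[i:i + w]
        # Exhaustively score every service combination inside the window
        # and keep the combination with maximum Q.
        best = max(product(*window), key=quality)
        if i + w >= l:
            # The window reached the end of the template: instantiate
            # all remaining vertices with the optimal combination.
            chosen.extend(best)
            break
        # Otherwise instantiate only v_i and slide the window by 1.
        chosen.append(best[0])
        i += 1
    return chosen

# Toy data: each service is (id, fraction of data it retains); Q is the
# product of retained fractions, an assumed stand-in for the paper's
# data quality metric.
cands = [[("s11", 0.9), ("s12", 0.6)],
         [("s21", 0.8), ("s22", 0.95)],
         [("s31", 0.7), ("s32", 0.85)]]

def Q(combo):
    q = 1.0
    for _, retained in combo:
        q *= retained
    return q
```

With \windowsize$=$$l$ the loop degenerates into the exhaustive approach, which is why quality converges to the optimum as the window grows.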

\begin{figure}[!t]
\centering
@@ -136,7 +136,8 @@ \subsection{Testing Infrastructure and Experimental Settings}\label{subsec:exper
\caption{7 vertices}
\label{fig:time_window_perce_wide_7n}
\end{subfigure}
\label{fig:time_window_perce_average}
\caption{Evaluation of Performance Using the \emph{Qualitative} Metric in a \average Profile Configuration.\label{fig:time_window_perce_average}}
\end{figure}

\subsection{Quality}\label{subsec:experiments_quality}
@@ -145,7 +146,7 @@ \subsection{Testing Infrastructure and Experimental Settings}\label{subsec:exper

We run our experiments varying: \emph{i)} the length $l$ of the pipeline template in [3,7], that is, the depth of the pipeline template as the number of vertices composed in a sequence, \emph{ii)} the window size \windowsize\ in [1,$l$], and \emph{iii)} the number of candidate services for each vertex in the pipeline template in [2,7]. Each vertex is associated with a (set of) policies applying a filtering transformation that removes a percentage of data in either $[0.5,0.8]$ (\average) or $[0.2,1]$ (\wide).

\cref{fig:quality_window_perce} present our quality results using metric $M_J$ in \cref{subsec:metrics} for settings \wide and \average.
\cref{fig:quality_window_perce} presents our quality results using metric $M_J$ in \cref{subsec:metrics} for settings \wide and \average.
In general, we observe that the quality of our heuristic approach increases as the window size increases, providing a quality comparable to the exhaustive approach when the window size \windowsize\ approaches the length $l$ of the pipeline template.
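Metric $M_J$ is defined in \cref{subsec:metrics}; assuming it is a Jaccard coefficient comparing the data retained by the heuristic instance against the data retained by the exhaustive instance, a minimal sketch could look like this (the function name and inputs are hypothetical):

```python
def m_j(retained_heuristic, retained_exhaustive):
    """Jaccard coefficient between the two sets of retained tuples:
    1.0 when both pipeline instances keep exactly the same data."""
    a, b = set(retained_heuristic), set(retained_exhaustive)
    return len(a & b) / len(a | b) if a | b else 1.0
```

Under this reading, a score of 1.0 means the heuristic loses no data relative to the exhaustive solution, matching the convergence reported below as \windowsize\ approaches $l$.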

When considering setting \wide, the baseline (\windowsize=1) provides good results on average (from 0.71 to 0.90), while showing substantial quality oscillations in specific runs: between 0.882 and 0.970 for 3 vertices, 0.810 and 0.942 for 4 vertices, 0.580 and 0.853 for 5 vertices, 0.682 and 0.943 for 6 vertices, 0.596 and 0.821 for 7 vertices. The same trend emerges when the window size is $<$$l$/2, while the quality starts approaching the optimum when the window size is $\geq$$l$/2. For instance, when \windowsize=$l$-1, the quality varies between 0.957 and 1.0 for 3 vertices, 0.982 and 1.0 for 4 vertices, 0.986 and 0.998 for 5 vertices, 0.977 and 1.0 for 6 vertices, 0.996 and 1.0 for 7 vertices.
@@ -225,7 +226,7 @@ \subsection{Testing Infrastructure and Experimental Settings}\label{subsec:exper
When considering setting \average, the heuristic algorithm still provides good results, limiting the quality oscillations observed for setting \wide\ and approaching the quality of the exhaustive also for lower window sizes. The baseline (\windowsize=1) provides good results on average (from 0.842 to 0.944), as well as in specific runs: between 0.927 and 0.978 for 3 vertices, 0.903 and 0.962 for 4 vertices, 0.840 and 0.915 for 5 vertices, 0.815 and 0.934 for 6 vertices, 0.721 and 0.935 for 7 vertices.
When \windowsize=$l$-1, the quality varies between 0.980 and 1.0 for 3 vertices, 0.978 and 1.0 for 4 vertices, 0.954 and 1.0 for 5 vertices, 0.987 and 1.0 for 6 vertices, 0.990 and 1.0 for 7 vertices.

\cref{fig:quality_window_qualitative} present our quality results using metric $M_{JSD}$ in \cref{subsec:metrics} for settings \wide and \average, respectively.
\cref{fig:quality_window_qualitative} presents our quality results using metric $M_{JSD}$ in \cref{subsec:metrics} for settings \wide and \average, respectively.

When considering setting \wide, the baseline (\windowsize=1) provides good results on average (from 0.92 to 0.97), limiting the oscillations observed with metric $M_J$; for instance, the quality varies between 0.951 and 0.989 for 3 vertices, 0.941 and 0.988 for 4 vertices, 0.919 and 0.974 for 5 vertices, 0.911 and 0.971 for 6 vertices, 0.877 and 0.924 for 7 vertices.
The worst quality results are obtained with the baseline, while the oscillations are negligible when the window size is $>$2. For instance, when \windowsize=$l$-2, the quality varies between 0.982 and 0.996 for 4 vertices, 0.981 and 0.998 for 5 vertices, 0.988 and 1.0 for 6 vertices, 0.976 and 0.999 for 7 vertices. When \windowsize=$l$-1, the quality varies between 0.987 and 0.998 for 3 vertices, 0.993 and 1.0 for 4 vertices, 0.985 and 0.999 for 5 vertices, 0.997 and 1.0 for 6 vertices, 0.995 and 1.0 for 7 vertices.
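Metric $M_{JSD}$ is likewise defined in \cref{subsec:metrics}; assuming it builds on the Jensen-Shannon divergence between the value distributions of the data before and after pipeline filtering, a small self-contained sketch (names and the $1-\mathit{JSD}$ scoring are our assumptions, not the paper's definition) is:

```python
from math import log2

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions,
    with base-2 logarithm so the value lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def m_jsd(p, q):
    # Similarity score: 1 for identical distributions, 0 for disjoint ones.
    return 1.0 - jsd(p, q)
```

Because the divergence is bounded and symmetric, such a score reacts more smoothly to removed data than a set-based Jaccard comparison, which is consistent with the smaller oscillations reported for $M_{JSD}$.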
2 changes: 1 addition & 1 deletion major review2/main.tex
@@ -54,7 +54,7 @@
\maketitle

\begin{abstract}
~Today, the increasing ability of collecting and managing huge volume of data, coupled with a paradigm shift in service delivery models, has significantly enhanced scalability and efficiency in data analytics, particularly in multi-tenant environments. Data are today treated as digital products, which are managed and analyzed by multiple services orchestrated in data pipelines. {\color{OurColor} This paradigm shift towards distributed systems, structured as service-based data pipelines, lacks corresponding advancements in data governance techniques that effectively manage data throughout the pipeline lifecycle. This gap highlights the need for innovative data pipeline management solutions that prioritize balancing data quality with data protection. Departing from the state of the art that traditionally focuses on monolithic services and systems, and treats data protection and data quality as independent factors}, we propose a framework that enhance service selection and composition in service-based data pipelines to maximize data quality, while ensuring a minimum level of data protection. Our approach first retrieves a set of candidate services compatible with data protection requirements in the form of access control policies; it then selects the subset of compatible services, to be integrated within the data pipeline, which maximizes the overall data quality. Being our approach NP-hard, a sliding-window heuristic is defined and experimentally evaluated in terms of performance and quality with respect to the exhaustive approach {\color{OurColor}and a baseline modeling the state of the art}. Our results demonstrate a significant reduction in computational overhead, while maintaining high data quality.
~Today, the increasing ability to collect and manage huge volumes of data, coupled with a paradigm shift in service delivery models, has significantly enhanced scalability and efficiency in data analytics, particularly in multi-tenant environments. Data are today treated as digital products, which are managed and analyzed by multiple services orchestrated in data pipelines. {\color{OurColor} This paradigm shift towards distributed systems, structured as service-based data pipelines, lacks corresponding advancements in data governance techniques that effectively manage data throughout the pipeline lifecycle. This gap highlights the need for innovative data pipeline management solutions that prioritize balancing data quality with data protection. Departing from the state of the art, which traditionally focuses on monolithic services and systems and treats data protection and data quality as independent factors}, we propose a framework that can enhance service selection and composition in service-based data pipelines to maximize data quality, while ensuring a minimum level of data protection. Our approach first retrieves a set of candidate services compatible with data protection requirements in the form of access control policies; it then selects the subset of compatible services, to be integrated within the data pipeline, that maximizes the overall data quality. Since our approach is NP-hard, a sliding-window heuristic is defined and experimentally evaluated in terms of performance and quality with respect to the exhaustive approach {\color{OurColor}and a baseline modeling the state of the art}. Our results demonstrate a significant reduction in computational overhead, while maintaining high data quality.
\end{abstract}

\tikzset{
12 changes: 7 additions & 5 deletions major review2/readme.md
@@ -1,11 +1,13 @@
- [ ] The Abstract is not concise, and some of the parts could be moved to the Introduction section.
- [ ] There are some assumptions in the paper. For some assumptions, the paper does not provide the support/comments. The readers may wonder if the assumptions are true in practical scenarios. It could be better if the paper can provide some support or some comments about the assumptions.
- [ ] The caption of Fig. 6 is missing. Also, Fig. 6 is not referred to in the paper.

- [ ] The evaluation for the proposed approach is still not enough. The performance of the proposed approach should be extensively validated. Although the work has tested its performance on some metrics, the proposed system should be fully tested on various evaluation metrics in various scenarios so that the real performance can be verified. Considering that the real-world application is more complex, the readers could wonder if the proposed approach is practical to be applied to real-world applications (or how practical the proposed method can be applied to real-world applications). I understand that this may introduce some extra work. In the future work, the authors may consider it.
- [ ] There are some parameters in the algorithms, and the performance of the proposed solution could be affected by the settings of the parameters. The readers may wonder if the paper needs some sensitivity testing so that it can better reflect the real performance of the proposed solution. This can help to fully verify the performance of the proposed solution, and enhance the quality of work.

- [ X ] ~~The caption of Fig. 6 is missing. Also, Fig. 6 is not referred to in the paper.~~
- [ ] The paper needs proofreading, and the presentation of the paper could be improved. There are some typos or grammar errors in the paper. Here are some examples.
- [ ] Page 2, Abstract, line 2: “enhance” could be changed to “can enhance”
- [ ] Page 31, paragraph 3: “present” could be changed to “presents”
- [ ] Page 33, paragraph 2: “present” could be changed to “presents”
- [ X ] ~~Page 2, Abstract, line 2: “enhance” could be changed to “can enhance”~~
- [ X ] ~~Page 31, paragraph 3: “present” could be changed to “presents”~~
- [ X ] ~~Page 33, paragraph 2: “present” could be changed to “presents”~~
- [ ] Some figures (e.g., Figure 6, Figure 7, Figure 8) in the paper are small (e.g., font size) in the printed version, which makes it a little difficult for the readers to clearly see the content in the figures.
- [ ] Some terms in the paper are not always consistent. For example, sometimes the paper uses “Fig.”, but sometimes it uses “Figure”.
- [ X ] ~~Some terms in the paper are not always consistent. For example, sometimes the paper uses “Fig.”, but sometimes it uses “Figure”.~~
2 changes: 1 addition & 1 deletion major review2/system_model.tex
@@ -71,7 +71,7 @@ \subsection{Reference Scenario}\label{sec:service_definition}

\end{table*}

The user's objective aligns with the predefined service pipeline in Figure \ref{fig:reference_scenario} that orchestrates the following sequence of operations:
The user's objective aligns with the predefined service pipeline in \cref{fig:reference_scenario} that orchestrates the following sequence of operations:
\begin{enumerate*}[label=(\roman*)]
\item \emph{Data fetching}, including the download of the dataset from other states;
\item \emph{Data preparation}, including data merging, cleaning, and anonymization;
