diff --git a/introduction.tex b/introduction.tex index bce34dc..ff4773e 100644 --- a/introduction.tex +++ b/introduction.tex @@ -7,21 +7,30 @@ \section{Introduction} As a consequence, achieving a balance between data protection and data quality is crucial, as the removal or alteration of personally identifiable information from datasets to safeguard individuals' privacy can compromise the accuracy of analytics results. So far, research endeavors have concentrated on exploring these two issues separately: on one hand, the concept of data quality, encompassing accuracy, reliability, and suitability, has been investigated to understand the implications in analytical contexts. Although extensively studied, these investigations often prioritize enhancing the quality of source data rather than ensuring data quality throughout the entire processing pipeline, or the integrity of outcomes derived from data. On the other hand, there is a focus on data privacy and security, entailing the protection of confidential information and adherence to rigorous privacy regulations. -There are very few solutions that that find a good balance between them since it requires a holistic approach that integrates technological solutions, organizational policies, and ongoing monitoring and adaptation to emerging threats and regulatory changes. A valid solution should implement robust access control mechanisms, ensuring that only authorized users can access specific datasets or analytical tools. -Additionally, data protection requirements should be identified at each stage of the data lifecycle, potentially incorporating techniques like data masking and anonymization to safeguard sensitive information by substituting it with realistic but fictional data, thereby preserving data privacy while enabling analysis. An ideal solution should prioritize data lineage, fostering a comprehensive understanding and optimization of data flows and transformations within complex analytical ecosystems. -To this aim, we propose a data governance framework tailored to contemporary data-driven pipelines, which aims to limit the privacy and security risks. The primary objective of this framework is to facilitate the assembly of data processing services, with a central focus on the selection of those services that optimize data quality, while upholding privacy and security requirements. +A valid solution requires a holistic approach that integrates technological solutions, organizational policies, and ongoing monitoring and adaptation to emerging threats and regulatory changes. The implementation of robust access control mechanisms, ensuring that only authorized users can access specific datasets or analytical tools, is just an essential initial step. +Indeed, we identified some additional requirements. First, data protection requirements should be identified at each stage of the data lifecycle, potentially incorporating techniques like data masking and anonymization to safeguard sensitive information, thereby preserving data privacy while enabling sharing and analysis. Second, data lineage should be prioritized, fostering a comprehensive understanding and optimization of data flows and transformations within complex analytical ecosystems. -Each service in the pipeline is tagged with \textit{annotations} to specify data protection requirements expressing transformation on data to enforce data protection, as well as functional specifications on services expressing data manipulations carried out during services execution.
There is a catalog of services among which a user can choose. Services may be functionally equivalent (i.e., they perform the same task), but have different security policies, more or less restrictive\st{, depending on the provider's attributes}. Thus a user si trova a avere tutto l'interesse di capire come le sue scelte service selection possano avere un impatto sulla qualità del risultato finale. Our service selection approach focuses on maximizing data quality by retaining the maximum amount of information when applying data protection transformations; - - -The key contributions are as follows: +When evaluating a solution meeting these criteria, the following questions naturally arise: \begin{enumerate} - \item Service enrichment (metadati) - \item The composite selection problem is NP-hard, but we present a parametric heuristic tailored to address the computational - complexity. We evaluated the performance and quality of the algorithm by running some experiments on a dataset. +\item How does a robust data protection policy affect analytics? +%What impact does a robust and strong data protection policy have on analytics? +%How does a strong data protection policy influence analytics? +\item When considering a pipeline, is it more advantageous to perform data protection transformations at each step rather than filtering all data at the outset? +\item In a scenario where a user has the option to choose among various candidate services, how might these choices affect the analytics? \end{enumerate} +% +Based on the aforementioned considerations, we propose a data governance framework customized for modern data-driven pipelines, designed to mitigate privacy and security risks. The primary objective of this framework is to facilitate the assembly of data processing services, with a central focus on the selection of those services that optimize data quality, while upholding privacy and security requirements. -The rest of the paper is organized as follows. %In Section \ref{sec:requirements} In Section \ref{} In Section \ref{} +In our solution, each element in the pipeline is tagged with \textit{annotations} to specify data protection requirements expressing transformations on data to enforce data protection, as well as functional specifications on services expressing data manipulations carried out during services execution. For each element in the pipeline, there is a catalog of candidate services among which the user running the pipeline can choose. Services may be functionally equivalent (i.e., they perform the same tasks), but have different security policies, more or less restrictive depending on the organization or service provider they belong to. +Our data governance approach is based on the belief that maintaining a larger volume of data leads to higher data quality; thus, our service selection approach focuses on maximizing data quality by retaining the maximum amount of information when applying data protection transformations. +We present a running example on which we conducted experiments to provide an initial answer to the questions highlighted above.
-ecosystem of services +The primary contributions of the paper can be summarized as follows: +\begin{enumerate} + \item Enriching services with meta-data that describes both data protection and functional requirements; + \item Proposing a parametric heuristic tailored to address the computational complexity of the NP-hard service selection problem; + \item Evaluating the performance and quality of the algorithm through experiments conducted using the dataset from the running example. +\end{enumerate} +% +The remainder of the paper is structured as follows: Section \ref{sec:requirements} presents our system model, illustrating a reference scenario where data is owned by multiple organizations. In Section \ref{sec:template}, we introduce the pipeline template and describe the annotations. In Section \ref{sec:instance}, we describe the process of template instantiation. In Section \ref{sec:heuristics}, we introduce the quality metrics used by the selection algorithm and the heuristic solving the service selection problem. Section \ref{sec:experiment} outlines the experiments conducted. Section \ref{sec:related} discusses the existing solutions. Section \ref{sec:conclusions} concludes the paper. \ No newline at end of file diff --git a/main.tex b/main.tex index 9213f49..80b4d17 100644 --- a/main.tex +++ b/main.tex @@ -109,7 +109,8 @@ \input{experiment} \input{related} -\section{Conclusions} +\section{Conclusions}\label{sec:conclusions} + \clearpage %\bibliographystyle{spbasic} % basic style, author-year citations \bibliographystyle{spmpsci} % mathematics and physical sciences diff --git a/pipeline_instance.tex b/pipeline_instance.tex index 13a7370..c90042a 100644 --- a/pipeline_instance.tex +++ b/pipeline_instance.tex @@ -7,7 +7,7 @@ \section{Pipeline Instance}\label{sec:instance} \begin{enumerate*}[label=\textit{\roman*})] \item $v'_r$$=$$v_r$; \item for each vertex $\vi{f}$ modeling a parallel structure, there exists a corresponding vertex $\vii{f}$; - \item for each $\vi{i}$$\in$$\V_S$ annotated with policy \P{i} (label \myLambda(\vi{i})) and functional description $F_i$(label \myGamma(\vi{i})), there exists a corresponding vertex \vii{i}$\in$$\Vp_S$ instantiated with a service \sii{i}, such that: + \item for each $\vi{i}$$\in$$\V_S$ annotated with policy \P{i} (label \myLambda(\vi{i})) and functional description $F_i$ (label \myGamma(\vi{i})), there exists a corresponding vertex \vii{i}$\in$$\Vp_S$ instantiated with a service \sii{i}, such that: \end{enumerate*} \begin{enumerate}[label=\arabic*)] \item $s'_i$ satisfies data protection annotation \myLambda(\vi{i}) in \tChartFunction; diff --git a/pipeline_template.tex b/pipeline_template.tex index ca4acb8..2c328e0 100644 --- a/pipeline_template.tex +++ b/pipeline_template.tex @@ -1,9 +1,9 @@ \section{Pipeline Template}\label{sec:template} Our approach integrates data protection and data management into the service pipeline using annotations. -To this aim, we extend the service pipeline in \cref{def:pipeline} with: \emph{i)} data protection annotations to express transformations on data, ensuring compliance with data protection requirements, \emph{ii)} functional annotations to express data manipulations carried out during services execution. +To this aim, we extend the service pipeline in \cref{def:pipeline} with: \emph{i)} data protection annotations that also express transformations on data, ensuring compliance with data protection requirements, and \emph{ii)} functional annotations to express data manipulations carried out during services execution.
These annotations enable the implementation of an advanced data lineage, tracking the entire data lifecycle by monitoring changes that result from functional service execution and data protection requirements. -In the following, we first introduce the annotated service pipeline, called pipeline template (Section \ref{sec:templatedefinition}). We then present functional annotations (Section \ref{sec:funcannotation}) and data protection annotations (Section \ref{sec:nonfuncannotation}). We finally provide an example of a pipeline template (Section \ref{sec:example_template}). +In the following, we first introduce the annotated service pipeline, called pipeline template (Section \ref{sec:templatedefinition}). Then, we present both functional annotations (Section \ref{sec:funcannotation}) and data protection annotations (Section \ref{sec:nonfuncannotation}), providing an example of a pipeline template in the context of the reference scenario. \subsection{Pipeline Template Definition}\label{sec:templatedefinition} @@ -17,16 +17,16 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition} The template is formally defined as follows. \begin{definition}[Pipeline Template] \label{def:template} - Given a service pipeline G(\V,\E), a pipeline template $G^{\myLambda,\myGamma}$(V,E,\myLambda,\myGamma) is a direct acyclic graph extedend with two annotation functions: + Given a service pipeline G(\V,\E), a pipeline template \tChartFunction is a directed acyclic graph extended with two annotation functions: \begin{enumerate}%[label=\textit{\roman*}] - \item \emph{Data Protection Annotation} \myLambda that assigns a label \myLambda(\vi{i}) to each vertex $\vi{i}\in\V_S$. Label \myLambda(\vi{i}) corresponds to a set \P{i} of policies $p_j$ to be satisfied by service $s_i$ represented by \vi{i}, ; + \item \emph{Data Protection Annotation} \myLambda that assigns a label \myLambda(\vi{i}) to each vertex $\vi{i}\in\V_S$. Label \myLambda(\vi{i}) corresponds to a set \P{i} of policies $p_j$ to be satisfied by service $s_i$ represented by \vi{i}; \item \emph{Functional Annotation} \myGamma that assigns a label \myGamma(\vi{i}) to each vertex $\vi{i}\in\V_S$. Label \myGamma(\vi{i}) corresponds to the functional description $F_i$ of service $s_i$ represented by \vi{i}. \end{enumerate} \end{definition} We note that, at this stage, the template is not yet linked to any service. We also note that policies $p_j$$\in$\P{i} in \myLambda(\vi{i}) are combined using logical OR, meaning that the access decision is positive if at least one policy $p_j$ evaluates to \emph{true}. %We also note that functional description $F_i$ includes the specific data transformation triggered as the result of a service execution. - An example of pipeline template is depicted in \cref{fig:service_composition_template} + The pipeline template of the service pipeline in \cref{fig:reference_scenario} is depicted in \cref{fig:service_composition_template}. %The next sections better explain the functional and non-functional transformation functions. \begin{figure}[ht!] @@ -102,7 +102,7 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition} A {\it policy p}$\in$\P{} is a 5-tuple $<$\textit{subj}, \textit{obj}, \textit{act}, \textit{env}, \textit{\TP}$>$ that specifies who (\emph{subject}) can access what (\emph{object}) with action (\emph{action}), in a specific context (\emph{environment}) and under specific obligations (\emph{data transformation}).
\end{definition} - More in detail, \textit{subject subj} defines a service $s_i$ issuing an access request to perform an action on an object. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. For instance, $<$\{(classifier $=$ "SVM")\}$>$ specifies a service providing a SVM classifier. We note that \textit{subj} can also specify conditions on the service owner ($<$\{(owner\_location $=$ "EU")\}$>$) and the service user ($<$\emph{service},\{(service\_user\_role $=$ "DOC Director")\}$>$). + In more detail, \textit{subject subj} specifies a service $s_i$ issuing an access request to perform an action on an object. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. For instance, $<$\{(classifier $=$ "SVM")\}$>$ specifies a service providing an SVM classifier. We note that \textit{subj} can also specify conditions on the service owner ($<$\{(owner\_location $=$ "EU")\}$>$) and the service user ($<$\emph{service},\{(service\_user\_role $=$ "DOC Director")\}$>$). %\item \textit{Object obj} defines those data whose access is governed by the policy. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. @@ -116,11 +116,11 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition} \textit{Environment env} defines a set of conditions on contextual attributes, such as time of the day, location, IP address, risk level, weather condition, holiday/workday, emergency. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. For instance, $<$\{(time $=$ "night")\}$>$ refers to a policy that is applicable only at night. %\item - \textit{Data Transformation \TP} defines a set of security and privacy-aware transformations on \textit{obj}, which must be enforced before any access to data. Transformations focus on data protection, as well as compliance to regulations and standards, in addition to simple format conversions. For instance, let us define three transformations that can be applied to the dataset in \cref{tab:dataset}: + \textit{Data Transformation \TP} defines a set of security and privacy-aware transformations on \textit{obj} that must be enforced before any access to data is given. Transformations focus on data protection, as well as on compliance with regulations and standards, in addition to simple format conversions. For instance, let us define three transformations that can be applied to the dataset in \cref{tab:dataset}: \begin{enumerate*}[label=\roman*)] \item \emph{level0} (\tp{0}): no anonymization; - \item \emph{level1} (\tp{1}): partial anonymization with only the first name and last name being anonymized; - \item \emph{level2} (\tp{2}): full anonymization with the first name, last name, identifier, and age being anonymized. + \item \emph{level1} (\tp{1}): partial anonymization with only the first and last name being anonymized; + \item \emph{level2} (\tp{2}): full anonymization with the first name, last name, identifier, and age being anonymized. \end{enumerate*} %\end{description} @@ -148,7 +148,9 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition} % %\end{description} - To conclude, access control policies $p_j$$\in$\P{i} annotating vertex \vi{i} in a pipeline template $G^{\myLambda,\myGamma}$ filters out those candidate services $s$$\in$$S^c$ that do not match data protection requirements.
Specifically, each policy $p_j$$\in$\P{i} verifies whether a candidate service $s$$\in$$S^c$ for vertex \vi{i} is compatible with data protection requirements in \myLambda(\vi{i}). Policy evaluation matches the profile \profile\ of a candidate service $s$$\in$$S^c$ with the policy conditions in each $p_j$$\in$\P{i}. If the credentials and declarations, defined as a set of attributes in the form (\emph{name}, \emph{value}), in the candidate service profile fails to meet the policy conditions, meaning that no policies $p_j$ evaluates to \emph{true}, the service is discarded; otherwise it is added to the set $S'$ of compatible service, which is used in Section~\ref{sec:instance} to generate the pipeline instance $G'$. No policy enforcement is done at this stage. +\textcolor{red}{Access control policies $p_j$$\in$\P{i} annotating a vertex \vi{i} in a pipeline template $G^{\myLambda,\myGamma}$ specify the data protection requirements that a candidate service must fulfill in order to be selected in the pipeline instance. In Section~\ref{sec:instance}, we describe the selection process.} + + % To conclude, access control policies $p_j$$\in$\P{i} annotating vertex \vi{i} in a pipeline template $G^{\myLambda,\myGamma}$ filters out those candidate services $s$$\in$$S^c$ that do not match data protection requirements. Specifically, each policy $p_j$$\in$\P{i} verifies whether a candidate service $s$$\in$$S^c$ for vertex \vi{i} is compatible with data protection requirements in \myLambda(\vi{i}). Policy evaluation matches the profile \profile\ of a candidate service $s$$\in$$S^c$ with the policy conditions in each $p_j$$\in$\P{i}. If the credentials and declarations, defined as a set of attributes in the form (\emph{name}, \emph{value}), in the candidate service profile fails to meet the policy conditions, meaning that no policies $p_j$ evaluates to \emph{true}, the service is discarded; otherwise it is added to the set $S'$ of compatible service, which is used in Section~\ref{sec:instance} to generate the pipeline instance $G'$. No policy enforcement is done at this stage. \subsection{Functional Annotations}\label{sec:funcannotation} A proper data management approach must track functional data manipulations across the entire pipeline execution, defining the functional requirements of each service operating on data. diff --git a/pipeline_template_example.tex b/pipeline_template_example.tex index 41de5be..5ddb94f 100644 --- a/pipeline_template_example.tex +++ b/pipeline_template_example.tex @@ -36,18 +36,18 @@ \end{table*} \begin{example}[\bf \pipelineTemplate]\label{ex:template} -Let us consider our reference scenario in \cref{sec:systemmodel}. +Let us consider the reference scenario introduced in \cref{sec:systemmodel}. \cref{fig:service_composition_template} presents an example of a pipeline template consisting of five stages, each one annotated with a policy in \cref{tab:anonymization}. % We recall that \cref{tab:dataset} shows a sample of our reference dataset. % 1° NODO % The first stage consists of three parallel vertices \vi{1}, \vi{2}, \vi{3} for data collection. Data protection annotations \myLambda(\vi{1}), \myLambda(\vi{2}), \myLambda(\vi{3}) refer to policy \p{0} with an empty transformation \tp{0}. -Functional requirement \F{1}, \F{2}, \F{3} prescribes a URI as input and the corresponding dataset as output. +Functional requirements \F{1}, \F{2}, \F{3} prescribe a URI as input and the corresponding dataset as output.
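For illustration only, the following minimal sketch (in Python, using hypothetical data structures and attribute values that are not part of our implementation) shows how the policy evaluation described in the previous section could filter the candidate services of an annotated vertex: policies are combined with a logical OR, and a candidate whose profile satisfies at least one of them is added to the compatible set $S'$, with no enforcement at this stage.
\begin{verbatim}
def satisfies(profile, conditions):
    # A policy condition set holds if every (name, value) pair matches the profile.
    return all(profile.get(name) == value for name, value in conditions.items())

def compatible(profile, policies):
    # Policies annotating a vertex are combined with a logical OR.
    return any(satisfies(profile, p["conditions"]) for p in policies)

# Hypothetical policies and candidate profiles; attribute names follow the examples above.
policies = [
    {"conditions": {"owner_location": "EU"}, "transformation": "level1"},
    {"conditions": {"service_user_role": "DOC Director"}, "transformation": "level0"},
]
candidates = [
    {"name": "merge_eu", "profile": {"owner_location": "EU", "classifier": "SVM"}},
    {"name": "merge_us", "profile": {"owner_location": "US"}},
]
# Compatible set S': services satisfying at least one policy; no enforcement yet.
S_prime = [c for c in candidates if compatible(c["profile"], policies)]
\end{verbatim}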
The second stage consists of vertex \vi{4}, merging the three datasets obtained at the first stage. Data protection annotation \myLambda(\vi{4}) refers to policies \p{1} and \p{2}, which apply different data transformations depending on the relation between the dataset and the service owner. % 2° NODO % -If the service owner is also the dataset owner (\pone), the dataset is not anonymized (\tp{0}). If the service owner is a partner of the dataset owner (\ptwo), the dataset is anonymized at level $l_1$ (\tp{1}). If the service owner has no partner relationship with the dataset owner, no policies apply. +If the service owner is also the dataset owner (\pone), the dataset is not anonymized (\tp{0}). If the service owner is a partner of the dataset owner (\ptwo), the dataset is anonymized at \emph{level1} (\tp{1}). If the service owner has no partner relationship with the dataset owner, no policy applies. %if the service owner is neither the dataset owner nor a partner of the dataset owner (\pthree), the dataset is anonymized level2 (\tp{2}). Functional requirement \F{4} prescribes $n$ datasets as input and the merged dataset as output. diff --git a/system_model.tex b/system_model.tex index 9f6fea9..75f751b 100644 --- a/system_model.tex +++ b/system_model.tex @@ -1,67 +1,40 @@ -\section{System Model and Service Pipeline}\label{sec:requirements} -\st{Big data is highly dependent on cloud-edge computing, which makes extensive use of multitenancy. -Multitenancy permits sharing one instance of infrastructures, platforms or applications by multiple tenants to optimize costs. This leads to common scenarios where a service provider offers subscription-based analytics capabilities in the cloud, or a single data lake is accessed by multiple customers. Big data pipelines then mix data and services which belong to various organizations, posing a serious risk of potential privacy and security violations. -We propose a data governance framework tailored to contemporary data-driven pipelines, which aims to limit the privacy and security risks. The primary objective of this framework is to facilitate the assembly of data processing services, with a central focus on the selection of those services that optimize data quality, while upholding privacy and security requirements.} - -In the following of this section, we present our system model (Section \ref{sec:systemmodel}) and our reference scenario (Section \ref{sec:service_definition}). +\section{System Model and \textcolor{red}{Reference Scenario}}\label{sec:requirements} +In the rest of this section, we present our system model (Section \ref{sec:systemmodel}) and our reference scenario (Section \ref{sec:service_definition}), \textcolor{red}{which consists of a service pipeline for analyzing a dataset of individuals detained in the Department of Correction facilities in the state of Connecticut while awaiting trial}. \subsection{System Model}\label{sec:systemmodel} -\st{In today's data landscape, the coexistence of data quality and data privacy is critical to support high-value services and pipelines. The increase in data production, collection, and usage has led to a split in scientific research priorities. -%This has resulted in two main focus areas. -First, researchers are exploring methods to optimize the usage of valuable data. Here, ensuring data quality is vital, and requires accuracy, reliability, and soundness for analytical purposes. -Second, there is a need to prioritize data privacy and security.
This involves safeguarding confidential information and complying with strict privacy regulations. These two research directions are happening at the same time, though there are not many solutions that find a good balance between them. - -Our approach seeks to harmonize these objectives by establishing a data governance framework that balances privacy and data quality. } -Our system model is derived by a generic big-data framework and is composed of the following parties: +We consider a service-based environment where a service pipeline is designed to analyze data. Our system model is derived from a generic big-data framework \textcolor{red}{enriched with metadata specifying data protection requirements and functional specifications. } It is composed of the following parties: \begin{description} - \item[Service,] a software distributed by a \textbf{service provider} that performs a specific task \st{according to access control privileges on data}; %, a service can be tagged with some policies %, a service is characterized by two function: the service function and the policy function. - \item[Pipeline,] a sequence of connected services that collect, prepare, process, and analyze data in a structured and automated manner. \st{We distinguish between a \textbf{pipeline template} that acts as a skeleton, specifying the structure of the pipeline and the (non-)functional requirements driving service selection and composition, and a \textbf{pipeline instance} instantiating the template with services according to the specified requirements}; - \item[Data Governance Policy,] a structured set of privacy guidelines, rules, and procedures regulating data access and protection; - \item[User,] executing an analytics pipeline on the data. We assume that the data target of the analytics pipeline are ready for analysis, i.e., they underwent a preparatory phase addressing issues such as missing values, outliers, and formatting discrepancies. This ensures that the data are in an optimal state for subsequent analysis. + \item[Service,] a software component distributed by a \textbf{service provider} that performs a specific task; + \item[Pipeline,] a sequence of connected services that collect, prepare, process, and analyze data in a structured and automated manner; + \item[Data Governance Policy,] a structured set of privacy guidelines, rules, and procedures regulating data access and protection. \textcolor{red}{In particular, each component service in the pipeline is annotated with data protection requirements and functional specifications.} + \item[User,] the entity executing an analytics pipeline on the data. + \textcolor{red}{ \item[Dataset,] the data target of the analytics pipeline. We assume that the data are ready for analysis, i.e., they underwent a preparatory phase addressing issues such as missing values, outliers, and formatting discrepancies. This ensures that the data are in an optimal state for subsequent analysis.} \end{description} - -the annotated pipeline, called \textit{pipeline template}, now acts as a skeleton, specifying the structure of the pipeline and both the functional and non-functional requirements; - -the \textit{pipeline instance} is built by instantiating the template with services according to the specified requirements.
- -We distinguish between a \textbf{pipeline template} that acts as a skeleton, specifying the structure of the pipeline, i.e., the chosen sequence of desired services, and both the functional and non-functional requirements driving service selection and composition, and a \textbf{pipeline instance} instantiating the template with services according to the specified requirements. -The \user first selects a pipeline template among a set of functionally-equivalent templates according to its non-functional requirements. It then instantiates the template in a pipeline instance. To this aim, for each component service in the template, it retrieves a set of candidate services that satisfy the functional requirements of the component service. Candidate services are filtered to retrieve a list of compatible services that comply with the policies specified in the template. - -%The \user starts its analytics by first selecting a pipeline template among a set of functionally-equivalent templates. The template is selected according to the \user\ non-functional requirements and then instantiated in a pipeline instance. In particular, for each component service in the template, a real service is selected among a list of compatible services in the instance. Compatible services are functionally equivalent and comply with the privacy policies specified in the template. -The \user first selects a pipeline template among a set of functionally-equivalent templates according to its non-functional requirements. It then instantiates the template in a pipeline instance. To this aim, for each component service in the template, it retrieves a set of candidate services that satisfy the functional requirements of the component service. Candidate services are filtered to retrieve a list of compatible services that comply with the policies specified in the template. - -Compatible services are ranked based on their ability to retain the maximum amount of information (\emph{data quality} in this paper), while maintaining a minimum level of privacy; the best service is then selected to instantiate the corresponding component service in the template. -Upon selecting the most suitable service for each component service in the pipeline template, the pipeline instance is completed and ready for execution. -It is important to note that our data governance approach builds on the following assumption: \emph{upholding a larger quantity of data is linked to better data quality.} -While this assumption is not true in all settings, it correctly represents many real-world scenarios. We leave a solution that departs from this assumption to our future work. - -\subsection{Service Pipeline and Reference Scenario}\label{sec:service_definition} -We consider a service-based environment where a service pipeline is designed to analyze data. +% We define a service pipeline as a graph defined as follows. % and depicted in \cref{fig:service_pipeline}. \begin{definition}[\pipeline]\label{def:pipeline} % A \pipeline is as a direct acyclic graph G(\V,\E), where \V\ is a set of vertices and \E\ is a set of edges connecting two vertices \vi{i},\vi{k}$\in$\V. 
The graph has a root \vi{r}$\in$\V, a vertex \vi{i}$\in$\V$_S$ for each service $s_i$, two additional vertices \vi{c},\vi{m}$\in$\V$_{\timesOperator}$$\subset$\V\ for each alternative ($\timesOperator$) structure modeling the alternative execution (\emph{choice}) of operations and the retrieval (\emph{merge}) of the results, respectively, and one additional vertex \vi{f} $\in$\V$_{\plusOperator}$$\subset$\V\ for each parallel ($\plusOperator$) structure modeling the contemporary execution (\emph{fork}) of operations. A \pipeline is a directed acyclic graph G(\V,\E), where \V\ is a set of vertices and \E\ is a set of edges connecting two vertices \vi{i},\vi{k}$\in$\V. - The graph has a root \vi{r}$\in$\V, a vertex \vi{i}$\in$\V$_S$ for each service $s_i$, an additional vertex \vi{f}$\subset$\V\ for each parallel ($\plusOperator$) structure modeling the contemporary execution (\emph{fork}) of services. + The graph has a root \vi{r}$\in$\V, a vertex \vi{i}$\in$\V$_S$ for each service $s_i$, and an additional vertex \vi{f}$\in$\V\ for each parallel ($\plusOperator$) structure modeling the concurrent execution (\emph{fork}) of services. \end{definition} -We note that \{\vi{r},\vi{f}\}$\cup$\V$_S$$=$\V, vertices \vi{f} model branching for parallel structures, and root \vi{r} possibly represents the orchestrator. We also note that, for simplicity but no lack of generality, alternative structures modeling the alternative execution of services are specified as alternative service pipelines, that is, there is no alternative structure in a single service pipeline. +We note that \V$=$\{\vi{r},\vi{f}\}$\cup$\V$_S$, with vertices \vi{f} modeling branching for parallel structures, and root \vi{r} possibly representing the orchestrator. In addition, for simplicity but without loss of generality, alternative structures modeling the alternative execution of services are specified as alternative service pipelines, that is, there is no alternative structure in a single service pipeline. -% A service pipeline is as a direct acyclic graph G(\V,\E), where \V\ is a set of vertices, one for each service $s_i$ in the pipeline, \E\ is a set of edges connecting two services $s_i$ and $s_j$, and \myLambda\ is an annotation function that assigns a label \myLambda(\vi{i}), corresponding to a data transformation \F\ implemented by the service $s_i$, for each vertex \vi{i}$\in$\V. -Our reference scenario considers a service pipeline analyzing a dataset of individuals detained in Department of Correction facilities in the state of Connecticut while awaiting trial \cite{toadd}. -In particular, the user, a member of the Connecticut Department of Correction (DOC), seeks to compare admission trends in Connecticut prisons with DOCs in New York and New Hampshire. We assume DOCs to be partners and share data according to their privacy policies. -The user's preferences align with a predefined pipeline template that orchestrates the following sequence of operations: -\begin{enumerate*}[label=(\roman*)] - \item \emph{Data fetching}, including the download of the dataset from other states; - \item \emph{Data preparation}, including data merging, cleaning, and anonymization; - % \hl{QUESTO E' MERGE (M). IO PENSAVO DIVENTASSE UN NODO $v_i$.
NEL CASO CAMBIANDO LA DEFINIZIONE 3.1 DOVE NON ESISTONO PIU' I NODI MERGE E JOIN.} - \item \emph{Data analysis}, including statistical measures like average, median, and clustering-based statistics; - \item \emph{Data storage}, including the storage of the results; - \item \emph{Data visualization}, including the visualization of the results. -\end{enumerate*} +We refer to the service pipeline, annotated with both functional and non-functional requirements, as the \textbf{pipeline template}. It acts as a skeleton, specifying the structure of the pipeline, i.e., the chosen sequence of desired services, together with the functional and non-functional requirements driving service selection. \textcolor{red}{Given our context of a multi-tenant cloud-based ecosystem, each element within the pipeline may have a catalog of potential candidate services. Consequently, a \textbf{pipeline instance} is created by selecting, for each element of the template, the most suitable candidate service.} -We note that the template requires the execution of the entire service within the Connecticut Department of Correction. If the data needs to be transmitted beyond the boundaries of Connecticut, data protection measures must be implemented. A visual representation of the flow is presented in Figure \ref{fig:reference_scenario}. % -\cref{tab:dataset} presents a sample of the adopted dataset.\footnote{https://data.ct.gov/Public-Safety/Accused-Pre-Trial-Inmates-in-Correctional-Faciliti/b674-jy6w} Each row represents an inmate; each column includes the following attributes: date of download, a unique identifier, last entry date, race, gender, age of the individual, the bound value, offense, entry facility, and detainer. To serve the objectives of our study, we have extended this dataset by introducing randomly generated first and last names. +This process involves retrieving a set of compatible services for each component service in the template, ensuring that each service meets the functional requirements and aligns with the policies specified in the template. Since we also consider security policies that may necessitate security and privacy-aware data transformations, compatible services are ranked based on their capacity to fulfill the policies while preserving the maximum amount of information (\emph{data quality} in this paper). Indeed, our data governance approach operates under the assumption that \textit{preserving a larger quantity of data correlates with enhanced data quality}, a principle that holds in many real-world scenarios. However, we acknowledge that this assumption does not hold in all settings and leave solutions that depart from it to future work. % +The best service is then selected to instantiate the corresponding component service in the template. +Upon selecting the most suitable service for each component service in the pipeline template, the pipeline instance is completed and ready for execution. + +%This because our data governance approach builds on the following assumption: \emph{upholding a larger quantity of data is linked to better data quality.} +%While this assumption is not true in all settings, it correctly represents many real-world scenarios. We leave a solution that departs from this assumption to our future work.
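As a purely illustrative sketch (in Python, with hypothetical names and quality scores; not our implementation), the instantiation process described above could be outlined as follows: for each annotated vertex of the template, candidate services are filtered against the vertex policies, the resulting compatible set is ranked by the amount of information retained after the mandated transformation, and the best candidate is selected.
\begin{verbatim}
def compatible(profile, policies):
    # A candidate is compatible if its profile satisfies at least one policy (logical OR).
    return any(all(profile.get(k) == v for k, v in p["conditions"].items())
               for p in policies)

def instantiate(template_vertices, candidates_per_vertex):
    instance = {}
    for vertex in template_vertices:
        # Compatible services S' for this vertex: functional match is assumed,
        # policy compatibility is checked here.
        s_prime = [s for s in candidates_per_vertex[vertex["id"]]
                   if compatible(s["profile"], vertex["policies"])]
        if not s_prime:
            raise ValueError("no compatible service for vertex " + vertex["id"])
        # Rank by retained information (hypothetical quality score) and pick the best.
        instance[vertex["id"]] = max(s_prime, key=lambda s: s["retained_ratio"])
    return instance
\end{verbatim}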
+ +\subsection{Reference Scenario}\label{sec:service_definition} + +Our reference scenario considers a service pipeline analyzing a dataset of individuals detained in the Department of Correction facilities in the state of Connecticut while awaiting trial\footnote{https://data.ct.gov/Public-Safety/Accused-Pre-Trial-Inmates-in-Correctional-Faciliti/b674-jy6w}. + +\cref{tab:dataset} presents a sample of the adopted dataset. Each row represents an inmate; each column includes the following attributes: date of download, a unique identifier, last entry date, race, gender, age of the individual, the bound value, offense, entry facility, and detainer. To serve the objectives of our study, we extended this dataset by introducing randomly generated first and last names. \begin{table*}[ht!] \caption{Dataset sample} @@ -87,6 +60,22 @@ \subsection{Service Pipeline and Reference Scenario}\label{sec:service_definitio \end{adjustbox} \end{table*} + +In this context, the user, a member of the Connecticut Department of Correction (DOC), is interested in comparing admission trends in Connecticut prisons with those of the DOCs in New York and New Hampshire. We assume that the three DOCs are partners and share data according to their privacy policies. +The user's objective aligns with a predefined service pipeline \st{template} that orchestrates the following sequence of operations: +\begin{enumerate*}[label=(\roman*)] + \item \emph{Data fetching}, including the download of the dataset from other states; + \item \emph{Data preparation}, including data merging, cleaning, and anonymization; + % \hl{QUESTO E' MERGE (M). IO PENSAVO DIVENTASSE UN NODO $v_i$. NEL CASO CAMBIANDO LA DEFINIZIONE 3.1 DOVE NON ESISTONO PIU' I NODI MERGE E JOIN.} + \item \emph{Data analysis}, including statistical measures like average, median, and clustering-based statistics; + \item \emph{Data storage}, including the storage of the results; + \item \emph{Data visualization}, including the visualization of the results. +\end{enumerate*} + +A visual representation of the service pipeline is presented in Figure \ref{fig:reference_scenario}. + + \textcolor{red}{The department has specified some security requirements: the entire service execution must occur within the Connecticut Department of Correction. Moreover, if data transmission extends beyond Connecticut's borders, data protection measures must be implemented.} +% \begin{figure}[ht!] \centering \begin{tikzpicture}[scale=0.9,y=-1cm] @@ -160,7 +149,7 @@ \subsection{Service Pipeline and Reference Scenario}\label{sec:service_definitio % \draw[->] (node6) -- (node7); \end{tikzpicture} - \caption{Reference Scenario} + \caption{Service pipeline of the reference scenario} \label{fig:reference_scenario} \end{figure}
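For illustration only, assuming hypothetical field names and a simple masking strategy (the actual transformations are those mandated by the policies of Section \ref{sec:template}), the three anonymization levels could act on a record shaped like the dataset sample as follows.
\begin{verbatim}
LEVELS = {
    "level0": [],                                                # no anonymization
    "level1": ["first_name", "last_name"],                       # partial anonymization
    "level2": ["first_name", "last_name", "identifier", "age"],  # full anonymization
}

def anonymize(record, level):
    # Replace the attributes targeted by the chosen level with a masked value.
    return {k: ("*****" if k in LEVELS[level] else v) for k, v in record.items()}

# Hypothetical record shaped like the dataset sample (names are randomly generated).
inmate = {"identifier": "id-0001", "first_name": "John", "last_name": "Doe", "age": 41,
          "race": "WHITE", "gender": "M", "offense": "LARCENY"}
print(anonymize(inmate, "level1"))  # only first and last name are masked
\end{verbatim}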