From 3eaa14eb5696f45e2763845f0081e11c18809b00 Mon Sep 17 00:00:00 2001 From: Claudio Ardagna Date: Thu, 13 Jun 2024 15:18:01 +0200 Subject: [PATCH] claudio --- introduction.tex | 9 +---- pipeline_instance.tex | 12 ++----- pipeline_template.tex | 42 ++++------------------ pipeline_template_example.tex | 25 ++++--------- system_model.tex | 68 +++++++---------------------------- 5 files changed, 27 insertions(+), 129 deletions(-) diff --git a/introduction.tex b/introduction.tex index 6ff22e5..bb18d2d 100644 --- a/introduction.tex +++ b/introduction.tex @@ -1,6 +1,5 @@ \section{Introduction} The wide success and adoption of cloud infrastructures and their intrinsic multitenancy represent a paradigm shift in the big data scenario, redefining scalability and efficiency in data analytics. Multitenancy enables multiple users to share resources, such as computing power and storage, optimizing their utilization and reducing operational costs. Leveraging cloud infrastructure further enhances flexibility and scalability. -%allowing organizations to dynamically allocate resources based on demand while ensuring seamless access to cutting-edge data analytics tools and services. % The flip side of multitenancy is the increased complexity in data governance: the shared model introduces unique security challenges, as tenants may have different security requirements, access levels, and data sensitivity. Adequate measures such as encryption, access control mechanisms, and data anonymization techniques must be implemented to safeguard data against unauthorized access and ensure compliance with regulatory requirements such as GDPR or HIPAA. % @@ -13,25 +12,19 @@ \section{Introduction} When evaluating a solution meeting these criteria, the following questions naturally arise: \begin{enumerate} \item How does a robust data protection policy affect analytics? -%What impact does a robust and strong data protection policy have on analytics? -%How does a strong data protection policy influence analytics? -%\item When considering a pipeline, is it more advantageous to perform data protection transformations at each step rather than filtering all data at the outset? \item When considering a (big data) pipeline, should data protection be implemented at each pipeline step rather than filtering all data at the outset? \item In a scenario where a user has the option to choose among various candidate services, how might these choices affect the analytics? \end{enumerate} Based on the aforementioned considerations, we propose a data governance framework for modern data-driven pipelines, designed to mitigate privacy and security risks. The primary objective of this framework is to support the selection and assembly of data processing services within the pipeline, with a central focus on the selection of those services that optimize data quality, while upholding privacy and security requirements. To this aim, each element of the pipeline is \textit{annotated} with \emph{i)} data protection requirements expressing transformation on data and \emph{ii)} functional specifications on services expressing data manipulations carried out during each service execution. -%Each element in the pipeline, there is a catalog of candidate services among which the user running it can choose. Services may be functionally equivalent (i.e., they perform the same tasks), but have different security policies, more or less restrictive depending on the organization or service provider they belong to. 
Though applicable to a generic scenario, our data governance approach starts from the assumption that maintaining a larger volume of data leads to higher data quality; as a consequence, its service selection algorithm focuses on maximizing data quality by retaining the maximum amount of information when applying data protection transformations. -%We have a running example on which we conducted experiments to provide an initial answer to the questions highlighted earlier. The primary contributions of the paper can be summarized as follows: \begin{enumerate*} - \item Defining a data governance framework supporting selection and assembly of data processing services enriched with meta-data that describes both data protection and functional requirements; + \item Defining a data governance framework supporting selection and assembly of data processing services enriched with metadata that describe both data protection and functional requirements; \item Proposing a parametric heuristic tailored to address the computational complexity of the NP-hard service selection problem; \item Evaluating the performance and quality of the algorithm through experiments conducted using the dataset from the running example. \end{enumerate*} - The remainder of the paper is structured as follows: Section 2, presents our system model, illustrating a reference scenario where data is owned by multiple organizations. Section \ref{sec:template} introduces the pipeline template and describe data protection and functional annotations. Section \ref{sec:instance} describes the process of building a pipeline instance from a pipeline template according to service selection. Section \ref{sec:heuristics} introduces the quality metrics used in service selection and the heuristic solving the service selection problem. Section \ref{sec:experiment} presents our experimental results. Section \ref{sec:related} discusses the state of the art and Section \ref{sec:conclusions} draws our concluding remarks. \ No newline at end of file diff --git a/pipeline_instance.tex b/pipeline_instance.tex index da9e2cf..359d622 100644 --- a/pipeline_instance.tex +++ b/pipeline_instance.tex @@ -1,5 +1,4 @@ \section{Pipeline Instance}\label{sec:instance} -%Given a set of candidate services, a A \pipelineInstance $\iChartFunction$ instantiates a \pipelineTemplate \tChartFunction by selecting and composing services according to data protection and functional annotations in the template. It is formally defined as follows. \vspace{0.5em} \begin{definition}[Pipeline Instance]\label{def:instance} @@ -14,7 +13,9 @@ \section{Pipeline Instance}\label{sec:instance} \item $s'_i$ satisfies functional annotation \myGamma(\vi{i}) in \tChartFunction. \end{enumerate} \end{definition} + \vspace{0.5em} + Condition 1 requires that each selected service \sii{i} satisfies the policy requirements \P{i} of the corresponding vertex \vi{i} in the \pipelineTemplate, whereas Condition 2 is needed to preserve the process functionality, as it simply states that each service \sii{i} must satisfy the functional requirements \F{i} of the corresponding vertex \vi{i} in the \pipelineTemplate. We then define a \emph{pipeline instantiation} function that takes as input a \pipelineTemplate \tChartFunction and a set $S^c$ of candidate services, split in a specific set of services $S^c_{i}$ for each vertex \vi{i}$\in$$\V_S$, and returns as output a \pipelineInstance \iChartFunction. 
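To give a concrete, non-normative intuition of this instantiation step, the following minimal sketch selects, for each vertex \vi{i}$\in$$\V_S$, one candidate service that satisfies Conditions 1 and 2 of Definition~\ref{def:instance}. All names, data structures, and helper functions below are illustrative assumptions of ours (profiles and policies are reduced to flat attribute--value pairs) and are not part of the formal model; the ranking of compatible services by data quality is discussed in Section~\ref{sec:heuristics}.
\begin{verbatim}
# Illustrative sketch: instantiate a pipeline template by picking, for each
# service vertex v_i, one candidate from S^c_i satisfying its annotations.
def satisfies_policy(profile, policies):
    # Condition 1: policies are OR-combined; one matching policy suffices.
    return any(all(profile.get(name) == value for name, value in p.items())
               for p in policies)

def instantiate(service_vertices, policies, functional, candidates):
    instance = {}
    for v in service_vertices:                      # vertices v_i in V_S
        compatible = [s for s in candidates[v]      # candidate set S^c_i
                      if satisfies_policy(s["profile"], policies[v])
                      and functional[v] in s["functions"]]  # Condition 2
        if not compatible:
            raise ValueError(f"no compatible service for vertex {v}")
        instance[v] = compatible[0]  # ranking by retained data quality omitted
    return instance
\end{verbatim}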
Recall from Section~\ref{sec:funcannotation} that all candidate services meet the functional annotation in the template, meaning that Condition 2 in Definition~\ref{def:instance} is satisfied for all candidate services. @@ -37,7 +38,6 @@ \section{Pipeline Instance}\label{sec:instance} % vertexes \node[draw, circle, fill,text=white,minimum size=1 ] (sr) at (0,0) {}; - % \node[draw, circle] (node2) at (1,0) {$\s{1}$}; \node[draw, circle, plus,minimum size=1.5em] (plus) at (1.5,0) {}; \node[draw, circle] (s2) at (3.5,-2) {$\sii{1}$}; @@ -57,12 +57,10 @@ \section{Pipeline Instance}\label{sec:instance} \node[above] at (s3.north) {\function{2}}; \node[above] at (s4.north) {\function{4}}; \node[above] at (s5.north) {\function{5}}; - % \node[above] at (s6.north) {\function{}}; \node[above] at (s6.north) {\function{6}}; \node[above] at (s7.north) {\function{7}}; % Connection - % \draw[->] (node2) -- (node3); \draw[->] (sr) -- (plus); \draw[->] (plus) -- (s1); \draw[->] (plus) -- (s2); @@ -71,14 +69,8 @@ \section{Pipeline Instance}\label{sec:instance} \draw[->] (s1) -- (s4); \draw[->] (s2) -- (s4); \draw[->] (s3) -- (s4); - % \draw[->] (node6) -- (node65); - % \draw[->] (node65) -- (node7);3 \draw[->] (s4) -- (s5); \draw[->] (s5) -- (s6); - % \draw[->] (cross) -- (s5); - % \draw[->] (cross) -- (s6); - % \draw[->] (s5) -- (s7); - % \draw[->] (s6) -- (s7); \draw[->] (s6) -- (s7); \end{tikzpicture} diff --git a/pipeline_template.tex b/pipeline_template.tex index 7e24005..37448e1 100644 --- a/pipeline_template.tex +++ b/pipeline_template.tex @@ -1,6 +1,5 @@ \section{Pipeline Template}\label{sec:template} -Our approach integrates data protection and data management into the service pipeline using annotations. -To this aim, we extend the service pipeline in \cref{def:pipeline} with: \emph{i)} data protection annotations that also express transformations on data, ensuring compliance with data protection requirements, \emph{ii)} functional annotations to express data manipulations carried out during service execution. +Our approach integrates data protection and data management into the service pipeline using annotations. To this aim, we extend the service pipeline in \cref{def:pipeline} with: \emph{i)} data protection annotations that express transformations on data, ensuring compliance with data protection requirements, \emph{ii)} functional annotations that express data manipulations carried out during service execution. These annotations enable the implementation of an advanced data lineage, tracking the entire data lifecycle by monitoring changes that result from functional service execution and data protection requirements. In the following, we first introduce the annotated service pipeline, called pipeline template (Section \ref{sec:templatedefinition}). We then present both functional annotations (Section \ref{sec:funcannotation}) and data protection annotations (Section \ref{sec:nonfuncannotation}), providing an example of a pipeline template in the context of the reference scenario. @@ -11,7 +10,6 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition} \item an annotation function \myLambda:$\V_S\rightarrow$\P{} that associates a set of data protection requirements, in the form of policies $p$$\in$\P{}, with each vertex \vi{i}$\in$$\V_S$; \item an annotation function \myGamma:$\V_S\rightarrow$\F{} that associates a functional service description $F_i\in\F{}$ with each vertex \vi{i}$\in$$\V_S$. 
\end{enumerate*} -%The policies will be intended to guide the enforcement of data protection while the data transformation function will characterize the functional aspect of each vertex. The template is formally defined as follows. @@ -27,18 +25,14 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition} \vspace{0.5em} -We note that, at this stage, the template is not yet linked to any service. We also note that policies $p_j$$\in$\P{i} in \myLambda(\vi{i}) are combined using logical OR, meaning that the access decision is positive if at least one policy $p_j$ evaluates to \emph{true}. - %We also note that functional description $F_i$ includes the specific data transformation triggered as the result of a service execution. - The pipeline template of the service pipeline of \cref{fig:reference_scenario} is depicted in \cref{fig:service_composition_template}. +We note that, at this stage, the template is not yet linked to any service. We also note that policies $p_j$$\in$\P{i} in \myLambda(\vi{i}) are combined using logical OR, meaning that the access decision is positive if at least one policy $p_j$ evaluates to \emph{true}. The pipeline template of the service pipeline of \cref{fig:reference_scenario} is depicted in \cref{fig:service_composition_template}. - %The next sections better explain the functional and non-functional transformation functions. \begin{figure}[ht!] \centering \newcommand{\function}[1]{$\ensuremath{\myLambda_{#1},\myGamma_{#1}}$} \begin{tikzpicture}[scale=0.9] % vertexes \node[draw, circle, fill,text=white,minimum size=1 ] (sr) at (0,0) {}; - % \node[draw, circle] (node2) at (1,0) {$\s{1}$}; \node[draw, circle, plus,minimum size=1.5em] (plus) at (1.5,0) {}; \node[draw, circle] (s1) at (3,1.7) {$\vi{3}$}; \node[draw, circle] (s2) at (3,-1.7) {$\vi{1}$}; @@ -48,9 +42,6 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition} \node[draw, circle] (s4) at (4.5,0) {$\vi{4}$}; \node[draw, circle] (s5) at (6,0) {$\vi{5}$}; - % \node[draw, circle, cross,minimum size=1.5em] (cross) at (6,0) {}; - %\node[draw, circle] (s5) at (7.5,1.2) {$\vi{5}$}; - %\node[draw, circle] (s6) at (7.5,-1.2) {$\vi{6}$}; \node[draw, circle] (s7) at (7.5,0) {$\vi{6}$}; \node[draw, circle] (s8) at (9,0) {$\vi{7}$}; @@ -63,12 +54,10 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition} \node[above] at (s3.north) {\function{2}}; \node[above] at (s4.north) {\function{4}}; \node[above] at (s5.north) {\function{5}}; - % \node[above] at (s6.north) {\function{}}; \node[above] at (s7.north) {\function{6}}; \node[above] at (s8.north) {\function{7}}; % Connection - % \draw[->] (node2) -- (node3); \draw[->] (sr) -- (plus); \draw[->] (plus) -- (s1); \draw[->] (plus) -- (s2); @@ -77,15 +66,8 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition} \draw[->] (s1) -- (s4); \draw[->] (s2) -- (s4); \draw[->] (s3) -- (s4); - % \draw[->] (node6) -- (node65); - % \draw[->] (node65) -- (node7);3 \draw[->] (s4) -- (s5); \draw[->] (s5) -- (s7); - %\draw[->] (s4) -- (cross); - %\draw[->] (cross) -- (s5); - %\draw[->] (cross) -- (s6); - % \draw[->] (s5) -- (s7); - % \draw[->] (s6) -- (s7); \draw[->] (s7) -- (s8); \end{tikzpicture} @@ -114,18 +96,13 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition} More in detail, \textit{subject subj} specifies a service $s_i$ issuing an access request to perform an action on an object. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. 
For instance, (classifier$=$``SVM'') specifies a service providing an SVM classifier. We note that \textit{subj} can also specify conditions on the service owner (\textit{e.g.}, owner\_location$=$``EU'') and the service user (\textit{e.g.}, service\_user\_role$=$``DOC Director''). - %\item - \textit{Object obj} defines the data governed by the access policy. In this case, it is a set \{$pc_i$\} of \emph{Policy Conditions} on the object's attributes. %as defined in Definition \ref{def:policy_cond}. - %It can specify the \emph{type} of object, such as a file (e.g., a video, text file, image, etc.), a SQL or noSQL database, a table, a column, a row, or a cell of a table, or any other characteristics of the data. - For instance, \{(type$=$``dataset''), (region$=$CT)\} refers to an object of type dataset and whose region is Connecticut. + \textit{Object obj} defines the data governed by the access policy. In this case, it is a set \{$pc_i$\} of \emph{Policy Conditions} on the object's attributes. + For instance, \{(type$=$``dataset''), (region$=$``CT'')\} refers to an object of type dataset whose region is Connecticut. - %\item \textit{Action act} specifies the operations that can be performed within a big data environment, from traditional atomic operations on databases (e.g., CRUD operations) to coarser operations, such as an Apache Spark Directed Acyclic Graph (DAG), Hadoop MapReduce, an analytics function call, and an analytics pipeline. - %\item \textit{Environment env} defines a set of conditions on contextual attributes, such as time of the day, location, IP address, risk level, weather condition, holiday/workday, and emergency. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. For instance, (time$=$``night'') refers to a policy that is applicable only at night. - %\item \textit{Data Transformation \TP} defines a set of security and privacy-aware transformations on \textit{obj} that must be enforced before any access to data is given. Transformations focus on data protection, as well as on compliance with regulations and standards, in addition to simple format conversions. For instance, let us define three transformations that can be applied to the dataset in \cref{tab:dataset}, each performing different levels of anonymization: \begin{enumerate*}[label=\roman*)] \item level \emph{l0} (\tp{0}): no anonymization; @@ -133,19 +110,12 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition} \item level \emph{l2} (\tp{2}): full anonymization with first name, last name, identifier and age being anonymized. \end{enumerate*} - - -Access control policies $p_j$$\in$\P{i} annotating a vertex \vi{i} in a pipeline template $G^{\myLambda,\myGamma}$ specify the data protection requirements that candidate service must fulfill to be selected in the pipeline instance. Section~\ref{sec:instance} describes the selection process and pipeline instance generation. - - % To conclude, access control policies $p_j$$\in$\P{i} annotating vertex \vi{i} in a pipeline template $G^{\myLambda,\myGamma}$ filters out those candidate services $s$$\in$$S^c$ that do not match data protection requirements. Specifically, each policy $p_j$$\in$\P{i} verifies whether a candidate service $s$$\in$$S^c$ for vertex \vi{i} is compatible with data protection requirements in \myLambda(\vi{i}). Policy evaluation matches the profile \profile\ of a candidate service $s$$\in$$S^c$ with the policy conditions in each $p_j$$\in$\P{i}.
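Before detailing how this matching is used to keep or discard candidate services (next paragraph), the sketch below shows one possible, purely illustrative encoding of such a policy as a tuple of condition sets plus a data transformation, together with an evaluation that returns the transformations of the matching policies. Attribute names and values are examples of ours, loosely modeled on the conditions and anonymization levels introduced above; they are not part of the policy language.
\begin{verbatim}
# Illustrative encoding of an access control policy (subj, obj, act, env, T_P)
# and of its evaluation; attribute names and values are examples of ours.
POLICIES = [
    {"subject": {"owner_location": "EU"},   # conditions on the service/owner
     "object": {"type": "dataset"},
     "action": "READ",
     "environment": {},                     # ANY environment
     "transformation": "l1"},               # partial anonymization
    {"subject": {},                         # ANY subject
     "object": {"type": "dataset"},
     "action": "READ",
     "environment": {"time": "night"},
     "transformation": "l2"},               # full anonymization
]

def holds(conditions, attributes):
    # A set of policy conditions holds when every (name, value) pair matches.
    return all(attributes.get(n) == v for n, v in conditions.items())

def applicable_transformations(profile, obj, action, env):
    # Policies are OR-combined: collect the transformations of all matches.
    return [p["transformation"] for p in POLICIES
            if p["action"] == action and holds(p["subject"], profile)
            and holds(p["object"], obj) and holds(p["environment"], env)]
\end{verbatim}
Consistently with the discussion below, no policy enforcement is performed by such an evaluation; it is only used to filter candidate services.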
If the credentials and declarations in the candidate service profile, defined as a set of attributes in the form (\emph{name}, \emph{value}), fail to meet the policy conditions, that is, no policy $p_j$ evaluates to \emph{true}, the service is discarded; otherwise, it is added to the set $S'$ of compatible services, which is used in Section~\ref{sec:instance} to generate the pipeline instance $G'$. No policy enforcement is done at this stage. +Access control policies $p_j$$\in$\P{i} annotating a vertex \vi{i} in a pipeline template $G^{\myLambda,\myGamma}$ specify the data protection requirements that a candidate service must fulfill to be selected in the pipeline instance. Section~\ref{sec:instance} describes the selection process and the pipeline instance generation. \subsection{Functional Annotations}\label{sec:funcannotation} A proper data management approach must track functional data manipulations across the entire pipeline execution, defining the functional requirements of each service operating on data. To this aim, each vertex \vi{i}$\in\V_S$ is annotated with a label \myGamma(\vi{i}), corresponding to the functional description $F_i$ of the service $s_i$ represented by \vi{i}. -$F_i$ describes the functional requirements on the corresponding service $s_i$, such as API, inputs, expected outputs. -%The latter is modeled as a functional transformation function \TF\ that is applied to the data when executing service $s_i$. \TF\ has a twofold role: -%\begin{enumerate}[label=\roman*)] -% \item it contains the functional requirements that the service must satisfy, in terms of expected input, expected output, prototype and other functional aspects. +$F_i$ describes the functional requirements, such as API, inputs, and expected outputs. It also specifies a set \TF{} of data transformation functions \tf{i}, which can be triggered during the execution of the corresponding service $s_i$. Function $\tf{i}$$\in$$\TF{}$ can be: diff --git a/pipeline_template_example.tex b/pipeline_template_example.tex index 6b6dc19..e514762 100644 --- a/pipeline_template_example.tex +++ b/pipeline_template_example.tex @@ -1,5 +1,3 @@ -%\subsection{Example}\label{sec:example_template} - \begin{table*}[ht!] 
\def\arraystretch{1.5} \centering @@ -12,15 +10,13 @@ \vi{1},\vi{2},\vi{3} & $\p{0}$ & \policy{ANY}{dataset}{READ}{ANY}{\tp{0}} \\ \vi{4},\vi{5} & $\p{1}$ & \policy{\{\pone\}}{dataset}{READ}{ANY}{\tp{0}} \\ \vi{4},\vi{5} & $\p{2}$ & \policy{\{\ptwo\}}{dataset}{READ}{ANY}{\tp{1}} \\ - %\vi{4},\vi{6} & $\p{3}$ & \policy{\pthree}{dataset}{READ}{ANY}{\tp{2}} \\ - \vi{6} & $\p{3}$ & \policy{\{$(service\_region= dataset\_origin)$\}}{dataset}{WRITE}{ANY}{\tp{0}} \\ - \vi{6} & $\p{4}$ & \policy{\{$(service\_region=\{NY,NH\})$\}}{dataset}{WRITE}{ANY}{\tp{1}} \\ - \vi{7} & $\p{5}$ & \policy{ANY}{dataset} {READ}{\langle$environment = risky$\rangle}{\tp{3}} \\ - \vi{7} & $\p{6}$ & \policy{ANY}{dataset} {READ}{\langle$environment = not\_risky$\rangle}{\tp{4}} \\ + \vi{6} & $\p{3}$ & \policy{\{$(service\_region=dataset\_origin)$\}}{dataset}{WRITE}{ANY}{\tp{0}} \\ + \vi{6} & $\p{4}$ & \policy{\{$(service\_region=\{``NY'',``NH''\})$\}}{dataset}{WRITE}{ANY}{\tp{1}} \\ + \vi{7} & $\p{5}$ & \policy{ANY}{dataset} {READ}{\langle$environment=``risky''$\rangle}{\tp{3}} \\ + \vi{7} & $\p{6}$ & \policy{ANY}{dataset} {READ}{\langle$environment=``not\_risky''$\rangle}{\tp{4}} \\ \end{tabular} & - %\caption{Anonymization levels}\label{tab:levels} \begin{tabular}[t]{c|c|l} \textbf{\tp{i}} & \textbf{Level} & \textbf{Data Transformation} \\\hline \tp{0} & $l_0$ & $anon(\varnothing)$ \\ @@ -38,7 +34,6 @@ \begin{example}[\bf \pipelineTemplate]\label{ex:template} Let us consider the reference scenario introduced in \cref{sec:systemmodel}. \cref{fig:service_composition_template} presents an example of pipeline template consisting of five stages, each one annotated with a policy in \cref{tab:anonymization}. -% We recall that \cref{tab:dataset} shows a sample of our reference dataset. % 1° NODO % The first stage consists of three parallel vertices \vi{1}, \vi{2}, \vi{3} for data collection. @@ -48,23 +43,17 @@ The second stage consists of vertex \vi{4}, merging the three datasets obtained at the first stage. Data protection annotation \myLambda(\vi{4}) refers to policies \p{1} and \p{2}, which apply different data transformations depending on the relation between the dataset and the service owner. % 2° NODO % If the service owner is also the dataset owner (i.e., \pone), the dataset is not anonymized (\tp{0}). If the service owner is a partner of the dataset owner (i.e., \ptwo), the dataset is anonymized at \emph{level1} (\tp{1}). If the service owner has no partner relationship with the dataset owner, no policy applies. -%if the service owner is neither the dataset owner nor a partner of the dataset owner (\pthree), the dataset is anonymized level2 (\tp{2}). Functional requirement \F{4} prescribes $n$ datasets as input and the merged dataset as output. % 3° NODO % The third stage consists of vertex \vi{5} for data analysis. Data protection annotation \myLambda(\vi{5}) refers to policies \p{1} and \p{2}, as for the second stage. -% The logic remains consistent: -% if the service profile matches with the data owner (\pone), \p{1} matches and level0 anonymization is applied (\tp{0}); -% if the service profile matches with a partner of the owner (\ptwo), \p{2} matches and level1 anonymization is applied (\tp{1}); -% if the service profile doesn't match with a partner nor with the owner (\pthree), \p{3} matches and level2 anonymization is applied (\tp{2}). Functional requirement \F{5} prescribes a dataset as input and the results of the data analysis as output. 
% 5° NODO % - The fourth stage consists of vertex \vi{6}, managing data storage. Data protection annotation \myLambda(\vi{6}) refers to policies \p{3} and \p{4}, which apply different data transformations depending on the relation between the dataset and the service region. -If the service region is the dataset origin (condition $(service\_region=dataset\_origin)$ in \p{3}) , the dataset is anonymized at level $l_0$ (\tp{0}). -If the service region is in a partner region (condition $(service\_region=\{NY,NH\})$ in \p{4}), the dataset is anonymized at level $l_1$ (\tp{1}). +If the service region is the dataset origin (condition $(service\_region$$=$$dataset\_origin)$ in \p{3}), the dataset is anonymized at level $l_0$ (\tp{0}). +If the service region is in a partner region (condition ($service\_region$=\{``$NY$'',``$NH$''\}) in \p{4}), the dataset is anonymized at level $l_1$ (\tp{1}). Functional requirement \F{6} prescribes a dataset as input and the URI of the stored data as output. % 6° NODO % @@ -74,6 +63,4 @@ If the environment is risky (\p{5}), the data are anonymized at level $r_0$ (\tp{3}). If the environment is not risky (\p{6}), the data are anonymized at level $r_1$ (\tp{4}). Functional requirement \F{7} prescribes a dataset as input and a data visualization interface (possibly in the form of a JSON file) as output. - -%In summary, this tion has delineated a comprehensive pipeline template. This illustrative pipeline serves as a blueprint, highlighting the role of policy implementation in safeguarding data protection across diverse operational stages. \end{example} \ No newline at end of file diff --git a/system_model.tex b/system_model.tex index dd1e563..4278a98 100644 --- a/system_model.tex +++ b/system_model.tex @@ -1,44 +1,39 @@ \section{System Model and Reference Scenario}\label{sec:requirements} -We present our system model (Section \ref{sec:systemmodel}) and our reference scenario (Section \ref{sec:service_definition}). %\textcolor{red}{that consists of a service pipeline for analyzing a dataset of individuals awaiting trial detained in the Department of Correction facilities in the state of Connecticut}. +We present our system model (Section \ref{sec:systemmodel}) and our reference scenario (Section \ref{sec:service_definition}). \subsection{System Model}\label{sec:systemmodel} We consider a service-based environment where a service pipeline is designed to analyze data. Our system model is derived from a generic big-data framework and enriched with metadata specifying data protection requirements and functional specifications. It is composed of the following parties: \begin{itemize} \item \emph{Service}, a software component distributed by a service provider that performs a specific task; \item \emph{Service Pipeline}, a sequence of connected services that collect, prepare, process, and analyze data in a structured and automated manner; - \item \emph{Data Governance Policy}, a structured set of privacy guidelines, rules, and procedures regulating data access, sharing, and protection; %\textcolor{red}{In particular, each component service in the pipeline is annotated with data protection requirements and functional specifications.} + \item \emph{Data Governance Policy}, a structured set of privacy guidelines, rules, and procedures regulating data access, sharing, and protection; \item \emph{User}, executing an analytics pipeline on the data. We assume the user is authorized to perform this operation, either as the data owner or as a data processor with the owner's consent. 
- \item \emph{Dataset}, the data target of the analytics pipeline. We assume all data are ready for analysis, that is, they underwent a preparatory phase addressing issues such as missing values, outliers, and formatting discrepancies. %This ensures that the data are in an optimal state for subsequent analysis.} + \item \emph{Dataset}, the data target of the analytics pipeline. We assume all data are ready for analysis, that is, they underwent a preparatory phase addressing issues such as missing values, outliers, and formatting discrepancies. \end{itemize} \vspace{0.5em} -A service pipeline is a graph formally defined as follows. % and depicted in \cref{fig:service_pipeline}. +A service pipeline is a graph formally defined as follows. \vspace{0.5em} -\begin{definition}[\pipeline]\label{def:pipeline} - % A \pipeline is as a direct acyclic graph G(\V,\E), where \V\ is a set of vertices and \E\ is a set of edges connecting two vertices \vi{i},\vi{k}$\in$\V. The graph has a root \vi{r}$\in$\V, a vertex \vi{i}$\in$\V$_S$ for each service $s_i$, two additional vertices \vi{c},\vi{m}$\in$\V$_{\timesOperator}$$\subset$\V\ for each alternative ($\timesOperator$) structure modeling the alternative execution (\emph{choice}) of operations and the retrieval (\emph{merge}) of the results, respectively, and one additional vertex \vi{f} $\in$\V$_{\plusOperator}$$\subset$\V\ for each parallel ($\plusOperator$) structure modeling the contemporary execution (\emph{fork}) of operations. +\begin{definition}[Service \pipeline]\label{def:pipeline} A \pipeline is a directed acyclic graph G(\V,\E), where \V\ is a set of vertices and \E\ is a set of edges connecting two vertices \vi{i},\vi{k}$\in$\V. - The graph has a root ($\bullet$) vertex \vi{r}$\in$\V , a vertex \vi{i}$\in$\V$_S$ for each service $s_i$, an additional vertex \vi{f}$\in$\V\ for each parallel ($\plusOperator$) structure modeling the contemporary execution (\emph{fork}) of services. + The graph has a root ($\bullet$) vertex \vi{r}$\in$\V, a vertex \vi{i}$\in$$V_S$ for each service $s_i$, and an additional vertex \vi{f}$\in$\V\ for each parallel ($\plusOperator$) structure modeling the contemporary execution (\emph{fork}) of services. \end{definition} \vspace{0.5em} -We note that \V$=$\{\vi{r},\vi{f}\}$\cup$\V$_S$, with vertices \vi{f} modeling branching for parallel structures, and root \vi{r} possibly representing the orchestrator. In addition, for simplicity but no lack of generality, alternative structures modeling the alternative execution of services are specified as alternative service pipelines, that is, there is no alternative structure in a single service pipeline. +We note that \V$=$\{\vi{r},\vi{f}\}$\cup$$V_S$, with vertices \vi{f} modeling branching for parallel structures, and root \vi{r} possibly representing the orchestrator. In addition, for simplicity but without loss of generality, alternative structures modeling the alternative execution of services are specified as alternative service pipelines, that is, there is no alternative structure in a single service pipeline. -We refer to the service pipeline annotated with both functional and non-functional requirements, as the \textbf{pipeline template}. It acts as a skeleton, specifying both the structure of the pipeline, that is, the chosen sequence of desired services, and the functional and non-functional requirements. We note that, in our multi-tenant cloud-based ecosystem, each element within the pipeline may have a catalog of candidate services. 
A pipeline template is then instantiated in a \textbf{pipeline instance} by selecting the most suitable candidates from the pipeline template. +We refer to the service pipeline annotated with both functional and non-functional requirements, as the \textbf{pipeline template}. It acts as a skeleton, specifying both the structure of the pipeline, that is, the chosen sequence of desired services, and the functional and non-functional requirements for each component service. We note that, in our multi-tenant cloud-based ecosystem, each element within the pipeline may have a catalog of candidate services. A pipeline template is then instantiated in a \textbf{pipeline instance} by selecting the most suitable candidates from the pipeline template. This process involves retrieving a set of compatible services for each vertex in the template, ensuring that each service meets the functional requirements and aligns with the policies specified in the template. Since we also consider security policies that may necessitate security and privacy-aware data transformations, compatible services are ranked based on their capacity to fulfill the policy while preserving the maximum amount of information (\emph{data quality} in this paper). Indeed, our data governance approach, though applicable in a generic scenario, operates under the assumption that \textit{preserving a larger quantity of data correlates with enhanced data quality}, a principle that represents many real-world scenarios. However, we acknowledge that this assumption may not universally apply and remain open to exploring alternative solutions in future endeavors. % The best service is then selected to instantiate the corresponding component service in the template. Upon selecting the most suitable service for each component service in the pipeline template, the pipeline instance is completed and ready for execution. -%This because our data governance approach builds on the following assumption: \emph{upholding a larger quantity of data is linked to better data quality.} -%While this assumption is not true in all settings, it correctly represents many real-world scenarios. We leave a solution that departs from this assumption to our future work. - \subsection{Reference Scenario}\label{sec:service_definition} - Our reference scenario considers a service pipeline analyzing a dataset of individuals detained in the Department of Correction facilities in the state of Connecticut while awaiting trial\footnote{https://data.ct.gov/Public-Safety/Accused-Pre-Trial-Inmates-in-Correctional-Faciliti/b674-jy6w}. \cref{tab:dataset} presents a sample of the adopted dataset. Each row represents an inmate; each column includes the following attributes: date of download, a unique identifier, last entry date, race, gender, age of the individual, the bound value, offense, entry facility, and detainer. To serve the objectives of our study, we extended this dataset by introducing randomly generated first and last names. @@ -68,22 +63,17 @@ \subsection{Reference Scenario}\label{sec:service_definition} \end{table*} -In this context, the user, a member of the Connecticut Department of Correction (DOC), is interested to compare admission trends in Connecticut prisons with the ones in New York and New Hampshire. We assume that the three DOCs are partners and share data according to their privacy policies. The entire service execution must occur within the Connecticut Department of Correction. 
Moreover, if data transmission extends beyond Connecticut's borders, data protection measures must be implemented. +In this context, the user, a member of the Connecticut Department of Correction (DOC), is interested in comparing the admission trends in Connecticut prisons with those in New York and New Hampshire. We assume that the three DOCs are partners and share data according to their privacy policies. Moreover, the policy specifies that the entire service execution must occur within the Connecticut Department of Correction. If data transmission extends beyond Connecticut's borders, data protection measures must be implemented. -The user's objective aligns with a predefined service pipeline %\st{template} -that orchestrates the following sequence of operations: +The user's objective aligns with the predefined service pipeline in Figure \ref{fig:reference_scenario} that orchestrates the following sequence of operations: \begin{enumerate*}[label=(\roman*)] \item \emph{Data fetching}, including the download of the datasets from other states; \item \emph{Data preparation}, including data merging, cleaning, and anonymization; - % \hl{QUESTO E' MERGE (M). IO PENSAVO DIVENTASSE UN NODO $v_i$. NEL CASO CAMBIANDO LA DEFINIZIONE 3.1 DOVE NON ESISTONO PIU' I NODI MERGE E JOIN.} \item \emph{Data analysis}, including statistical measures like average, median, and clustering-based statistics; \item \emph{Data storage}, including the storage of the results; \item \emph{Data visualization}, including the visualization of the results. \end{enumerate*} -A visual representation of the service pipeline is presented in Figure \ref{fig:reference_scenario}. -%\textcolor{red}{The department has specified some security requirements: the entire service execution must occur within the Connecticut Department of Correction. Moreover, if data transmission extends beyond Connecticut's borders, data protection measures must be implemented.} -% \begin{figure}[ht!] 
\centering \begin{tikzpicture}[scale=0.9,y=-1cm] @@ -100,34 +90,14 @@ \subsection{Reference Scenario}\label{sec:service_definition} \node[draw, circle,below=1em] (merge) at (node2.south) {$\vi{4}$}; \node[draw, circle,below=1em] (node5) at (merge.south) {$\vi{5}$}; - % \node[draw, circle, cross , minimum size=1.5em,below=1em] (fork) at (merge.south) {}; - % \node[draw, circle,below =1.5em, left=2em] (ml) at (fork.south) {$\vi{5}$}; - % \node[draw, circle,below =1.5em, right=2em] (analysis) at (fork.south) {$\vi{6}$}; - % \node[draw, circle, cross , minimum size=1.5em,below=3em] (join) at (fork.south) {}; - \node[draw, circle,below =1em ] (storage) at (node5.south) {$\vi{6}$}; \node[draw, circle,below =1.5em] (visualization) at (storage.south) {$\vi{7}$}; % Labels - % \node[right=1em] at (node3.east) {Data fetching}; - % \node[right=1em] at (merge.east) {Data preparation}; - % \node[right=1em] at (split.east) {$parallel$}; - % % \node[right=1em] at (fork.east) {$alternative$}; - % % \node[right=1em] at (analysis.east) {ML task}; - % % \node[left=1em] at (ml.west) {Data analysis}; - % \node[right=1em] at (storage.east) {Data Storage}; - % \node[right=1em] at (visualization.east) {Data Visualization}; - % \node[draw, circle,below =1em ] (storage) at (node5.south) {$\vi{6}$}; - % \node[draw, circle,below =1.5em] (visualization) at (storage.south) {$\vi{7}$}; - - % Labels - \node[right=1em] at (node3.east) {i) Data fetching}; \node[right=1em] at (merge.east) {ii) Data preparation}; \node[right=1em] at (split.east) {$parallel$}; - % \node[right=1em] at (fork.east) {$alternative$}; - % \node[right=1em] at (analysis.east) {ML task}; \node[right=1em] at (node5.east) {iii) Data analysis}; \node[right=1em] at (storage.east) {iv) Data Storage}; \node[right=1em] at (visualization.east) {v) Data Visualization}; @@ -144,24 +114,10 @@ \subsection{Reference Scenario}\label{sec:service_definition} \draw[->] (node3) -- (merge); \draw[->] (merge) -- (node5); \draw[->] (node5) -- (storage); - % \draw[->] (fork) -- (ml); - % \draw[->] (fork) -- (analysis); - - % \draw[->] (analysis) -- (storage); - % \draw[->] (ml) -- (storage); - % \draw[->] (merge) -- (fork); + \draw[->] (storage) -- (visualization); - % \draw[->] (node3) -- (node6); - % \draw[->] (node4) -- (node6); - % \draw[->] (node5) -- (node6); - % \draw[->] (node6) -- (node7); \end{tikzpicture} - \caption{Service pipeline of the reference scenario} + \caption{Service pipeline in the reference scenario} \label{fig:reference_scenario} \end{figure} - -% Scarichiamo tre dataset, nessuna anonimizzazione, nodo di merge, anonimizzo e pulisco tutto, -%nodi alternativa ML e analisi, merge, storage, visulazzionezione -%aggiungere nodo finale -%agigungere nodo \ No newline at end of file
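As a purely illustrative complement to the figure (the encoding below is our own and not part of the framework), the service pipeline of the reference scenario can be written down as a directed acyclic graph in the sense of Definition~\ref{def:pipeline}, with a root vertex, one parallel (fork) vertex, and one vertex per service:
\begin{verbatim}
# Reference-scenario pipeline as a DAG (adjacency lists); "root" stands for
# v_r, "fork" for the parallel (+) vertex, and v1..v7 for the service vertices.
PIPELINE = {
    "root": ["fork"],
    "fork": ["v1", "v2", "v3"],   # (i) data fetching, in parallel
    "v1": ["v4"], "v2": ["v4"], "v3": ["v4"],
    "v4": ["v5"],                 # (ii) data preparation (merge, cleaning, ...)
    "v5": ["v6"],                 # (iii) data analysis
    "v6": ["v7"],                 # (iv) data storage
    "v7": [],                     # (v) data visualization
}

def execution_order(graph):
    # Depth-first topological sort: one valid order in which to run the services.
    visited, order = set(), []
    def visit(v):
        if v not in visited:
            visited.add(v)
            for w in graph[v]:
                visit(w)
            order.append(v)
    for v in graph:
        visit(v)
    return list(reversed(order))

# execution_order(PIPELINE) ->
# ['root', 'fork', 'v3', 'v2', 'v1', 'v4', 'v5', 'v6', 'v7']
\end{verbatim}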