diff --git a/main.pdf b/main.pdf index c1e1a32..7f3ec7a 100644 Binary files a/main.pdf and b/main.pdf differ diff --git a/pipeline_template.tex b/pipeline_template.tex index d22f918..b00a02d 100644 --- a/pipeline_template.tex +++ b/pipeline_template.tex @@ -1,31 +1,30 @@ \section{Pipeline Template}\label{sec:template} Our approach integrates data protection and data management into the service pipeline using annotations. -To this aim, we extend the service pipeline in \cref{def:pipeline} with: \emph{i)} data protection annotations expressing transformations on data to enforce data protection requirements, \emph{ii)} functional annotations expressing data manipulations carried out during services execution. -These annotations permit to implement an advanced data lineage, tracking the entire data lifecycle by monitoring changes arising from functional service execution and data protection requirements. +To this aim, we extend the service pipeline in \cref{def:pipeline} with: \emph{i)} data protection annotations to express transformations on data, ensuring compliance with data protection requirements, and \emph{ii)} functional annotations to express data manipulations carried out during service execution. +These annotations enable the implementation of advanced data lineage, tracking the entire data lifecycle by monitoring changes that result from functional service execution and data protection requirements. In the following, we first introduce the annotated service pipeline, called pipeline template (Section \ref{sec:templatedefinition}). We then present functional annotations (Section \ref{sec:funcannotation}) and data protection annotations (Section \ref{sec:nonfuncannotation}). We finally provide an example of a pipeline template (Section \ref{sec:example}).
\subsection{Pipeline Template Definition}\label{sec:templatedefinition} -Given the service pipeline in Definition~\ref{def:pipeline}, we use annotations to express data protection requirements to be enforced on data and functional requirements on services to be integrated in the pipeline. Each service vertex in the service pipeline is labeled with two mapping functions forming a pipeline template: +Given the service pipeline in Definition~\ref{def:pipeline}, we use annotations to express data protection requirements to be enforced on data and functional requirements on the services to be integrated in the pipeline. Each service vertex in the service pipeline is labeled with two mapping functions forming a pipeline template: \begin{enumerate*}[label=\textit{\roman*})] - \item a labeling function \myLambda:$\V_S\rightarrow$\P{} that associates a set of data protection requirements, in the form of policies $p$$\in$\P{}, with each vertex \vi{i}$\in$$\V_S$; - \item a labeling function \myGamma:$\V_S\rightarrow$\F{} that associates a functional service description $F_i\in\F{}$ with each vertex \vi{i}$\in$$\V_S$. + \item an annotation function \myLambda:$\V_S\rightarrow$\P{} that associates a set of data protection requirements, in the form of policies $p$$\in$\P{}, with each vertex \vi{i}$\in$$\V_S$; + \item an annotation function \myGamma:$\V_S\rightarrow$\F{} that associates a functional service description $F_i\in\F{}$ with each vertex \vi{i}$\in$$\V_S$. \end{enumerate*} %The policies will be intended to guide the enforcement of data protection while the data transformation function will characterize the functional aspect of each vertex. The template is formally defined as follows. 
\begin{definition}[Pipeline Template] \label{def:template} - Given a service pipeline G(\V,\E), a pipeline template $G^{\myLambda,\myGamma}$(V,E,\myLambda,\myGamma) is a direct acyclic graph with two labeling functions: - \begin{enumerate}[label=\textit{\roman*}] - \item \myLambda that assigns a label \myLambda(\vi{i}), corresponding to a set \P{i} of policies $p_j$ to be satisfied by service $s_i$ represented by \vi{i}, for each vertex $\vi{i}\in\V_S$; - \item \myGamma that assigns a label \myGamma(\vi{i}), corresponding to the functional description $F_i$ of service $s_i$ represented by \vi{i}, for each vertex $\vi{i}\in\V_S$. + Given a service pipeline G(\V,\E), a pipeline template $G^{\myLambda,\myGamma}$(V,E,\myLambda,\myGamma) is a directed acyclic graph extended with two annotation functions: + \begin{enumerate}%[label=\textit{\roman*}] + \item \emph{Data Protection Annotation} \myLambda that assigns a label \myLambda(\vi{i}) to each vertex $\vi{i}\in\V_S$. Label \myLambda(\vi{i}) corresponds to a set \P{i} of policies $p_j$ to be satisfied by service $s_i$ represented by \vi{i}; + \item \emph{Functional Annotation} \myGamma that assigns a label \myGamma(\vi{i}) to each vertex $\vi{i}\in\V_S$. Label \myGamma(\vi{i}) corresponds to the functional description $F_i$ of service $s_i$ represented by \vi{i}. \end{enumerate} \end{definition} -We note that, at this stage, the template is not yet linked to any service. -We also note that policies $p_j$$\in$\P{i} annotated with \myLambda(\vi{i}) are ORed, meaning that the access decision is positive if at least one policy $p_j$ is evaluated to \emph{true}. +We note that, at this stage, the template is not yet linked to any service. We also note that policies $p_j$$\in$\P{i} in \myLambda(\vi{i}) are combined using logical OR, meaning that the access decision is positive if at least one policy $p_j$ evaluates to \emph{true}.
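The definition above can be sketched in code. The following minimal Python sketch is our own illustration (the class, method, and policy names are hypothetical, not from the paper): a DAG whose service vertices carry the two annotation maps, with the OR semantics of \myLambda(\vi{i}) applied when checking a service profile.

```python
# Illustrative sketch of a pipeline template (names are ours, not the paper's):
# a DAG whose service vertices carry two annotations, Lambda (data protection
# policies) and Gamma (functional descriptions).

class PipelineTemplate:
    def __init__(self, vertices, edges):
        self.vertices = set(vertices)
        self.edges = set(edges)   # directed edges (v_i, v_j)
        self.Lambda = {}          # v_i -> list of policies P_i, combined with OR
        self.Gamma = {}           # v_i -> functional description F_i

    def annotate(self, v, policies, functional_desc):
        self.Lambda[v] = list(policies)
        self.Gamma[v] = functional_desc

    def satisfies(self, v, profile):
        # OR semantics: the access decision is positive if at least one
        # policy p_j in Lambda(v_i) evaluates to true on the service profile.
        return any(p(profile) for p in self.Lambda[v])

# Policies are modeled here as predicates over a service profile (a dict of
# attributes); the conditions below are hypothetical examples.
t = PipelineTemplate({"v4"}, set())
p1 = lambda prof: prof.get("owner") == "dataset_owner"
p2 = lambda prof: prof.get("owner") == "partner"
t.annotate("v4", [p1, p2], "merge n input datasets")
print(t.satisfies("v4", {"owner": "partner"}))   # True: p2 matches
```

Note that, as in the definition, the template itself stores only annotations; no service is bound to a vertex at this stage.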
%We also note that functional description $F_i$ includes the specific data transformation triggered as the result of a service execution. An example of pipeline template is depicted in \cref{fig:service_composition_template} @@ -86,7 +85,7 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition} \draw[->] (s7) -- (s8); \end{tikzpicture} - \caption{Pipeline Template} + \caption{Pipeline Template} \label{fig:service_composition_template} \end{figure} @@ -97,46 +96,75 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition} A \emph{Policy Condition pc} is a Boolean expression of the form $($\emph{attr\_name} op \emph{attr\_value}$)$, with op$\in$\{$<$,$>$,$=$,$\neq$,$\leq$,$\geq$\}, \emph{attr\_name} an attribute label, and \emph{attr\_value} the corresponding attribute value. \end{definition} - An access control policy then specifies who (\emph{subject}) can access what (\emph{object}) with action (\emph{action}), in a specific context (\emph{environment}) and under specific obligations (\emph{data transformation}), as formally defined below. - + Building on policy conditions, an access control policy is then defined as follows. + \begin{definition}[Policy]\label{def:policy_rule} - A {\it policy p}$\in$\P{} is 5-uple $<$\textit{subj}, \textit{obj}, \textit{act}, \textit{env}, \textit{\TP}$>$, where: - \begin{description} - \item Subject \textit{subj} defines a service $s_i$ issuing an access request to perform an action on an object. It is of the form $<$\emph{id}, \{$pc_i$\}$>$, where \emph{id} defines a class of services (e.g., classifier), and \{$pc_i$\} is a set of \emph{Policy Conditions} on the subject, as defined in Definition \ref{def:policy_cond}. For instance, $<$\emph{service},\{(classifier $=$ "SVM")\}$>$ refers to a service providing a SVM classifier.
We note that \textit{subj} can also specify conditions on the service owner, such as, $<$\emph{service},\{(owner\_location $=$ "EU")\}$>$ and on the service user, such as, $<$\emph{service},\{(service\_user\_role $=$ "DOC Director")\}$>$. + A {\it policy p}$\in$\P{} is a 5-tuple $<$\textit{subj}, \textit{obj}, \textit{act}, \textit{env}, \textit{\TP}$>$, which specifies who (\emph{subject}) can access what (\emph{object}) with which action (\emph{action}), in a specific context (\emph{environment}) and under specific obligations (\emph{data transformation}). + \end{definition} + + In more detail, \textit{subject subj} defines a service $s_i$ issuing an access request to perform an action on an object. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. For instance, $<$\{(classifier $=$ "SVM")\}$>$ specifies a service providing a SVM classifier. We note that \textit{subj} can also specify conditions on the service owner ($<$\{(owner\_location $=$ "EU")\}$>$) and the service user ($<$\{(service\_user\_role $=$ "DOC Director")\}$>$). - \item Object \textit{obj} defines any data whose access is governed by the policy. It is of the form $<$\emph{type}, \{$pc_i$\}$>$, where: \emph{type} defines the type of object, such as a file (e.g., a video, text file, image, etc.), a SQL or noSQL database, a table, a column, a row, or a cell of a table, and \{$pc_i$\} is a set of \emph{Policy Conditions} defined on the object's attributes. For instance, $<$\emph{dataset},\{(region $=$ CT)\}$>$ refers to a dataset whose region is Connecticut. + %\item + \textit{Object obj} defines any data whose access is governed by the policy. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}.
+ %It can specify the \emph{type} of object, such as a file (e.g., a video, text file, image, etc.), a SQL or noSQL database, a table, a column, a row, or a cell of a table, or any other characteristics of the data. + For instance, $<$\{(type $=$ "dataset"), (region $=$ "CT")\}$>$ refers to an object of type dataset whose region is Connecticut. - \item Action \textit{act} defines any operations that can be performed within a big data environment, from traditional atomic operations on databases (e.g., CRUD operations varying depending on the data model) to coarser operations, such as an Apache Spark Direct Acyclic Graph (DAG), Hadoop MapReduce, an analytics function call, or an analytics pipeline. + %\item + \textit{Action act} defines any operation that can be performed within a big data environment, from traditional atomic operations on databases (e.g., CRUD operations) to coarser operations, such as an Apache Spark Directed Acyclic Graph (DAG), Hadoop MapReduce, an analytics function call, or an analytics pipeline. - \item Environment \textit{env} defines a set of conditions on contextual attributes, such as time of the day, location, IP address, risk level, weather condition, holiday/workday, emergency. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. For instance, $<$\emph{env},\{(time $=$ "night")\}$>$ refers to a policy that is applicable only at night. + %\item + \textit{Environment env} defines a set of conditions on contextual attributes, such as time of day, location, IP address, risk level, weather condition, holiday/workday, or emergency. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. For instance, $<$\{(time $=$ "night")\}$>$ refers to a policy that is applicable only at night. - \item Data Transformation \textit{\TP} defines a set of security and privacy-aware transformations on \textit{obj}, which must be enforced before any access to data.
- Transformations focus on data protection, as well as compliance to regulations and standards, in addition to simple format conversions. - For instance, let us define three transformations that can be applied to the \cref{tab:dataset}: + %\item + \textit{Data Transformation \TP} defines a set of security and privacy-aware transformations on \textit{obj}, which must be enforced before any access to data. Transformations focus on data protection, as well as compliance with regulations and standards, in addition to simple format conversions. For instance, let us define three transformations that can be applied to \cref{tab:dataset}: \begin{enumerate*}[label=\roman*)] - \item \emph{level0} (\tp{0}): no anonymization is carried out; - \item \emph{level1} (\tp{1}): The data has been partially anonymized with only the first name and last name being anonymized; - \item \emph{level2} (\tp{2}): The data has been fully anonymized with the first name, last name, identifier, and age being anonymized. + \item \emph{level0} (\tp{0}): no anonymization; + \item \emph{level1} (\tp{1}): partial anonymization with only the first name and last name being anonymized; + \item \emph{level2} (\tp{2}): full anonymization with the first name, last name, identifier, and age being anonymized. \end{enumerate*} - \end{description} - \end{definition} + %\end{description} + + To conclude, access control policies $p_j$$\in$\P{i} annotating vertex \vi{i} in a pipeline template $G^{\myLambda,\myGamma}$ filter out those candidate services $s$$\in$$S^c$ that do not match data protection requirements. Specifically, each policy $p_j$$\in$\P{i} verifies whether a candidate service $s$$\in$$S^c$ for vertex \vi{i} is compatible with the data protection requirements in \P{i} (\myLambda(\vi{i})). Policy evaluation matches the profile \profile\ of a candidate service $s$$\in$$S^c$ against the policy conditions in each $p_j$$\in$\P{i}. If the credentials and declarations, defined as a set of attributes in the form (\emph{name}, \emph{value}), in the candidate service profile fail to meet the policy conditions, meaning that no policy $p_j$ evaluates to \emph{true}, the service is discarded; otherwise it is added to the set $S'$ of compatible services, which is used in Section~\ref{sec:instance} to generate the pipeline instance $G'$. No policy enforcement is done at this stage. - Access control policies $p_j$$\in$\P{i} annotating vertex \vi{i} in a pipeline template $G^{\myLambda,\myGamma}$ are used to filter out those candidate services $s$$\in$$S^c$ that do not match data protection requirements. Specifically, each policy $p_j$$\in$\P{i} is evaluated to verify whether a candidate service $s$$\in$$S^c$ for vertex \vi{i} is compatible with data protection requirements in \P{i} (\myLambda(\vi{i})).
Policy evaluation matches the profile \profile\ of candidate service $s$$\in$$S^c$ with the policy conditions in each $p_j$$\in$\P{i}. If the credentials and declarations, defined as a set of attributes in the form (\emph{name}, \emph{value}), in the candidate service profile fails to meet the policy conditions, meaning that no policies $p_j$ are evaluated to \emph{true}, the service is discarded; otherwise it is added to the set $S'$ of compatible service, which is used in Section~\ref{sec:instance} to generate the pipeline instance $G'$. No policy enforcement is done at this stage. \subsection{Functional Annotations}\label{sec:funcannotation} - A proper data management approach must track functional data manipulations across the entire pipeline execution, defining the functional requirements of each service operating on data. + Proper data management must track functional data manipulations across the entire pipeline execution, defining the functional requirements of each service operating on data. To this aim, each vertex \vi{i}$\in\V_S$ is annotated with a label \myGamma(\vi{i}), corresponding to the functional description $F_i$ of the service $s_i$ represented by \vi{i}. - $F_i$ describes the functional requirements on the corresponding service $s_i$, such as API, inputs, expected outputs. + $F_i$ describes the functional requirements on the corresponding service $s_i$, such as its API, inputs, and expected outputs. %The latter is modeled as a functional transformation function \TF\ that is applied to the data when executing service $s_i$. \TF\ has a twofold role: %\begin{enumerate}[label=\roman*)] % \item it contains the functional requirements that the service must satisfy, in terms of expected input, expected output, prototype and other functional aspects. - It also specifies a set \TF{} of data transformation functions \tf{i}, possibly triggered during execution of the connected service $s_i$.
- - Each $\tf{i}$$\in$$\TF{}$ can be of different types as follows: -\begin{enumerate*}[label=\textit{\roman*})] - \item an empty function \tf{\epsilon} that applies no transformation or processing on the data; - \item an additive function \tf{a} that expands the amount of data received, for example, by integrating data from other sources; - \item a transformation function \tf{t} that transforms some records in the dataset without altering the domain; - \item a transformation function \tf{d} (out of the scope of this work) that changes the domain of the data by applying, for instance, PCA or K-means. + It also specifies a set \TF{} of data transformation functions \tf{i}, which may be triggered during the execution of the corresponding service $s_i$. + + Function $\tf{i}$$\in$$\TF{}$ can be: + \begin{enumerate*}[label=\textit{\roman*})] + \item an empty function \tf{\epsilon} that applies no transformation or processing on the data; + \item an additive function \tf{a} that expands the amount of data received, for example, by integrating data from other sources; + \item a transformation function \tf{t} that transforms some records in the dataset without altering the domain; + \item a transformation function \tf{d} (out of the scope of this work) that changes the domain of the data by applying, for instance, PCA or K-means. \end{enumerate*} For simplicity but with no loss of generality, we assume that all candidate services meet functional annotation \F{} and that \TF{}=\tf{}. As a consequence, all candidate services apply the same transformation to data during execution. diff --git a/pipeline_template_example.tex b/pipeline_template_example.tex index 4f6c229..07ba72f 100644 --- a/pipeline_template_example.tex +++ b/pipeline_template_example.tex @@ -43,29 +43,28 @@ \subsection{Example}\label{sec:example_template} Data protection annotations \myLambda(\vi{1}), \myLambda(\vi{2}), \myLambda(\vi{3}) refer to policy \p{0} with an empty transformation \tp{0}. 
Functional requirements \F{1}, \F{2}, and \F{3} prescribe a URI as input and the corresponding dataset as output. -The second stage consists of vertex \vi{4}, -merging the three datasets obtained stage 1. Data protection annotation \myLambda(\vi{4}) refers to policies \p{1} and \p{2}, which apply different data transformations depending on the relation between the dataset and service owners. +The second stage consists of vertex \vi{4}, merging the three datasets obtained at the first stage. Data protection annotation \myLambda(\vi{4}) refers to policies \p{1} and \p{2}, which apply different data transformations depending on the relation between the dataset and the service owner. % 2° NODO % -If the service owner is also the dataset owner (\pone), the dataset is not anonymized (\tp{0}). We note that if the service owner has no partner relationship with the dataset owner, no policies apply. -If the service owner is a partner of the dataset owner (\ptwo), the dataset is anonymized at level $l_1$ (\tp{1}). +If the service owner is also the dataset owner (\pone), the dataset is not anonymized (\tp{0}). If the service owner is a partner of the dataset owner (\ptwo), the dataset is anonymized at level $l_1$ (\tp{1}). If the service owner has no partner relationship with the dataset owner, no policies apply. %if the service owner is neither the dataset owner nor a partner of the dataset owner (\pthree), the dataset is anonymized level2 (\tp{2}). Functional requirement \F{4} prescribes $n$ datasets as input and the merged dataset as output. % 3° NODO % The third stage consists of vertex \vi{5} for data analysis. -Data protection annotation \myLambda(\vi{5}) refers to policies \p{1} and \p{2}, as for stage 2. +Data protection annotation \myLambda(\vi{5}) refers to policies \p{1} and \p{2}, as for the second stage.
% The logic remains consistent: % if the service profile matches with the data owner (\pone), \p{1} matches and level0 anonymization is applied (\tp{0}); % if the service profile matches with a partner of the owner (\ptwo), \p{2} matches and level1 anonymization is applied (\tp{1}); % if the service profile doesn't match with a partner nor with the owner (\pthree), \p{3} matches and level2 anonymization is applied (\tp{2}). Functional requirement \F{5} prescribes a dataset as input and the results of the data analysis as output. % 4° NODO % The fourth stage consists of vertex \vi{6} for machine learning. Data protection annotation \myLambda(\vi{6}) refers to policy \p{4} with data transformation \tp{2}, that is, anonymization level $l_2$ to prevent personal identifiers from entering the machine learning algorithm/model. Functional requirement \F{6} prescribes a dataset as input, and the trained model and a set of inferences as output. % 5° NODO % -The fourth stage consists of vertex \vi{6}, managing data storage. Data protection annotation \myLambda(\vi{6}) refers to policies \p{5} and \p{6}, -which apply different data transformations depending on the relation between the dataset and service region. +The fifth stage consists of vertex \vi{7}, managing data storage. Data protection annotation \myLambda(\vi{7}) refers to policies \p{5} and \p{6}, which apply different data transformations depending on the relation between the dataset and the service region. If the service region is the dataset origin ($\langle$service\_region $=$ dataset\_origin$\rangle$) (\p{5}), the dataset is anonymized at level $l_1$ (\tp{1}). If the service region is a partner region ($\langle$service\_region $\in$ \{NY, NH\}$\rangle$) (\p{6}), the dataset is anonymized at level $l_2$ (\tp{2}). Functional requirement \F{7} prescribes a dataset as input and the URI of the stored data as output.
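The three anonymization levels \tp{0}, \tp{1}, \tp{2} used throughout this example can be sketched as simple record-level transformations. The Python sketch below is our own illustration: the field names stand in for the dataset of \cref{tab:dataset} and are hypothetical, not the paper's actual schema.

```python
# Sketch of the example's data transformations tp0/tp1/tp2
# (record field names are hypothetical, not the paper's actual schema).

def tp0(record):
    # level0: no anonymization is applied
    return dict(record)

def tp1(record):
    # level1: partial anonymization -- mask first name and last name only
    out = dict(record)
    out["first_name"] = "*"
    out["last_name"] = "*"
    return out

def tp2(record):
    # level2: full anonymization -- additionally mask identifier and age
    out = tp1(record)
    out["id"] = "*"
    out["age"] = "*"
    return out

row = {"id": "42", "first_name": "Ada", "last_name": "Lovelace",
       "age": "36", "region": "CT"}
print(tp1(row))  # names masked; id, age, and region untouched
print(tp2(row))  # only 'region' left in the clear
```

In the example above, which transformation is actually applied to a vertex's data depends on which policy in \myLambda(\vi{i}) matches the service profile (owner, partner, or region).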