From d9ec7f5b73d52f7909c7a8456f56a2bcadd00d34 Mon Sep 17 00:00:00 2001
From: Claudio Ardagna
Date: Mon, 29 Apr 2024 12:25:21 +0200
Subject: [PATCH] Section 3 - Claudio

---
 pipeline_template.tex | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/pipeline_template.tex b/pipeline_template.tex
index 9147dc3..ca4acb8 100644
--- a/pipeline_template.tex
+++ b/pipeline_template.tex
@@ -1,7 +1,7 @@
\section{Pipeline Template}\label{sec:template}
Our approach integrates data protection and data management into the service pipeline using annotations. To this end, we extend the service pipeline in \cref{def:pipeline} with: \emph{i)} data protection annotations to express transformations on data, ensuring compliance with data protection requirements, and \emph{ii)} functional annotations to express data manipulations carried out during service execution.
-These annotations enable the implementation of advanced data lineage, tracking the entire data lifecycle by monitoring changes that result from functional service execution and data protection requirements.
+These annotations enable the implementation of an advanced data lineage, tracking the entire data lifecycle by monitoring changes that result from functional service execution and data protection requirements.
In the following, we first introduce the annotated service pipeline, called pipeline template (Section \ref{sec:templatedefinition}). We then present functional annotations (Section \ref{sec:funcannotation}) and data protection annotations (Section \ref{sec:nonfuncannotation}). We finally provide an example of a pipeline template (Section \ref{sec:example_template}).
@@ -99,24 +99,24 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition}
Building on policy conditions, an access control policy is then defined as follows.

\begin{definition}[Policy]\label{def:policy_rule}
-	A {\it policy p}$\in$\P{} is 5-uple $<$\textit{subj}, \textit{obj}, \textit{act}, \textit{env}, \textit{\TP}$>$, which specifies who (\emph{subject}) can access what (\emph{object}) with action (\emph{action}), in a specific context (\emph{environment}) and under specific obligations (\emph{data transformation}).
+	A {\it policy p}$\in$\P{} is a 5-tuple $<$\textit{subj}, \textit{obj}, \textit{act}, \textit{env}, \textit{\TP}$>$ that specifies who (\emph{subject}) can access what (\emph{object}) with which action (\emph{action}), in a specific context (\emph{environment}) and under specific obligations (\emph{data transformation}).
\end{definition}

In more detail, \textit{subject subj} defines a service $s_i$ issuing an access request to perform an action on an object. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. For instance, $<$\{(classifier $=$ "SVM")\}$>$ specifies a service providing an SVM classifier. We note that \textit{subj} can also specify conditions on the service owner ($<$\{(owner\_location $=$ "EU")\}$>$) and the service user ($<$\emph{service},\{(service\_user\_role $=$ "DOC Director")\}$>$).
%\item
-	\textit{Object obj} defines any data whose access is governed by the policy. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}.
+	\textit{Object obj} defines those data whose access is governed by the policy. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}.
%It can specify the \emph{type} of object, such as a file (e.g., a video, text file, image, etc.), a SQL or noSQL database, a table, a column, a row, or a cell of a table, or any other characteristics of the data.
For instance, $<$\{(type $=$ "dataset")\}, \{(region $=$ CT)\}$>$ refers to an object of type dataset whose region is Connecticut.
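For concreteness, the following sketch shows one possible encoding of policy conditions and of the 5-tuple in Definition~\ref{def:policy_rule}. It is purely illustrative: class, field, and attribute names are our own, and only the equality operator used in the examples above is modeled.

\begin{verbatim}
from dataclasses import dataclass

# Hypothetical encoding of a policy condition and of the 5-tuple
# <subj, obj, act, env, TP>; names are illustrative only.
@dataclass(frozen=True)
class PolicyCondition:
    attribute: str   # e.g., "classifier", "owner_location", "type"
    operator: str    # comparison operator; only "=" is shown here
    value: str

@dataclass
class Policy:
    subj: frozenset[PolicyCondition]  # who: conditions on the service
    obj: frozenset[PolicyCondition]   # what: conditions on the data
    act: str                          # action, e.g., "read"
    env: frozenset[PolicyCondition]   # context, e.g., time of day
    tp: list[str]                     # obligations: transformation labels

# Example: a service providing an SVM classifier may read datasets
# at night, provided the "level1" transformation (defined below)
# is enforced first.
p = Policy(
    subj=frozenset({PolicyCondition("classifier", "=", "SVM")}),
    obj=frozenset({PolicyCondition("type", "=", "dataset")}),
    act="read",
    env=frozenset({PolicyCondition("time", "=", "night")}),
    tp=["level1"],
)
\end{verbatim}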
%\item
-	\textit{Action act} defines any operations that can be performed within a big data environment, from traditional atomic operations on databases (e.g., CRUD operations) to coarser operations, such as an Apache Spark Direct Acyclic Graph (DAG), Hadoop MapReduce, an analytics function call, an analytics pipeline.
+	\textit{Action act} defines those operations that can be performed within a big data environment, from traditional atomic operations on databases (e.g., CRUD operations) to coarser operations, such as an Apache Spark Directed Acyclic Graph (DAG), a Hadoop MapReduce job, an analytics function call, and an analytics pipeline.
%\item
\textit{Environment env} defines a set of conditions on contextual attributes, such as time of day, location, IP address, risk level, weather condition, holiday/workday, or emergency. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. For instance, $<$\{(time $=$ "night")\}$>$ refers to a policy that is applicable only at night.
%\item
-	\textit{Data Transformation \TP} defines a set of security and privacy-aware transformations on \textit{obj}, which must be enforced before any access to data. Transformations focus on data protection, as well as compliance to regulations and standards, in addition to simple format conversions. For instance, let us define three transformations that can be applied to \cref{tab:dataset}:
+	\textit{Data Transformation \TP} defines a set of security- and privacy-aware transformations on \textit{obj}, which must be enforced before any access to data. Transformations focus on data protection, as well as compliance with regulations and standards, in addition to simple format conversions. For instance, let us define three transformations that can be applied to the dataset in \cref{tab:dataset}:
\begin{enumerate*}[label=\roman*)]
\item \emph{level0} (\tp{0}): no anonymization;
\item \emph{level1} (\tp{1}): partial anonymization with only the first name and last name being anonymized;
\item \emph{level2} (\tp{2}): full anonymization, with all identifying attributes being anonymized.
\end{enumerate*}
@@ -148,24 +148,24 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition}
%
%\end{description}
-	To conclude, access control policies $p_j$$\in$\P{i} annotating vertex \vi{i} in a pipeline template $G^{\myLambda,\myGamma}$ filters out those candidate services $s$$\in$$S^c$ that do not match data protection requirements. Specifically, each policy $p_j$$\in$\P{i} verifies whether a candidate service $s$$\in$$S^c$ for vertex \vi{i} is compatible with data protection requirements in \P{i} (\myLambda(\vi{i})). Policy evaluation matches the profile \profile\ of a candidate service $s$$\in$$S^c$ with the policy conditions in each $p_j$$\in$\P{i}. If the credentials and declarations, defined as a set of attributes in the form (\emph{name}, \emph{value}), in the candidate service profile fails to meet the policy conditions, meaning that no policies $p_j$ evaluates to \emph{true}, the service is discarded; otherwise it is added to the set $S'$ of compatible service, which is used in Section~\ref{sec:instance} to generate the pipeline instance $G'$. No policy enforcement is done at this stage.
+	To conclude, access control policies $p_j$$\in$\P{i} annotating vertex \vi{i} in a pipeline template $G^{\myLambda,\myGamma}$ filter out those candidate services $s$$\in$$S^c$ that do not match the data protection requirements. Specifically, each policy $p_j$$\in$\P{i} verifies whether a candidate service $s$$\in$$S^c$ for vertex \vi{i} is compatible with the data protection requirements in \myLambda(\vi{i}). Policy evaluation matches the profile \profile\ of a candidate service $s$$\in$$S^c$ against the policy conditions in each $p_j$$\in$\P{i}. If the credentials and declarations, defined as a set of attributes in the form (\emph{name}, \emph{value}), in the candidate service profile fail to meet the policy conditions, meaning that no policy $p_j$ evaluates to \emph{true}, the service is discarded; otherwise, it is added to the set $S'$ of compatible services, which is used in Section~\ref{sec:instance} to generate the pipeline instance $G'$. No policy enforcement is performed at this stage.
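The matching step just described can be sketched as follows, reusing the hypothetical \texttt{Policy} encoding above; a service profile is modeled as a dictionary of (\emph{name}, \emph{value}) attributes, and only subject conditions with the equality operator are evaluated, object and environment conditions being handled analogously.

\begin{verbatim}
# Continues the hypothetical Policy/PolicyCondition sketch above.
def condition_holds(cond, profile):
    # A condition holds if the profile attribute exists and satisfies
    # the comparison (only "=" is handled in this sketch).
    return cond.operator == "=" and profile.get(cond.attribute) == cond.value

def policy_matches(policy, profile):
    # A policy evaluates to true when every one of its subject
    # conditions is met by the candidate service's profile.
    return all(condition_holds(c, profile) for c in policy.subj)

def filter_candidates(candidates, policies):
    # Build S' from S^c: keep a candidate service if at least one
    # policy evaluates to true for its profile; each candidate is a
    # dict carrying its credentials and declarations under "profile".
    # No enforcement is performed at this stage.
    return [s for s in candidates
            if any(policy_matches(p, s["profile"]) for p in policies)]
\end{verbatim}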
\subsection{Functional Annotations}\label{sec:funcannotation}
-	A proper data management must track functional data manipulations across the entire pipeline execution, defining the functional requirements of each service operating on data.
+	A proper data management approach must track functional data manipulations across the entire pipeline execution, defining the functional requirements of each service operating on data.
To this end, each vertex \vi{i}$\in\V_S$ is annotated with a label \myGamma(\vi{i}), corresponding to the functional description $F_i$ of the service $s_i$ represented by \vi{i}.
-	$F_i$ describes the functional requirements on the corresponding service $s_i$, such as API, inputs, expected outputs.
+	$F_i$ describes the functional requirements on the corresponding service $s_i$, such as its API, inputs, and expected outputs.
%The latter is modeled as a functional transformation function \TF\ that is applied to the data when executing service $s_i$. \TF\ has a twofold role:
%\begin{enumerate}[label=\roman*)]
% \item it contains the functional requirements that the service must satisfy, in terms of expected input, expected output, prototype and other functional aspects.
-	It also specifies a set \TF{} of data transformation functions \tf{i}, which may be triggered during the execution of the corresponding service $s_i$.
+	It also specifies a set \TF{} of data transformation functions \tf{i}, which can be triggered during the execution of the corresponding service $s_i$.
A function $\tf{i}$$\in$$\TF{}$ can be one of the following, as illustrated by the sketch after the list:
-\begin{enumerate*}[label=\textit{\roman*})]
- \item an empty function \tf{\epsilon} that applies no transformation or processing on the data;
- \item an additive function \tf{a} that expands the amount of data received, for example, by integrating data from other sources;
- \item a transformation function \tf{t} that transforms some records in the dataset without altering the domain;
- \item a transformation function \tf{d} (out of the scope of this work) that changes the domain of the data by applying, for instance, PCA or K-means.
-\end{enumerate*}
-
-For simplicity but with no loss of generality, we assume that all candidate services meet functional annotation \F{} and that \TF{}=\tf{}. As a consequence, all candidate services apply the same transformation to data during execution.
+ \begin{enumerate*}[label=\textit{\roman*})]
+ \item an empty function \tf{\epsilon} that applies no transformation or processing to the data;
+ \item an additive function \tf{a} that expands the amount of data received, for example, by integrating data from other sources;
+ \item a transformation function \tf{t} that transforms some records in the dataset without altering the domain;
+ \item a transformation function \tf{d} (outside the scope of this work) that changes the domain of the data by applying, for instance, PCA or K-means.
+ \end{enumerate*}
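These four kinds of functions can be illustrated with the following minimal sketch, where a dataset is modeled as a list of records (dictionaries); the attribute names and function bodies are placeholders of our own choosing.

\begin{verbatim}
import copy

def tf_empty(records):
    # tf_epsilon: no transformation or processing of the data
    return records

def tf_additive(records, other_source):
    # tf_a: expands the amount of data received, e.g., by
    # integrating records coming from another source
    return records + list(other_source)

def tf_transform(records):
    # tf_t: transforms some records without altering the domain,
    # here by normalizing a (hypothetical) "city" attribute
    out = copy.deepcopy(records)
    for r in out:
        r["city"] = r["city"].strip().upper()
    return out

# tf_d (domain-changing, e.g., PCA or K-means) is outside the
# scope of this work and is therefore omitted.
\end{verbatim}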
+
+For simplicity, but without loss of generality, we assume that all candidate services meet the functional annotation \F{} and that \TF{}=\{\tf{}\}. As a consequence, all candidate services apply the same transformation to data during the pipeline execution.
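Finally, the three protection levels used as obligations above can be made concrete as follows. Since \cref{tab:dataset} is not reproduced here, the field names are assumptions chosen to match the description of \tp{1}, and masking stands in for a real anonymization primitive.

\begin{verbatim}
def tp0(record):
    # level0: no anonymization
    return dict(record)

def tp1(record):
    # level1: partial anonymization -- only first and last name
    out = dict(record)
    out["first_name"] = "*****"
    out["last_name"] = "*****"
    return out

def tp2(record):
    # level2: full anonymization -- here, every attribute is masked
    # (the exact set of identifying attributes is an assumption)
    return {key: "*****" for key in record}

row = {"first_name": "Ada", "last_name": "Lovelace", "city": "Hartford"}
print(tp1(row))
# {'first_name': '*****', 'last_name': '*****', 'city': 'Hartford'}
\end{verbatim}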