-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
0f4e9a4
commit 6d5aecb
Showing
623 changed files
with
76,183 additions
and
110,686 deletions.
There are no files selected for viewing
Binary file added
BIN
+475 KB
major review/Balancing Data Quality and Protection in Distributed Big Data Pipelines.pdf
Binary file not shown.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,129 @@ | ||
\section{Pipeline Template}\label{sec:template} | ||
Our approach integrates data protection and data management into the service pipeline using annotations. To this aim, we extend the service pipeline in \cref{def:pipeline} with: \emph{i)} data protection annotations that express transformations on data, ensuring compliance with data protection requirements, \emph{ii)} functional annotations that express data manipulations carried out during service execution. | ||
These annotations enable the implementation of an advanced data lineage, tracking the entire data lifecycle by monitoring changes that result from functional service execution and data protection requirements. | ||
|
||
In the following, we first introduce the annotated service pipeline, called pipeline template (Section \ref{sec:templatedefinition}). We then present both functional annotations (Section \ref{sec:funcannotation}) and data protection annotations (Section \ref{sec:nonfuncannotation}), providing an example of a pipeline template in the context of the reference scenario. | ||
|
||
\subsection{Pipeline Template Definition}\label{sec:templatedefinition} | ||
Given the service pipeline in Definition~\ref{def:pipeline}, we use annotations to express data protection requirements to be enforced on data and functional requirements on the services to be integrated in the pipeline. Each service vertex in the service pipeline is labeled with two mapping functions forming a pipeline template: | ||
\begin{enumerate*}[label=\textit{\roman*})] | ||
\item an annotation function \myLambda:$\V_S\rightarrow$\P{} that associates a set of data protection requirements, in the form of policies $p$$\in$\P{}, with each vertex \vi{i}$\in$$\V_S$; | ||
\item an annotation function \myGamma:$\V_S\rightarrow$\F{} that associates a functional service description $F_i\in\F{}$ with each vertex \vi{i}$\in$$\V_S$. | ||
\end{enumerate*} | ||
|
||
The template is formally defined as follows. | ||
|
||
\vspace{0.5em} | ||
|
||
\begin{definition}[Pipeline Template] \label{def:template} | ||
Given a service pipeline G(\V,\E), a pipeline template \tChartFunction is a direct acyclic graph extended with two annotation functions: | ||
\begin{enumerate}%[label=\textit{\roman*}] | ||
\item \emph{Data Protection Annotation} \myLambda that assigns a label \myLambda(\vi{i}) to each vertex $\vi{i}\in\V_S$. Label \myLambda(\vi{i}) corresponds to a set \P{i} of policies $p_j$ to be satisfied by service $s_i$ represented by \vi{i}; | ||
\item \emph{Functional Annotation} \myGamma that assigns a label \myGamma(\vi{i}) to each vertex $\vi{i}\in\V_S$. Label \myGamma(\vi{i}) corresponds to the functional description $F_i$ of service $s_i$ represented by \vi{i}. | ||
\end{enumerate} | ||
\end{definition} | ||
|
||
\vspace{0.5em} | ||
|
||
We note that, at this stage, the template is not yet linked to any service. We also note that policies $p_j$$\in$\P{i} in \myLambda(\vi{i}) are combined using logical OR, meaning that the access decision is positive if at least one policy $p_j$ evaluates to \emph{true}. The pipeline template of the service pipeline of \cref{fig:reference_scenario} is depicted in \cref{fig:service_composition_template}. | ||
|
||
\begin{figure}[ht!] | ||
\centering | ||
\newcommand{\function}[1]{$\ensuremath{\myLambda_{#1},\myGamma_{#1}}$} | ||
\begin{tikzpicture}[scale=0.9] | ||
% vertexes | ||
\node[draw, circle, fill,text=white,minimum size=1 ] (sr) at (0,0) {}; | ||
\node[draw, circle, plus,minimum size=1.5em] (plus) at (1.5,0) {}; | ||
\node[draw, circle] (s1) at (3,1.7) {$\vi{3}$}; | ||
\node[draw, circle] (s2) at (3,-1.7) {$\vi{1}$}; | ||
\node[draw, circle] (s3) at (3,0) {$\vi{2}$}; | ||
|
||
|
||
\node[draw, circle] (s4) at (4.5,0) {$\vi{4}$}; | ||
|
||
\node[draw, circle] (s5) at (6,0) {$\vi{5}$}; | ||
|
||
\node[draw, circle] (s7) at (7.5,0) {$\vi{6}$}; | ||
\node[draw, circle] (s8) at (9,0) {$\vi{7}$}; | ||
% Text on top | ||
\node[above] at (sr.north) {$\vi{r}$}; | ||
|
||
\node[above] at (s1.north) {\function{3}}; | ||
|
||
\node[above] at (s2.north) {\function{1}}; | ||
\node[above] at (s3.north) {\function{2}}; | ||
\node[above] at (s4.north) {\function{4}}; | ||
\node[above] at (s5.north) {\function{5}}; | ||
\node[above] at (s7.north) {\function{6}}; | ||
\node[above] at (s8.north) {\function{7}}; | ||
% Connection | ||
|
||
\draw[->] (sr) -- (plus); | ||
\draw[->] (plus) -- (s1); | ||
\draw[->] (plus) -- (s2); | ||
\draw[->] (plus) -- (s3); | ||
|
||
\draw[->] (s1) -- (s4); | ||
\draw[->] (s2) -- (s4); | ||
\draw[->] (s3) -- (s4); | ||
\draw[->] (s4) -- (s5); | ||
\draw[->] (s5) -- (s7); | ||
\draw[->] (s7) -- (s8); | ||
|
||
\end{tikzpicture} | ||
\caption{Pipeline Template} | ||
\label{fig:service_composition_template} | ||
\end{figure} | ||
|
||
\subsection{Data Protection Annotation}\label{sec:nonfuncannotation} | ||
Data Protection Annotation \myLambda\ expresses data protection requirements in the form of access control policies. We consider an attribute-based access control model that offers flexible fine-grained authorization and adapts its standard key components to address the unique characteristics of a big data environment. Access requirements are expressed in the form of policy conditions that are defined as follows. | ||
|
||
\vspace{0.5em} | ||
|
||
\begin{definition}[Policy Condition]\label{def:policy_cond} | ||
A \emph{Policy Condition pc} is a Boolean expression of the form $($\emph{attr\_name} op \emph{attr\_value}$)$, with op$\in$\{$<$,$>$,$=$,$\neq$,$\leq$,$\geq$\}, \emph{attr\_name} an attribute label, and \emph{attr\_value} the corresponding attribute value. | ||
\end{definition} | ||
|
||
\vspace{0.5em} | ||
|
||
Built on policy conditions, an access control policy is then defined as follows. | ||
\vspace{0.5em} | ||
\begin{definition}[Policy]\label{def:policy_rule} | ||
A {\it policy p}$\in$\P{} is 5-uple $<$\textit{subj}, \textit{obj}, \textit{act}, \textit{env}, \textit{\TP}$>$ that specifies who (\emph{subject}) can access what (\emph{object}) with action (\emph{action}), in a specific context (\emph{environment}) and under specific obligations (\emph{data transformation}). | ||
\end{definition} | ||
|
||
\vspace{0.5em} | ||
|
||
More in detail, \textit{subject subj} specifies a service $s_i$ issuing an access request to perform an action on an object. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. For instance, (classifier$=$``SVM'') specifies a service providing a SVM classifier. We note that \textit{subj} can also specify conditions on the service owner (\textit{e.g.}, owner\_location$=$``EU'') and the service user (\textit{e.g.}, service\_user\_role$=$``DOC Director''). | ||
|
||
\textit{Object obj} defines the data governed by the access policy. In this case, it is a set \{$pc_i$\} of \emph{Policy Conditions} on the object's attributes. | ||
For instance, \{(type$=$``dataset''), (region$=$``CT'')\} refers to an object of type dataset and whose region is Connecticut. | ||
|
||
\textit{Action act} specifies the operations that can be performed within a big data environment, from traditional atomic operations on databases (e.g., CRUD operations) to coarser operations, such as an Apache Spark Direct Acyclic Graph (DAG), Hadoop MapReduce, an analytics function call, and an analytics pipeline. | ||
|
||
\textit{Environment env} defines a set of conditions on contextual attributes, such as time of the day, location, IP address, risk level, weather condition, holiday/workday, and emergency. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. For instance, (time$=$``night") refers to a policy that is applicable only at night. | ||
|
||
\textit{Data Transformation \TP} defines a set of security and privacy-aware transformations on \textit{obj} that must be enforced before any access to data is given. Transformations focus on data protection, as well as on compliance to regulations and standards, in addition to simple format conversions. For instance, let us define three transformations that can be applied to the dataset in \cref{tab:dataset}, each performing different levels of anonymization: | ||
\begin{enumerate*}[label=\roman*)] | ||
\item level \emph{l0} (\tp{0}): no anonymization; | ||
\item level \emph{l1} (\tp{1}): partial anonymization with only first and last name being anonymized; | ||
\item level \emph{l2} (\tp{2}): full anonymization with first name, last name, identifier and age being anonymized. | ||
\end{enumerate*} | ||
|
||
Access control policies $p_j$$\in$\P{i} annotating a vertex \vi{i} in a pipeline template $G^{\myLambda,\myGamma}$ specify the data protection requirements that a candidate service must fulfill to be selected in the pipeline instance. Section~\ref{sec:instance} describes the selection process and the pipeline instance generation. | ||
|
||
\subsection{Functional Annotations}\label{sec:funcannotation} | ||
A proper data management approach must track functional data manipulations across the entire pipeline execution, defining the functional requirements of each service operating on data. | ||
To this aim, each vertex \vi{i}$\in\V_S$ is annotated with a label \myGamma(\vi{i}), corresponding to the functional description $F_i$ of the service $s_i$ represented by \vi{i}. | ||
$F_i$ describes the functional requirements, such as API, inputs, expected outputs. | ||
It also specifies a set \TF{} of data transformation functions \tf{i}, which can be triggered during the execution of the corresponding service $s_i$. | ||
|
||
Function $\tf{i}$$\in$$\TF{}$ can be: | ||
\begin{enumerate*}[label=\textit{\roman*})] | ||
\item an empty function \tf{\epsilon} that applies no transformation or processing on the data; | ||
\item an additive function \tf{a} that expands the amount of data received, for example, by integrating data from other sources; | ||
\item a transformation function \tf{t} that transforms some records in the dataset without altering the domain; | ||
\item a transformation function \tf{d} (out of the scope of this work) that changes the domain of the data by applying, for instance, the PCA. | ||
\end{enumerate*} | ||
|
||
For simplicity but with no loss of generality, we assume that all candidate services meet functional annotation \F{} and that \TF{}=\tf{}. As a consequence, all candidate services apply the same transformation to the data during the pipeline execution. |
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
Oops, something went wrong.