
Commit

claudio
cardagna committed Jun 13, 2024
1 parent 4bec415 commit 3eaa14e
Showing 5 changed files with 27 additions and 129 deletions.
9 changes: 1 addition & 8 deletions introduction.tex
@@ -1,6 +1,5 @@
\section{Introduction}
The wide success and adoption of cloud infrastructures and their intrinsic multitenancy represent a paradigm shift in the big data scenario, redefining scalability and efficiency in data analytics. Multitenancy enables multiple users to share resources, such as computing power and storage, optimizing their utilization and reducing operational costs. Leveraging cloud infrastructure further enhances flexibility and scalability.
%allowing organizations to dynamically allocate resources based on demand while ensuring seamless access to cutting-edge data analytics tools and services.
%
The flip side of multitenancy is the increased complexity in data governance: the shared model introduces unique security challenges, as tenants may have different security requirements, access levels, and data sensitivity. Adequate measures such as encryption, access control mechanisms, and data anonymization techniques must be implemented to safeguard data against unauthorized access and ensure compliance with regulatory requirements such as GDPR or HIPAA.
%
@@ -13,25 +12,19 @@ \section{Introduction}
When evaluating a solution meeting these criteria, the following questions naturally arise:
\begin{enumerate}
\item How does a robust data protection policy affect analytics?
%What impact does a robust and strong data protection policy have on analytics?
%How does a strong data protection policy influence analytics?
%\item When considering a pipeline, is it more advantageous to perform data protection transformations at each step rather than filtering all data at the outset?
\item When considering a (big data) pipeline, should data protection be implemented at each pipeline step rather than filtering all data at the outset?
\item In a scenario where a user has the option to choose among various candidate services, how might these choices affect the analytics?
\end{enumerate}

Based on the aforementioned considerations, we propose a data governance framework for modern data-driven pipelines, designed to mitigate privacy and security risks. The primary objective of this framework is to support the selection and assembly of data processing services within the pipeline, focusing on services that maximize data quality while upholding privacy and security requirements.
To this aim, each element of the pipeline is \textit{annotated} with \emph{i)} data protection requirements expressing transformations on data and \emph{ii)} functional specifications on services expressing data manipulations carried out during each service execution.
%Each element in the pipeline, there is a catalog of candidate services among which the user running it can choose. Services may be functionally equivalent (i.e., they perform the same tasks), but have different security policies, more or less restrictive depending on the organization or service provider they belong to.
Though applicable to a generic scenario, our data governance approach starts from the assumption that maintaining a larger volume of data leads to higher data quality; as a consequence, its service selection algorithm focuses on maximizing data quality by retaining the maximum amount of information when applying data protection transformations.
%We have a running example on which we conducted experiments to provide an initial answer to the questions highlighted earlier.

The primary contributions of the paper can be summarized as follows:
\begin{enumerate*}
\item Defining a data governance framework supporting selection and assembly of data processing services enriched with meta-data that describes both data protection and functional requirements;
\item Defining a data governance framework supporting selection and assembly of data processing services enriched with metadata that describe both data protection and functional requirements;
\item Proposing a parametric heuristic tailored to address the computational complexity of the NP-hard service selection problem;
\item Evaluating the performance and quality of the algorithm through experiments conducted using the dataset from the running example.
\end{enumerate*}


The remainder of the paper is structured as follows: Section 2 presents our system model, illustrating a reference scenario where data is owned by multiple organizations. Section \ref{sec:template} introduces the pipeline template and describes data protection and functional annotations. Section \ref{sec:instance} describes the process of building a pipeline instance from a pipeline template according to service selection. Section \ref{sec:heuristics} introduces the quality metrics used in service selection and the heuristic solving the service selection problem. Section \ref{sec:experiment} presents our experimental results. Section \ref{sec:related} discusses the state of the art, and Section \ref{sec:conclusions} draws our concluding remarks.
12 changes: 2 additions & 10 deletions pipeline_instance.tex
@@ -1,5 +1,4 @@
\section{Pipeline Instance}\label{sec:instance}
%Given a set of candidate services, a
A \pipelineInstance $\iChartFunction$ instantiates a \pipelineTemplate \tChartFunction by selecting and composing services according to data protection and functional annotations in the template. It is formally defined as follows.
\vspace{0.5em}
\begin{definition}[Pipeline Instance]\label{def:instance}
@@ -14,7 +13,9 @@ \section{Pipeline Instance}\label{sec:instance}
\item $s'_i$ satisfies functional annotation \myGamma(\vi{i}) in \tChartFunction.
\end{enumerate}
\end{definition}

\vspace{0.5em}

Condition 1 requires that each selected service \sii{i} satisfies the policy requirements \P{i} of the corresponding vertex \vi{i} in the \pipelineTemplate, whereas Condition 2 is needed to preserve the process functionality, as it simply states that each service \sii{i} must satisfy the functional requirements \F{i} of the corresponding vertex \vi{i} in the \pipelineTemplate.

We then define a \emph{pipeline instantiation} function that takes as input a \pipelineTemplate \tChartFunction and a set $S^c$ of candidate services, split into a specific set of services $S^c_{i}$ for each vertex \vi{i}$\in$$\V_S$, and returns as output a \pipelineInstance \iChartFunction. Recall from Section~\ref{sec:funcannotation} that all candidate services meet the functional annotation in the template, meaning that Condition 2 in Definition~\ref{def:instance} is satisfied for all candidate services.
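
To make the instantiation step concrete, the following minimal Python sketch selects one compatible service per vertex. It is illustrative only: \texttt{satisfies} abstracts the policy check of Condition 1, all names are ours rather than part of the formal model, and the quality-driven choice among compatible services is deferred to Section~\ref{sec:heuristics}.

\begin{verbatim}
# Sketch of pipeline instantiation: for each vertex v_i, keep the candidates
# whose profile satisfies at least one policy in P_i (Condition 1, OR
# semantics); Condition 2 already holds for every candidate service.
def instantiate(vertices, candidates, policies, satisfies):
    instance = {}
    for v in vertices:
        compatible = [s for s in candidates[v]
                      if any(satisfies(s, p) for p in policies[v])]
        if not compatible:
            raise ValueError("no candidate satisfies the policies of " + v)
        instance[v] = compatible[0]  # any compatible choice yields a valid instance
    return instance
\end{verbatim}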
@@ -37,7 +38,6 @@ \section{Pipeline Instance}\label{sec:instance}
% vertexes
\node[draw, circle, fill,text=white,minimum size=1 ] (sr) at (0,0) {};

% \node[draw, circle] (node2) at (1,0) {$\s{1}$};
\node[draw, circle, plus,minimum size=1.5em] (plus) at (1.5,0) {};

\node[draw, circle] (s2) at (3.5,-2) {$\sii{1}$};
@@ -57,12 +57,10 @@ \section{Pipeline Instance}\label{sec:instance}
\node[above] at (s3.north) {\function{2}};
\node[above] at (s4.north) {\function{4}};
\node[above] at (s5.north) {\function{5}};
% \node[above] at (s6.north) {\function{}};
\node[above] at (s6.north) {\function{6}};
\node[above] at (s7.north) {\function{7}};
% Connection

% \draw[->] (node2) -- (node3);
\draw[->] (sr) -- (plus);
\draw[->] (plus) -- (s1);
\draw[->] (plus) -- (s2);
@@ -71,14 +69,8 @@ \section{Pipeline Instance}\label{sec:instance}
\draw[->] (s1) -- (s4);
\draw[->] (s2) -- (s4);
\draw[->] (s3) -- (s4);
% \draw[->] (node6) -- (node65);
% \draw[->] (node65) -- (node7);3
\draw[->] (s4) -- (s5);
\draw[->] (s5) -- (s6);
% \draw[->] (cross) -- (s5);
% \draw[->] (cross) -- (s6);
% \draw[->] (s5) -- (s7);
% \draw[->] (s6) -- (s7);
\draw[->] (s6) -- (s7);

\end{tikzpicture}
42 changes: 6 additions & 36 deletions pipeline_template.tex
@@ -1,6 +1,5 @@
\section{Pipeline Template}\label{sec:template}
Our approach integrates data protection and data management into the service pipeline using annotations.
To this aim, we extend the service pipeline in \cref{def:pipeline} with: \emph{i)} data protection annotations that also express transformations on data, ensuring compliance with data protection requirements, \emph{ii)} functional annotations to express data manipulations carried out during service execution.
Our approach integrates data protection and data management into the service pipeline using annotations. To this aim, we extend the service pipeline in \cref{def:pipeline} with: \emph{i)} data protection annotations that express transformations on data, ensuring compliance with data protection requirements, \emph{ii)} functional annotations that express data manipulations carried out during service execution.
These annotations enable advanced data lineage, tracking the entire data lifecycle by monitoring the changes that result from functional service execution and data protection requirements.

In the following, we first introduce the annotated service pipeline, called pipeline template (Section \ref{sec:templatedefinition}). We then present both functional annotations (Section \ref{sec:funcannotation}) and data protection annotations (Section \ref{sec:nonfuncannotation}), providing an example of a pipeline template in the context of the reference scenario.
Expand All @@ -11,7 +10,6 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition}
\item an annotation function \myLambda:$\V_S\rightarrow$\P{} that associates a set of data protection requirements, in the form of policies $p$$\in$\P{}, with each vertex \vi{i}$\in$$\V_S$;
\item an annotation function \myGamma:$\V_S\rightarrow$\F{} that associates a functional service description $F_i\in\F{}$ with each vertex \vi{i}$\in$$\V_S$.
\end{enumerate*}
%The policies will be intended to guide the enforcement of data protection while the data transformation function will characterize the functional aspect of each vertex.

The template is formally defined as follows.

@@ -27,18 +25,14 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition}

\vspace{0.5em}

We note that, at this stage, the template is not yet linked to any service. We also note that policies $p_j$$\in$\P{i} in \myLambda(\vi{i}) are combined using logical OR, meaning that the access decision is positive if at least one policy $p_j$ evaluates to \emph{true}.
%We also note that functional description $F_i$ includes the specific data transformation triggered as the result of a service execution.
The pipeline template of the service pipeline of \cref{fig:reference_scenario} is depicted in \cref{fig:service_composition_template}.
We note that, at this stage, the template is not yet linked to any service. We also note that policies $p_j$$\in$\P{i} in \myLambda(\vi{i}) are combined using logical OR, meaning that the access decision is positive if at least one policy $p_j$ evaluates to \emph{true}. The pipeline template of the service pipeline of \cref{fig:reference_scenario} is depicted in \cref{fig:service_composition_template}.
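
In symbols (a compact restatement of the OR combination just described, with $\mathit{eval}$ as informal shorthand for policy evaluation):
\[
\mathit{decision}(\vi{i}) \;=\; \bigvee_{p_j \in \P{i}} \mathit{eval}(p_j),
\]
that is, access at \vi{i} is granted as soon as a single policy in \myLambda(\vi{i}) evaluates to \emph{true}.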

%The next sections better explain the functional and non-functional transformation functions.
\begin{figure}[ht!]
\centering
\newcommand{\function}[1]{$\ensuremath{\myLambda_{#1},\myGamma_{#1}}$}
\begin{tikzpicture}[scale=0.9]
% vertexes
\node[draw, circle, fill,text=white,minimum size=1 ] (sr) at (0,0) {};
% \node[draw, circle] (node2) at (1,0) {$\s{1}$};
\node[draw, circle, plus,minimum size=1.5em] (plus) at (1.5,0) {};
\node[draw, circle] (s1) at (3,1.7) {$\vi{3}$};
\node[draw, circle] (s2) at (3,-1.7) {$\vi{1}$};
@@ -48,9 +42,6 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition}
\node[draw, circle] (s4) at (4.5,0) {$\vi{4}$};

\node[draw, circle] (s5) at (6,0) {$\vi{5}$};
% \node[draw, circle, cross,minimum size=1.5em] (cross) at (6,0) {};
%\node[draw, circle] (s5) at (7.5,1.2) {$\vi{5}$};
%\node[draw, circle] (s6) at (7.5,-1.2) {$\vi{6}$};

\node[draw, circle] (s7) at (7.5,0) {$\vi{6}$};
\node[draw, circle] (s8) at (9,0) {$\vi{7}$};
@@ -63,12 +54,10 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition}
\node[above] at (s3.north) {\function{2}};
\node[above] at (s4.north) {\function{4}};
\node[above] at (s5.north) {\function{5}};
% \node[above] at (s6.north) {\function{}};
\node[above] at (s7.north) {\function{6}};
\node[above] at (s8.north) {\function{7}};
% Connection

% \draw[->] (node2) -- (node3);
\draw[->] (sr) -- (plus);
\draw[->] (plus) -- (s1);
\draw[->] (plus) -- (s2);
@@ -77,15 +66,8 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition}
\draw[->] (s1) -- (s4);
\draw[->] (s2) -- (s4);
\draw[->] (s3) -- (s4);
% \draw[->] (node6) -- (node65);
% \draw[->] (node65) -- (node7);3
\draw[->] (s4) -- (s5);
\draw[->] (s5) -- (s7);
%\draw[->] (s4) -- (cross);
%\draw[->] (cross) -- (s5);
%\draw[->] (cross) -- (s6);
% \draw[->] (s5) -- (s7);
% \draw[->] (s6) -- (s7);
\draw[->] (s7) -- (s8);

\end{tikzpicture}
@@ -114,38 +96,26 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition}

More in detail, \textit{subject subj} specifies a service $s_i$ issuing an access request to perform an action on an object. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. For instance, (classifier$=$``SVM'') specifies a service providing an SVM classifier. We note that \textit{subj} can also specify conditions on the service owner (\textit{e.g.}, owner\_location$=$``EU'') and the service user (\textit{e.g.}, service\_user\_role$=$``DOC Director'').

%\item
\textit{Object obj} defines the data governed by the access policy. In this case, it is a set \{$pc_i$\} of \emph{Policy Conditions} on the object's attributes. %as defined in Definition \ref{def:policy_cond}.
%It can specify the \emph{type} of object, such as a file (e.g., a video, text file, image, etc.), a SQL or noSQL database, a table, a column, a row, or a cell of a table, or any other characteristics of the data.
For instance, \{(type$=$``dataset''), (region$=$CT)\} refers to an object of type dataset and whose region is Connecticut.
\textit{Object obj} defines the data governed by the access policy. In this case, it is a set \{$pc_i$\} of \emph{Policy Conditions} on the object's attributes.
For instance, \{(type$=$``dataset''), (region$=$``CT'')\} refers to an object of type dataset whose region is Connecticut.

%\item
\textit{Action act} specifies the operations that can be performed within a big data environment, from traditional atomic operations on databases (e.g., CRUD operations) to coarser-grained operations, such as an Apache Spark Directed Acyclic Graph (DAG), Hadoop MapReduce, an analytics function call, and an analytics pipeline.

%\item
\textit{Environment env} defines a set of conditions on contextual attributes, such as time of the day, location, IP address, risk level, weather condition, holiday/workday, and emergency. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. For instance, (time$=$``night'') refers to a policy that is applicable only at night.

%\item
\textit{Data Transformation \TP} defines a set of security and privacy-aware transformations on \textit{obj} that must be enforced before any access to data is given. Transformations focus on data protection, as well as on compliance with regulations and standards, in addition to simple format conversions. For instance, let us define three transformations that can be applied to the dataset in \cref{tab:dataset}, each performing a different level of anonymization (a minimal code sketch follows the list):
\begin{enumerate*}[label=\roman*)]
\item level \emph{l0} (\tp{0}): no anonymization;
\item level \emph{l1} (\tp{1}): partial anonymization with only first and last name being anonymized;
\item level \emph{l2} (\tp{2}): full anonymization with first name, last name, identifier and age being anonymized.
\end{enumerate*}
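
As a concrete rendering of these three levels, the sketch below masks the listed attributes of a record. The field names mirror the attributes mentioned above, while the masking primitive (replacement with \texttt{*}) is our assumption, since no concrete anonymization mechanism is fixed here:

\begin{verbatim}
# Illustrative sketch of transformations tp0-tp2 (anonymization levels l0-l2).
# Masking with '*' is an assumed primitive, not a prescribed mechanism.
LEVELS = {
    0: set(),                                             # l0: no anonymization
    1: {"first_name", "last_name"},                       # l1: partial
    2: {"first_name", "last_name", "identifier", "age"},  # l2: full
}

def anonymize(record, level):
    masked = LEVELS[level]
    return {k: ("*" if k in masked else v) for k, v in record.items()}

patient = {"first_name": "Ada", "last_name": "Lee",
           "identifier": "X1", "age": 36, "region": "CT"}
print(anonymize(patient, 1))  # l1: only first and last name are masked
\end{verbatim}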



Access control policies $p_j$$\in$\P{i} annotating a vertex \vi{i} in a pipeline template $G^{\myLambda,\myGamma}$ specify the data protection requirements that candidate service must fulfill to be selected in the pipeline instance. Section~\ref{sec:instance} describes the selection process and pipeline instance generation.

% To conclude, access control policies $p_j$$\in$\P{i} annotating vertex \vi{i} in a pipeline template $G^{\myLambda,\myGamma}$ filters out those candidate services $s$$\in$$S^c$ that do not match data protection requirements. Specifically, each policy $p_j$$\in$\P{i} verifies whether a candidate service $s$$\in$$S^c$ for vertex \vi{i} is compatible with data protection requirements in \myLambda(\vi{i}). Policy evaluation matches the profile \profile\ of a candidate service $s$$\in$$S^c$ with the policy conditions in each $p_j$$\in$\P{i}. If the credentials and declarations, defined as a set of attributes in the form (\emph{name}, \emph{value}), in the candidate service profile fails to meet the policy conditions, meaning that no policies $p_j$ evaluates to \emph{true}, the service is discarded; otherwise it is added to the set $S'$ of compatible service, which is used in Section~\ref{sec:instance} to generate the pipeline instance $G'$. No policy enforcement is done at this stage.
Access control policies $p_j$$\in$\P{i} annotating a vertex \vi{i} in a pipeline template $G^{\myLambda,\myGamma}$ specify the data protection requirements that a candidate service must fulfill to be selected in the pipeline instance. Section~\ref{sec:instance} describes the selection process and the pipeline instance generation.
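
For concreteness, a minimal sketch of this filtering step is given below, under the simplifying assumption that every policy condition is an equality test on profile attributes of the form (\emph{name}, \emph{value}); operators other than equality and the full structure of Definition~\ref{def:policy_cond} are abstracted away.

\begin{verbatim}
# Minimal sketch: AND among the conditions of one policy, OR across the
# policies annotating a vertex. Profiles are simplified to attribute maps.
def policy_holds(profile, conditions):
    return all(profile.get(name) == value for name, value in conditions)

def compatible(services, policies):
    return [s for s in services
            if any(policy_holds(s["profile"], p) for p in policies)]

policies = [[("owner_location", "EU")],
            [("service_user_role", "DOC Director")]]
services = [{"name": "svm-1",
             "profile": {"classifier": "SVM", "owner_location": "EU"}}]
print([s["name"] for s in compatible(services, policies)])  # ['svm-1']
\end{verbatim}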

\subsection{Functional Annotations}\label{sec:funcannotation}
A proper data management approach must track functional data manipulations across the entire pipeline execution, defining the functional requirements of each service operating on data.
To this aim, each vertex \vi{i}$\in\V_S$ is annotated with a label \myGamma(\vi{i}), corresponding to the functional description $F_i$ of the service $s_i$ represented by \vi{i}.
$F_i$ describes the functional requirements on the corresponding service $s_i$, such as API, inputs, expected outputs.
%The latter is modeled as a functional transformation function \TF\ that is applied to the data when executing service $s_i$. \TF\ has a twofold role:
%\begin{enumerate}[label=\roman*)]
% \item it contains the functional requirements that the service must satisfy, in terms of expected input, expected output, prototype and other functional aspects.
$F_i$ describes the functional requirements, such as API, inputs, expected outputs.
It also specifies a set \TF{} of data transformation functions \tf{i}, which can be triggered during the execution of the corresponding service $s_i$.
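
A possible, purely hypothetical shape for such a description, collecting the elements listed above, is:

\begin{verbatim}
# Hypothetical container for a functional description F_i: field names
# mirror the elements listed in the text (API, inputs, expected outputs,
# and the set TF of transformation functions tf_i), not a fixed notation.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class FunctionalDescription:
    api: str                                           # service entry point
    inputs: List[str]                                  # expected input data
    outputs: List[str]                                 # expected outputs
    tf: List[Callable] = field(default_factory=list)  # TF: transformations
\end{verbatim}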

Function $\tf{i}$$\in$$\TF{}$ can be:
