
Commit

fixed up to and including Section 4
cb-unimi committed May 5, 2024
1 parent f92ee30 commit c5940c0
Showing 6 changed files with 80 additions and 79 deletions.
33 changes: 21 additions & 12 deletions introduction.tex
@@ -7,21 +7,30 @@ \section{Introduction}
As a consequence, achieving a balance between data protection and data quality is crucial, as the removal or alteration of personally identifiable information from datasets to safeguard individuals' privacy can compromise the accuracy of analytics results.

So far, research efforts have largely concentrated on exploring these two issues separately: on the one hand, the concept of data quality, encompassing accuracy, reliability, and suitability, has been investigated to understand its implications in analytical contexts. Although extensively studied, such investigations often prioritize enhancing the quality of source data rather than ensuring data quality throughout the entire processing pipeline, or the integrity of outcomes derived from the data. On the other hand, there is a focus on data privacy and security, entailing the protection of confidential information and adherence to rigorous privacy regulations.
There are very few solutions that find a good balance between them, since doing so requires a holistic approach that integrates technological solutions, organizational policies, and ongoing monitoring and adaptation to emerging threats and regulatory changes. A valid solution should implement robust access control mechanisms, ensuring that only authorized users can access specific datasets or analytical tools.
Additionally, data protection requirements should be identified at each stage of the data lifecycle, potentially incorporating techniques like data masking and anonymization to safeguard sensitive information by substituting it with realistic but fictional data, thereby preserving data privacy while enabling analysis. An ideal solution should prioritize data lineage, fostering a comprehensive understanding and optimization of data flows and transformations within complex analytical ecosystems.

To this aim, we propose a data governance framework tailored to contemporary data-driven pipelines, which aims to limit the privacy and security risks. The primary objective of this framework is to facilitate the assembly of data processing services, with a central focus on the selection of those services that optimize data quality, while upholding privacy and security requirements.
A valid solution requires a holistic approach that integrates technological solutions, organizational policies, and ongoing monitoring and adaptation to emerging threats and regulatory changes. The implementation of robust access control mechanisms, ensuring that only authorized users can access specific datasets or analytical tools, is just an essential initial step.
Indeed, we identified some additional requirements. First, data protection requirements should be identified at each stage of the data lifecycle, potentially incorporating techniques like data masking and anonymization to safeguard sensitive information, thereby preserving data privacy while enabling sharing and analysis. Second, data lineage should be prioritized, fostering a comprehensive understanding and optimization of data flows and transformations within complex analytical ecosystems.

Each service in the pipeline is tagged with \textit{annotations} to specify data protection requirements expressing transformations on data to enforce data protection, as well as functional specifications on services expressing data manipulations carried out during service execution. There is a catalog of services among which a user can choose. Services may be functionally equivalent (i.e., they perform the same task), but have different security policies, more or less restrictive\st{, depending on the provider's attributes}. Thus, a user has a strong interest in understanding how their service selection choices impact the quality of the final result. Our service selection approach focuses on maximizing data quality by retaining the maximum amount of information when applying data protection transformations;


The key contributions are as follows:
When evaluating a solution meeting these criteria, the following questions naturally arise:
\begin{enumerate}
\item Service enrichment (metadata)
\item The composite selection problem is NP-hard, but we present a parametric heuristic tailored to address the computational
complexity. We evaluated the performance and quality of the algorithm by running some experiments on a dataset.
\item How does a robust data protection policy affect analytics?
%What impact does a robust and strong data protection policy have on analytics?
%How does a strong data protection policy influence analytics?
\item When considering a pipeline, is it more advantageous to perform data protection transformations at each step rather than filtering all data at the outset?
\item In a scenario where a user has the option to choose among various candidate services, how might these choices affect the analytics?
\end{enumerate}
%
Based on the aforementioned considerations, we propose a data governance framework customized for modern data-driven pipelines, designed to mitigate privacy and security risks. The primary objective of this framework is to facilitate the assembly of data processing services, with a central focus on the selection of those services that optimize data quality, while upholding privacy and security requirements.

The rest of the paper is organized as follows. %In Section \ref{sec:requirements} In Section \ref{} In Section \ref{}
In our solution, each element in the pipeline is tagged with \textit{annotations} to specify data protection requirements expressing transformations on data to enforce data protection, as well as functional specifications on services expressing data manipulations carried out during service execution. For each element in the pipeline, there is a catalog of candidate services among which the user running it can choose. Services may be functionally equivalent (i.e., they perform the same tasks), but have different security policies, more or less restrictive depending on the organization or service provider they belong to.
Our data governance approach is based on the belief that maintaining a larger volume of data leads to higher data quality; thus, our service selection approach focuses on maximizing data quality by retaining the maximum amount of information when applying data protection transformations.
We introduce a running example, on which we conducted experiments to provide an initial answer to the questions highlighted earlier.
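As an informal illustration of this idea (and not the actual algorithm of Section \ref{sec:heuristics}), the Python sketch below picks, among functionally equivalent candidate services, the one whose data protection transformation retains the most information; the service names, the retention scores, and the pick_service helper are hypothetical and not part of our framework.

# Illustrative sketch (hypothetical names and scores): among functionally
# equivalent candidates, keep the one whose transformation retains the most data.
CANDIDATES = [
    {"service": "s1", "transformation": "level0"},   # no anonymization
    {"service": "s2", "transformation": "level1"},   # partial anonymization
    {"service": "s3", "transformation": "level2"},   # full anonymization
]

# Assumed fraction of information surviving each transformation.
RETENTION = {"level0": 1.0, "level1": 0.8, "level2": 0.5}

def pick_service(candidates, allowed):
    """Return the allowed candidate whose transformation keeps the most information."""
    usable = [c for c in candidates if c["service"] in allowed]
    return max(usable, key=lambda c: RETENTION[c["transformation"]])

# Example: suppose the data protection annotations ruled out s1 (no anonymization).
print(pick_service(CANDIDATES, allowed={"s2", "s3"}))   # -> s2 (level1)

The quality metrics and the heuristic introduced in Section \ref{sec:heuristics} generalize this per-vertex intuition to entire pipelines.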

ecosystem of services
The primary contributions of the paper can be summarized as follows:
\begin{enumerate}
\item Enriching services with meta-data that describes both data protection and functional requirements;
\item Proposing a parametric heuristic tailored to address the computational complexity of the NP-hard service selection problem (an illustrative sketch follows this list);
\item Evaluating the performance and quality of the algorithm through experiments conducted using the dataset from the running example.
\end{enumerate}
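To give a concrete, if simplified, idea of how a parametric heuristic can trade optimality for tractability, the Python sketch below implements a generic sliding-window greedy strategy, with the window size w as the parameter. It is purely illustrative: the function names, the quality scores, and the windowing strategy itself are assumptions, not the heuristic presented in Section \ref{sec:heuristics}.

from itertools import product

def window_greedy(vertices, catalog, quality, w=2):
    """Illustrative parametric selection: fix the choices one window of size w
    at a time, keeping the combination with the best cumulative quality."""
    chosen = {}
    for start in range(0, len(vertices), w):
        window = vertices[start:start + w]
        best_combo, best_q = None, float("-inf")
        for combo in product(*(catalog[v] for v in window)):
            q = sum(quality(v, s) for v, s in zip(window, combo))
            if q > best_q:
                best_combo, best_q = combo, q
        chosen.update(dict(zip(window, best_combo)))
    return chosen

# Hypothetical usage: three vertices, each with its own candidate services.
vertices = ["v1", "v2", "v3"]
catalog = {"v1": ["s11", "s12"], "v2": ["s21", "s22"], "v3": ["s31"]}
scores = {"s11": 0.9, "s12": 0.5, "s21": 0.7, "s22": 0.6, "s31": 0.4}
print(window_greedy(vertices, catalog, lambda v, s: scores[s], w=2))

In this sketch, setting w to the pipeline length degenerates into exhaustive search, while w = 1 yields a purely local greedy choice.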
%
The remainder of the paper is structured as follows. Section 2 presents our system model, illustrating a reference scenario where data is owned by multiple organizations. In Section \ref{sec:template}, we introduce the pipeline template and describe the annotations. In Section \ref{sec:instance}, we describe the process of template instantiation. In Section \ref{sec:heuristics}, we introduce the quality metrics used by the selection algorithm and the heuristic solving the service selection problem. Section \ref{sec:experiment} outlines the experiments conducted. Section \ref{sec:related} discusses the existing solutions. Section \ref{sec:conclusions} concludes the paper.
3 changes: 2 additions & 1 deletion main.tex
@@ -109,7 +109,8 @@
\input{experiment}
\input{related}

\section{Conclusions}
\section{Conclusions}\label{sec:conclusions}

\clearpage
%\bibliographystyle{spbasic} % basic style, author-year citations
\bibliographystyle{spmpsci} % mathematics and physical sciences
2 changes: 1 addition & 1 deletion pipeline_instance.tex
@@ -7,7 +7,7 @@ \section{Pipeline Instance}\label{sec:instance}
\begin{enumerate*}[label=\textit{\roman*})]
\item $v'_r$$=$$v_r$;
\item for each vertex $\vi{f}$ modeling a parallel structure, there exists a corresponding vertex $\vii{f}$;
\item for each $\vi{i}$$\in$$\V_S$ annotated with policy \P{i} (label \myLambda(\vi{i})) and functional description $F_i$(label \myGamma(\vi{i})), there exists a corresponding vertex \vii{i}$\in$$\Vp_S$ instantiated with a service \sii{i}, such that:
\item for each $\vi{i}$$\in$$\V_S$ annotated with policy \P{i} (label \myLambda(\vi{i})) and functional description $F_i$ (label \myGamma(\vi{i})), there exists a corresponding vertex \vii{i}$\in$$\Vp_S$ instantiated with a service \sii{i}, such that:
\end{enumerate*}
\begin{enumerate}[label=\arabic*)]
\item $s'_i$ satisfies data protection annotation \myLambda(\vi{i}) in \tChartFunction;
22 changes: 12 additions & 10 deletions pipeline_template.tex
@@ -1,9 +1,9 @@
\section{Pipeline Template}\label{sec:template}
Our approach integrates data protection and data management into the service pipeline using annotations.
To this aim, we extend the service pipeline in \cref{def:pipeline} with: \emph{i)} data protection annotations to express transformations on data, ensuring compliance with data protection requirements, \emph{ii)} functional annotations to express data manipulations carried out during services execution.
To this aim, we extend the service pipeline in \cref{def:pipeline} with: \emph{i)} data protection annotations that also express transformations on data, ensuring compliance with data protection requirements, \emph{ii)} functional annotations to express data manipulations carried out during services execution.
These annotations enable the implementation of an advanced data lineage, tracking the entire data lifecycle by monitoring changes that result from functional service execution and data protection requirements.

In the following, we first introduce the annotated service pipeline, called pipeline template (Section \ref{sec:templatedefinition}). We then present functional annotations (Section \ref{sec:funcannotation}) and data protection annotations (Section \ref{sec:nonfuncannotation}). We finally provide an example of a pipeline template (Section \ref{sec:example_template}).
In the following, we first introduce the annotated service pipeline, called pipeline template (Section \ref{sec:templatedefinition}). Then, we present both functional annotations (Section \ref{sec:funcannotation}) and data protection annotations (Section \ref{sec:nonfuncannotation}), providing an example of a pipeline template in the context of the reference scenario.


\subsection{Pipeline Template Definition}\label{sec:templatedefinition}
@@ -17,16 +17,16 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition}
The template is formally defined as follows.

\begin{definition}[Pipeline Template] \label{def:template}
Given a service pipeline G(\V,\E), a pipeline template $G^{\myLambda,\myGamma}$(V,E,\myLambda,\myGamma) is a directed acyclic graph extended with two annotation functions:
Given a service pipeline G(\V,\E), a pipeline template \tChartFunction is a directed acyclic graph extended with two annotation functions:
\begin{enumerate}%[label=\textit{\roman*}]
\item \emph{Data Protection Annotation} \myLambda that assigns a label \myLambda(\vi{i}) to each vertex $\vi{i}\in\V_S$. Label \myLambda(\vi{i}) corresponds to a set \P{i} of policies $p_j$ to be satisfied by service $s_i$ represented by \vi{i}, ;
\item \emph{Data Protection Annotation} \myLambda that assigns a label \myLambda(\vi{i}) to each vertex $\vi{i}\in\V_S$. Label \myLambda(\vi{i}) corresponds to a set \P{i} of policies $p_j$ to be satisfied by service $s_i$ represented by \vi{i};
\item \emph{Functional Annotation} \myGamma that assigns a label \myGamma(\vi{i}) to each vertex $\vi{i}\in\V_S$. Label \myGamma(\vi{i}) corresponds to the functional description $F_i$ of service $s_i$ represented by \vi{i}.
\end{enumerate}
\end{definition}

We note that, at this stage, the template is not yet linked to any service. We also note that policies $p_j$$\in$\P{i} in \myLambda(\vi{i}) are combined using logical OR, meaning that the access decision is positive if at least one policy $p_j$ evaluates to \emph{true}.
%We also note that functional description $F_i$ includes the specific data transformation triggered as the result of a service execution.
An example of pipeline template is depicted in \cref{fig:service_composition_template}
The pipeline template of the service pipeline of \cref{fig:reference_scenario} is depicted in \cref{fig:service_composition_template}.
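Purely as an illustration of Definition \ref{def:template}, the following Python sketch mirrors the structure of a pipeline template and the OR semantics of the data protection annotation; all class and field names are hypothetical and do not belong to our framework.

from dataclasses import dataclass, field

@dataclass
class Vertex:
    name: str
    policies: list = field(default_factory=list)   # \myLambda(v_i): the set P_i of policies p_j
    functional: str = ""                           # \myGamma(v_i): the functional description F_i

@dataclass
class PipelineTemplate:
    vertices: dict = field(default_factory=dict)   # vertex name -> Vertex
    edges: list = field(default_factory=list)      # (from, to) pairs forming a DAG

def access_decision(vertex, profile, evaluate):
    # Policies in a label are OR-combined: the decision is positive
    # if at least one policy evaluates to True for the given service profile.
    return any(evaluate(p, profile) for p in vertex.policies)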

%The next sections better explain the functional and non-functional transformation functions.
\begin{figure}[ht!]
@@ -102,7 +102,7 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition}
A {\it policy p}$\in$\P{} is a 5-tuple $<$\textit{subj}, \textit{obj}, \textit{act}, \textit{env}, \textit{\TP}$>$ that specifies who (\emph{subject}) can access what (\emph{object}) with which action (\emph{action}), in a specific context (\emph{environment}) and under specific obligations (\emph{data transformation}).
\end{definition}

More in detail, \textit{subject subj} defines a service $s_i$ issuing an access request to perform an action on an object. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. For instance, $<$\{(classifier $=$ "SVM")\}$>$ specifies a service providing a SVM classifier. We note that \textit{subj} can also specify conditions on the service owner ($<$\{(owner\_location $=$ "EU")\}$>$) and the service user ($<$\emph{service},\{(service\_user\_role $=$ "DOC Director")\}$>$).
More in detail, \textit{subject subj} specifies a service $s_i$ issuing an access request to perform an action on an object. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. For instance, $<$\{(classifier $=$ "SVM")\}$>$ specifies a service providing a SVM classifier. We note that \textit{subj} can also specify conditions on the service owner ($<$\{(owner\_location $=$ "EU")\}$>$) and the service user ($<$\emph{service},\{(service\_user\_role $=$ "DOC Director")\}$>$).

%\item
\textit{Object obj} defines those data whose access is governed by the policy. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}.
@@ -116,11 +116,11 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition}
\textit{Environment env} defines a set of conditions on contextual attributes, such as time of the day, location, IP address, risk level, weather condition, holiday/workday, emergency. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. For instance, $<$\{(time $=$ "night")\}$>$ refers to a policy that is applicable only at night.

%\item
\textit{Data Transformation \TP} defines a set of security and privacy-aware transformations on \textit{obj}, which must be enforced before any access to data. Transformations focus on data protection, as well as compliance to regulations and standards, in addition to simple format conversions. For instance, let us define three transformations that can be applied to the dataset in \cref{tab:dataset}:
\textit{Data Transformation \TP} defines a set of security and privacy-aware transformations on \textit{obj} that must be enforced before any access to data is given. Transformations focus on data protection, as well as on compliance to regulations and standards, in addition to simple format conversions. For instance, let us define three transformations that can be applied to the dataset in \cref{tab:dataset}:
\begin{enumerate*}[label=\roman*)]
\item \emph{level0} (\tp{0}): no anonymization;
\item \emph{level1} (\tp{1}): partial anonymization with only the first name and last name being anonymized;
\item \emph{level2} (\tp{2}): full anonymization with the first name, last name, identifier, and age being anonymized.
\item \emph{level1} (\tp{1}): partial anonymization with only first and last name being anonymized;
\item \emph{level2} (\tp{2}): full anonymization with first name, last name, identifier and age being anonymized.
\end{enumerate*}
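For illustration only, the five policy components and the three transformation levels above can be emulated as follows in Python; the field names, attribute values, and sample record are fabricated and are not taken from the dataset in \cref{tab:dataset}.

# Hypothetical encoding of a policy <subj, obj, act, env, T_P> and of the
# anonymization levels level0/level1/level2 described above.
POLICY = {
    "subject":        {"classifier": "SVM", "owner_location": "EU"},
    "object":         {"dataset": "patients"},
    "action":         "read",
    "environment":    {"time": "night"},
    "transformation": "level2",
}

FIELDS_TO_MASK = {
    "level0": [],                                                # no anonymization
    "level1": ["first_name", "last_name"],                       # partial anonymization
    "level2": ["first_name", "last_name", "identifier", "age"],  # full anonymization
}

def anonymize(record, level):
    """Mask the fields required by the given transformation level."""
    masked = dict(record)
    for f in FIELDS_TO_MASK[level]:
        masked[f] = "*"
    return masked

record = {"first_name": "Ada", "last_name": "Rossi", "identifier": "X91", "age": 44}
print(anonymize(record, POLICY["transformation"]))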
%\end{description}

@@ -148,7 +148,9 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition}
% %\end{description}


To conclude, access control policies $p_j$$\in$\P{i} annotating vertex \vi{i} in a pipeline template $G^{\myLambda,\myGamma}$ filter out those candidate services $s$$\in$$S^c$ that do not match data protection requirements. Specifically, each policy $p_j$$\in$\P{i} verifies whether a candidate service $s$$\in$$S^c$ for vertex \vi{i} is compatible with data protection requirements in \myLambda(\vi{i}). Policy evaluation matches the profile \profile\ of a candidate service $s$$\in$$S^c$ with the policy conditions in each $p_j$$\in$\P{i}. If the credentials and declarations, defined as a set of attributes in the form (\emph{name}, \emph{value}), in the candidate service profile fail to meet the policy conditions, meaning that no policy $p_j$ evaluates to \emph{true}, the service is discarded; otherwise it is added to the set $S'$ of compatible services, which is used in Section~\ref{sec:instance} to generate the pipeline instance $G'$. No policy enforcement is done at this stage.
\textcolor{red}{Access control policies $p_j$$\in$\P{i} annotating a vertex \vi{i} in a pipeline template $G^{\myLambda,\myGamma}$ specify the data protection requirements that a candidate service must fulfill in order to be selected in the pipeline instance. In Section~\ref{sec:instance}, we describe the selection process.}

% To conclude, access control policies $p_j$$\in$\P{i} annotating vertex \vi{i} in a pipeline template $G^{\myLambda,\myGamma}$ filters out those candidate services $s$$\in$$S^c$ that do not match data protection requirements. Specifically, each policy $p_j$$\in$\P{i} verifies whether a candidate service $s$$\in$$S^c$ for vertex \vi{i} is compatible with data protection requirements in \myLambda(\vi{i}). Policy evaluation matches the profile \profile\ of a candidate service $s$$\in$$S^c$ with the policy conditions in each $p_j$$\in$\P{i}. If the credentials and declarations, defined as a set of attributes in the form (\emph{name}, \emph{value}), in the candidate service profile fails to meet the policy conditions, meaning that no policies $p_j$ evaluates to \emph{true}, the service is discarded; otherwise it is added to the set $S'$ of compatible service, which is used in Section~\ref{sec:instance} to generate the pipeline instance $G'$. No policy enforcement is done at this stage.
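The filtering step recalled above (and detailed in Section~\ref{sec:instance}) can be sketched as follows; the Python fragment matches hypothetical candidate service profiles, given as (\emph{name}, \emph{value}) attributes, against policy conditions and keeps the services for which at least one policy holds. Only equality conditions are shown and all names are illustrative; as in the text, no policy enforcement is performed at this stage.

def satisfies(policy_conditions, profile):
    # True if every (name, value) condition of the policy appears in the profile.
    return all(profile.get(name) == value for name, value in policy_conditions.items())

def compatible_services(policies, candidates):
    # OR semantics: keep candidates whose profile satisfies at least one policy.
    return [s for s in candidates
            if any(satisfies(p, s["profile"]) for p in policies)]

policies = [{"owner_location": "EU"}, {"service_user_role": "DOC Director"}]
candidates = [
    {"name": "s1", "profile": {"owner_location": "EU"}},
    {"name": "s2", "profile": {"owner_location": "US"}},
]
print([s["name"] for s in compatible_services(policies, candidates)])   # -> ['s1']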

\subsection{Functional Annotations}\label{sec:funcannotation}
A proper data management approach must track functional data manipulations across the entire pipeline execution, defining the functional requirements of each service operating on data.
